Zero4D: Training-Free 4D Video Generation From Single Video
Using Off-the-Shelf Video Diffusion Model
Abstract
Recently, multi-view or 4D video generation has emerged as a significant research topic. Nonetheless, recent approaches to 4D generation still struggle with fundamental limitations, as they primarily rely on harnessing multiple video diffusion models with additional training or compute-intensive training of a full 4D diffusion model with limited real-world 4D data and large computational costs. To address these challenges, here we propose the first training-free 4D video generation method that leverages the off-the-shelf video diffusion models to generate multi-view videos from a single input video. Our approach consists of two key steps: (1) By designating the edge frames in the spatio-temporal sampling grid as key frames, we first synthesize them using a video diffusion model, leveraging a depth-based warping technique for guidance. This approach ensures structural consistency across the generated frames, preserving spatial and temporal coherence. (2) We then interpolate the remaining frames using a video diffusion model, constructing a fully populated and temporally coherent sampling grid while preserving spatial and temporal consistency. Through this approach, we extend a single video into a multi-view video along novel camera trajectories while maintaining spatio-temporal consistency. Our method is training-free and fully utilizes an off-the-shelf video diffusion model, offering a practical and effective solution for multi-view video generation.
![[Uncaptioned image]](/html/2503.22622v1/extracted/6318804/Images/teaser.jpg)
1 Introduction
Since the introduction of the diffusion and foundation models [9, 25, 40], 3D reconstruction has advanced significantly, leading to unprecedented progress in representing the real world in 3D models. Combined with generative models, this success drives a renaissance in 3D generation, enabling more diverse and realistic content creation. These advancements extend beyond static scene or object reconstruction and generation, evolving toward dynamic 3D reconstruction and generation that aims to capture the real world.
Prior works [4, 47, 27, 50, 2] leverage video diffusion models and Score Distillation Sampling (SDS) to enable dynamic 3D generation. However, most of these approaches focus on generating dynamic objects on blank backgrounds (e.g., text-to-4D generation), leaving open the challenge of reconstructing or generating real-world scenes from a text condition, reference images, or a video. Moreover, unlike 3D and video data, for which high-quality datasets are abundant, the extensive 4D datasets required for such models remain scarce. Therefore, a fundamental challenge in training 4D generation models for real-world scenes is the lack of large-scale, multi-view synchronized video datasets.
To overcome these limitations, recent works have explored several training strategies. 4DiM [37] jointly trains a diffusion model on 3D and video data together with a scarce 4D dataset. CAT4D [39] trains multi-view video diffusion models on a curated collection of synthetic 4D data, 3D datasets, and monocular video sources. DimensionX [28] trains spatial and temporal diffusion models independently with multiple LoRAs and obtains multi-view videos through an additional refinement process. Despite these efforts, the scarcity of high-quality 4D data makes it difficult to generalize to complex real-world scenes and poses fundamental challenges in training large multi-view video models.
To address these challenges, we introduce a novel zero-shot framework called Zero4D, short for Zero-shot 4D video generator, which generates multi-view synchronized 4D video from a single monocular video by leveraging an off-the-shelf video diffusion model [5] without any training. Building upon prior observations [31, 39] that a 4D video is composed of multiple video frames arranged along the spatio-temporal sampling grid (i.e., camera view and time axes), generating a 4D video can be regarded as populating the sampling grid with consistent spatio-temporal frames. Our approach achieves this through two key steps: (1) We first designate the edge frames in the sampling grid as key frames and synthesize them. Specifically, we employ a depth-based warping technique as guidance in a video diffusion model, ensuring that the generated frames adhere to the underlying scene structure. (2) Inspired by ViBiDSampler [42], we then leverage the interpolation capabilities of a video model to fill in the remaining frames through bidirectional diffusion sampling, ensuring a fully populated and temporally coherent 4D grid. During these steps, our method imposes both spatial and temporal consistency across the grid. Our main contributions can be summarized as follows:
- We propose a novel framework that can generate 4D video from a single video via an off-the-shelf video diffusion model without any training or large-scale datasets. To the best of our knowledge, our approach is the first training-free method to generate synchronized multi-view video.
- We introduce a synchronization mechanism to ensure high-quality generation while preserving global spatio-temporal consistency. This is achieved through alternating bidirectional video interpolation sampling along both the camera and temporal axes, effectively aligning motion and appearance across frames.
- Our approach is computationally efficient, enabling high-quality multi-view video generation with minimal memory consumption, making it significantly more accessible than previous methods.
2 Related Work

Dynamic 3D reconstruction. Numerous studies have explored methods for representing the dynamic nature of the real world in 3D models. DyNeRF [18] introduced 4D novel view synthesis using multi-view video inputs. Building on this foundation, subsequent research has focused on 4D reconstruction using posed multi-view inputs [38, 43, 12, 44]. However, all these methods require a large number of multi-view images with known camera parameters, which poses a significant challenge. With the success of DUSt3R [35], which learns 3D representations in a self-supervised manner from unposed images without requiring camera parameters, many subsequent models have attempted to reconstruct real-world dynamic scenes based on this foundational approach. MonST3R [49], built on DUSt3R, enables the reconstruction of both dynamic point clouds and camera poses from unposed single-frame video inputs. Shape-of-Motion [33] attempts 4D reconstruction by tracking points from dynamic motion in single-video inputs without camera parameters. CUT3R [34] reconstructs dynamic scenes from single-video inputs by leveraging state updates of visual tokens within a ViT encoder.
Video generation with camera control.
Several studies train multi-view diffusion models for spatially consistent image generation [26, 32, 20, 14, 7, 22]. CamCo [41] fine-tunes a pre-trained video diffusion model by injecting Plücker embedding vectors into a specific layer of the model. ReCapture [48] generates a video along a novel camera trajectory from a single reference video while keeping its existing scene motion. It trains multiple LoRA layers on a video model with camera parameter labels to regenerate the anchor video as a natural and temporally consistent novel-view video. AC3D [3] enhances Video DiT [10] with ControlNet-based camera conditioning, keeping the large VDiT backbone frozen for video synthesis while using lightweight modules to inject camera information. CameraCtrl [8] proposes a plug-and-play camera module for video diffusion models that controls video generation with precise and smooth camera viewpoints.
4D generation. Recent advancements in text-to-4D generation have been driven by numerous pioneering works exploring various conditioning methods. Among these, several approaches have leveraged score distillation sampling in conjunction with video diffusion models or multi-view image diffusion models to generate 4D content from text prompts [4, 47, 27, 50, 2]. However, these approaches largely focus on generating dynamic objects in blank backgrounds. Generating full 4D scenes under given constraints has recently emerged as an important research direction with limited prior work. A notable example is [39], which synthesizes 4D videos conditioned on multiple input modalities using a multi-view video model. They extend an image diffusion model into a multi-view video diffusion framework, trained on a carefully curated spatio-temporal dataset. To ensure consistency across viewpoints, they employ an alternating sampling strategy. Similarly, [30] introduces a framework for novel view synthesis of dynamic 4D scenes from a single video. This method is trained on synthetic multi-view video data with corresponding camera poses, enabling high-fidelity 4D reconstructions. Concurrently, [46] propose text-to-4D scene generation pipelines that integrate video diffusion models with canonical 3D Gaussian Splatting (3DGS) [17], ensuring spatio-temporal consistency in the generated 4D outputs. Furthermore, [31] enhances video diffusion models by introducing a parallel camera-temporal token stream and a learnable synchronization layer, which effectively fuses independent tokens to maintain camera and temporal consistency across generated frames.
3 Zero4D
Let $x[i,j] \in \mathbb{R}^{H \times W \times 3}$ denote the image at the $i$-th camera viewpoint and the $j$-th temporal frame, where $H$ and $W$ denote the height and width of the image, respectively (see Fig. 2(a)). The input video captured from a single camera viewpoint is then denoted as $x[1,:]$, whereas the multi-view images at temporal frame $j$ are represented by $x[:,j]$. The goal of Zero4D is to populate the spatio-temporal sampling grid (or camera-time grid) $x[:,:]$ by generating frames across multiple camera poses.
As illustrated in Fig. 2, the overall reconstruction pipeline of Zero4D is composed of two steps: 1) key frame generation and 2) spatio-temporal bidirectional interpolation along the time and camera axes in an alternating manner. In this section, we describe each in detail.
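To make the layout of the camera-time grid concrete, the following minimal sketch marks which cells are key frames and which are left for interpolation. It assumes that rows index cameras and columns index time (so the "rightmost column" is the set of novel views at the last frame) and uses the 25 × 25 grid size from Section 4; the function name `key_frame_mask` is hypothetical.

```python
import numpy as np

def key_frame_mask(num_cameras: int, num_frames: int) -> np.ndarray:
    """Boolean (camera, time) grid; True marks border cells, i.e. key frames."""
    mask = np.zeros((num_cameras, num_frames), dtype=bool)
    mask[0, :] = True    # input video: first camera, all time steps
    mask[:, 0] = True    # novel views of the first frame
    mask[-1, :] = True   # end-view video: last camera, all time steps
    mask[:, -1] = True   # novel views of the last frame
    return mask

mask = key_frame_mask(25, 25)
num_key = int(mask.sum())          # cells produced by key frame generation
num_interior = int((~mask).sum())  # cells left for bidirectional interpolation
print(num_key, num_interior)
```

Under these assumptions, only the border of the grid is produced in the first step, and the interior is filled in by the second step.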
3.1 Key Frame Generation
As shown in Fig. 2(a), the key frame generation is achieved through three successive steps. Specifically, given a fixed-viewpoint grayscale input video denoted by , we first perform novel view synthesis, which is followed by end-view video frame generation. These two steps are achieved through diffusion sampling, guided by warped views. Finally, we complete the rightmost column using diffusion-based interpolation sampling.
Novel view synthesis (a2). At the first stage of key frame generation, we synthesize a novel view from the first frame using the I2V diffusion model. We incorporate the warped frames as guidance to ensure the generated novel views align with the warped images from input video warping.
The warped frames is computed as follows. Given an input video , we generate novel views by first estimating a per-frame depth map using a monocular depth estimation model [23]. This depth information enables depth-based geometric warping, where each frame of the input video is unprojected into 3D space and reprojected into a target viewpoint in where defines the desired set of camera views. This produces the warped frames:
(1) |
for , where is the intrinsic camera matrix. The warping function unprojects each pixel using its estimated depth and reprojects it into the target view. Formally, for each pixel location in the -view, the warped pixel location in the novel view at the -th camera location is computed as:
(2) |
where is the transformation from the input to the novel view, and is the depth at . Since may not align exactly with integer pixel locations, interpolation is applied to assign pixel values.
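The unproject-and-reproject step described above can be sketched for a single pixel as follows. This is only an illustration under a pinhole camera model with intrinsics shared by both views; the names `reproject_pixel`, `K`, `T`, and the numeric values are assumptions made for the example, not the authors' implementation.

```python
import numpy as np

def reproject_pixel(p, depth, K, T):
    """Map pixel p = (u, v) with the given depth from the input view to a novel view.

    K: 3x3 intrinsic matrix (assumed identical for both views).
    T: 4x4 rigid transform from input-camera to novel-camera coordinates.
    """
    u, v = p
    ray = np.linalg.inv(K) @ np.array([u, v, 1.0])  # unproject to a ray with z = 1
    point_src = depth * ray                          # 3D point in the input camera frame
    point_tgt = (T @ np.append(point_src, 1.0))[:3]  # same point in the novel camera frame
    uvw = K @ point_tgt                              # project with the intrinsics
    return uvw[:2] / uvw[2]                          # perspective divide

# Illustrative values: focal length 500 px, principal point (320, 240).
K = np.array([[500.0, 0.0, 320.0],
              [0.0, 500.0, 240.0],
              [0.0, 0.0, 1.0]])
T = np.eye(4)
T[0, 3] = 0.1  # the point shifts 0.1 units along x in the novel camera frame

p_new = reproject_pixel((320.0, 240.0), depth=2.0, K=K, T=T)
print(p_new)  # [345. 240.]
```

The returned location is generally non-integer, which is why an interpolation step is needed when assigning pixel values; pixels that receive no source sample form the missing regions discussed next.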
However, missing regions (e.g., occlusions from depth-based projection) often appear in . To address this, we utilize a video diffusion model parameterized by to inpaint the missing regions and ensure consistency within the 4D video grid. This can be considered as conditional sampling under the condition of the warped image, occlusion mask and the input video conditioning. For the case of novel view synthesis at the temporal frame index , this corresponds to
(3) |
where corresponds to the conditional distribution from the trained diffusion model, is an occlusion mask that identifies missing pixels, and provides the conditioning input from . The specific details of conditional video diffusion sampling will be described in Section 3.3.
End view video generation (a3). Similarly, we can synthesize the end view video from the generated view utilizing warp-guided diffusion sampling.
(4) |
This process follows the same video sampling approach as first-frame novel view synthesis; however, it differs in that it synthesizes the video from the final camera position.
End frame novel view synthesis (a4). Finally, we generate video at the end-frame novel view , which constitutes the rightmost column of the 4D grid in Fig. 2(a). Given that we already have from the input video and the synthesized end-view frame derived from , we incorporate both images to enhance consistency. To this end, we employ the idea of ViBiDSampler [42] - the state-of-the-art video interpolation method that allows for the simultaneous conditioning on both and . As described in Algorithm 1, the main innovation from ViBiDSampler is that we additionally incorporate the conditions from the warped image and its mask. Accordingly, we synthesize the last column using bidirectional video diffusion sampling:
(5) | ||||
where denotes the one-step bidirectional video interpolation using the extended version of ViBiDSampler. The final novel-view frame is obtained iteratively by applying over diffusion time steps
3.2 Spatio-Temporal Bidirectional Interpolation
As shown in Fig. 2(b), once the keyframes are generated, the remaining task is to fill in the missing sampling grid at the center so the final resulting 4D video remains consistent across both the camera and time axes. Accordingly, it is essential to perform conditioned sampling using the key frames and adjacent frames from the camera and temporal axes. However, a naive image-to-video diffusion model can only condition on a single or two end frames.
To address this challenge, inspired by ViBiDSampler [42], we propose a novel Spatio-Temporal Bidirectional Interpolation (STBI), which enables simultaneous conditioning on both the camera and time axes. The key idea is alternating the aforementioned bidirectional sampling along both camera and time axes so that the overall diffusion sampling trajectory is driven to satisfy multiple conditions from the key frames. More details follow.
Camera axis interpolation. Starting from the initial noise (line 9 in algorithm 2), we then select a specific frame in the 4D grid (a column) , and perform an interpolation denoising process using the edge-frame conditions and :
(6) |
In this process, the image condition is applied first, along with the warped view to guide the diffusion denoising step. The video is then perturbed with noise again, flipped along the camera axis, and subjected to another diffusion denoising step using as the condition. Through these two conditioning steps, integrates information from both and , enabling interpolation-based denoising that preserves consistency across the camera axis. Before proceeding with time axis interpolation, we apply a re-noising step, to ensure smooth transitions across generated frames.
Time axis interpolation. After ensuring spatial consistency across the camera axis, we interpolate frames along the time axis to maintain temporal coherence. For each row in the 4D grid, we perform an interpolation denoising step (7) using the start and end frame conditions and .
(7) |
Initially, is applied along with the warped view to guide the diffusion denoising step. The frame is then perturbed with noise, flipped along the time axis, and another diffusion denoising step is performed using as the condition. Through this bidirectional conditioning process, effectively integrates information from both and , facilitating interpolation-based denoising that ensures smooth transitions along the time axis.
Throughout the diffusion sampling process, we perform denoising by alternating interpolation along the camera axis and the time axis. This approach maintains global coherence while ensuring consistency in multi-view video generation. The overall sampling procedure of Zero4D is summarized in Algorithm 2 in Appendix Section 6.
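The control flow of the alternating scheme can be illustrated with a toy example. The sketch below is not the authors' algorithm: `toy_denoiser` is a hypothetical stand-in for a conditional I2V denoiser, frames are short feature vectors rather than images, warped-view guidance is omitted, and clamping the border cells to the key frames after each step is an assumption of this sketch. It only shows the order of operations: a bidirectional pass along the camera axis, a re-noising step, and a bidirectional pass along the time axis at each noise level.

```python
import numpy as np

def euler_step(x, x0_hat, sigma, sigma_next):
    """One Euler update from noise level sigma to sigma_next."""
    return x0_hat + (sigma_next / sigma) * (x - x0_hat)

def toy_denoiser(seq, cond):
    """Hypothetical placeholder: pulls every frame halfway toward the condition frame."""
    return 0.5 * seq + 0.5 * cond

def bidirectional_step(seq, start, end, sigma, sigma_next, rng):
    """Denoise conditioned on `start`, re-noise, flip, denoise conditioned on `end`."""
    seq = euler_step(seq, toy_denoiser(seq, start), sigma, sigma_next)
    # Re-noise from sigma_next back to sigma before the reversed pass.
    seq = seq + np.sqrt(sigma**2 - sigma_next**2) * rng.standard_normal(seq.shape)
    seq = seq[::-1]  # flip so that `end` plays the role of the first frame
    seq = euler_step(seq, toy_denoiser(seq, end), sigma, sigma_next)
    return seq[::-1]  # flip back to the original order

def stbi_sample(key, sigmas, rng):
    """Alternate camera-axis and time-axis passes over a (camera, time, dim) grid."""
    n_cam, n_time, _ = key.shape
    border = np.zeros((n_cam, n_time), dtype=bool)
    border[[0, -1], :] = True
    border[:, [0, -1]] = True
    x = sigmas[0] * rng.standard_normal(key.shape)  # start from pure noise
    for sigma, sigma_next in zip(sigmas[:-1], sigmas[1:]):
        for j in range(n_time):  # camera axis: one column per time index
            x[:, j] = bidirectional_step(x[:, j], key[0, j], key[-1, j],
                                         sigma, sigma_next, rng)
        # Re-noise to level sigma before switching to the time axis.
        x = x + np.sqrt(sigma**2 - sigma_next**2) * rng.standard_normal(x.shape)
        for i in range(n_cam):  # time axis: one row per camera index
            x[i] = bidirectional_step(x[i], key[i, 0], key[i, -1],
                                      sigma, sigma_next, rng)
        x[border] = key[border]  # keep key frames fixed (assumption of this sketch)
    return x

rng = np.random.default_rng(0)
key = rng.standard_normal((4, 5, 3))  # only the border cells are used as conditions
out = stbi_sample(key, np.linspace(1.0, 0.0, 6), rng)
print(out.shape)
```

Because every interior cell is updated once as part of a column and once as part of a row at each noise level, it receives conditioning from key frames on both axes, which is the property the alternating design relies on.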
3.3 Details of Conditional Video Diffusion
In this work, we build upon Stable Video Diffusion (SVD) [5], an image-to-video diffusion model that follows the principles of the EDM framework [15]. SVD utilizes an iterative denoising approach based on an Euler step method, which progressively transforms a Gaussian noise sample $x_T$ into a clean signal $x_0$:
$$x_{t-1} = \hat{x}_\theta(x_t) + \frac{\sigma_{t-1}}{\sigma_t}\bigl(x_t - \hat{x}_\theta(x_t)\bigr) \tag{8}$$
where the initial noise is $x_T \sim \mathcal{N}(0, \sigma_T^2 I)$, $\hat{x}_\theta(x_t)$ is the denoised estimate by Tweedie’s formula using the score function trained by the neural network parameterized by $\theta$, and $\sigma_t$ is the discretized noise level for each timestep $t$.
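As a rough illustration of the Euler sampler described above, the sketch below runs the update over a decreasing noise schedule. The schedule follows the form proposed in the EDM paper [15]; the parameter values (`sigma_min`, `sigma_max`, `rho`) are that paper's illustrative defaults and are not confirmed as the settings used here, and the "oracle" denoiser is a hypothetical stand-in for the trained network.

```python
import numpy as np

def karras_sigmas(n, sigma_min=0.002, sigma_max=80.0, rho=7.0):
    """Decreasing EDM-style noise levels, with a final 0 appended."""
    ramp = np.linspace(0.0, 1.0, n)
    inv_rho_max = sigma_max ** (1.0 / rho)
    inv_rho_min = sigma_min ** (1.0 / rho)
    sigmas = (inv_rho_max + ramp * (inv_rho_min - inv_rho_max)) ** rho
    return np.append(sigmas, 0.0)

def euler_sample(denoise, shape, sigmas, rng):
    """Iterate x <- x0_hat + (sigma_next / sigma) * (x - x0_hat)."""
    x = sigmas[0] * rng.standard_normal(shape)  # initial Gaussian noise
    for sigma, sigma_next in zip(sigmas[:-1], sigmas[1:]):
        x0_hat = denoise(x, sigma)              # denoised estimate at this level
        x = x0_hat + (sigma_next / sigma) * (x - x0_hat)
    return x

# Hypothetical oracle denoiser that always predicts a fixed clean signal.
target = np.arange(6, dtype=float).reshape(2, 3)
sigmas = karras_sigmas(10)
sample = euler_sample(lambda x, sigma: target, target.shape, sigmas,
                      np.random.default_rng(0))
print(np.allclose(sample, target))
```

Each step moves the sample toward the current denoised estimate and keeps a fraction `sigma_next / sigma` of the remaining residual, so the last step (to noise level 0) returns the estimate itself.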

Now, we describe how to modify SVD to enable conditional sampling under the condition on the warped image , the occlusion mask and the conditioning input . For convenience, we refer to as . From the formulation of the reverse diffusion sampling process in (8), The reverse diffusion process can be modulated by conditioning on a known scene-prior , as proposed in [21]:
(9) |
where is a mask that determines which parts of the scene are known, guiding the denoising process by preserving the known regions while allowing the diffusion model to inpaint the missing areas. In our approach, rather than relying on an externally defined scene-prior , we leverage the warped frames obtained from depth-based warping as the conditional guidance. Specifically, we redefine the denoising process by replacing with and substituting with the occlusion mask :
(10) |
Here, the occlusion mask ensures that the visible regions in directly guide the denoising process, while the unseen parts are inpainted using the learned prior. By incorporating this modified formulation into the reverse diffusion process, we obtain the following sampling update:
(11) |
where the target camera viewpoints influence the generated frames through the depth-warped observations , ensuring geometric consistency during video synthesis. Throughout the diffusion reverse sampling process, we iteratively apply this procedure. Additionally, following the approach described in [21, 19], we incorporate resampling annealing to further enhance the output quality.
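The mask-guided update described in this section can be sketched as follows. This is a simplified illustration, not the authors' code: it assumes a mask convention in which `visible` is 1 where the warped frame has valid pixels and 0 in occluded regions, it works directly on arrays of the same shape, and the function names are hypothetical. In a latent model such as SVD the blend would be applied to latents with a correspondingly resized mask, which is omitted here.

```python
import numpy as np

def guided_estimate(x0_hat, x_warp, visible):
    """Keep warped content where it is visible, the model's prediction elsewhere."""
    visible = visible.astype(x0_hat.dtype)
    return visible * x_warp + (1.0 - visible) * x0_hat

def guided_euler_step(x, x0_hat, x_warp, visible, sigma, sigma_next):
    """Euler update that uses the blended estimate instead of the raw prediction."""
    x0_guided = guided_estimate(x0_hat, x_warp, visible)
    return x0_guided + (sigma_next / sigma) * (x - x0_guided)

x0_hat = np.zeros((2, 2))                # stand-in for the model's denoised estimate
x_warp = np.ones((2, 2))                 # stand-in for the depth-warped frame
visible = np.array([[True, False],
                    [False, True]])      # True where the warp produced a valid pixel

blended = guided_estimate(x0_hat, x_warp, visible)
x_next = guided_euler_step(np.full((2, 2), 5.0), x0_hat, x_warp, visible,
                           sigma=1.0, sigma_next=0.0)
print(blended)
```

Visible pixels are thus tied to the warped observation at every step, while occluded pixels are left to the diffusion prior, which matches the inpainting behavior described above.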
4 Experiments
We used SVD [5] as the I2V model for all experiments without any training. The image resolution was fixed at 576×1024, with 25 cameras and a sequence length of 25 frames, for a total of 625 = 25 × 25 multi-view video frames. All frames were generated to form a multi-view video following the target camera trajectory. For depth-based warping, we utilized off-the-shelf depth models [23, 11] with a view variation of 15 degrees.

Method | FVD | FID | LPIPS | Subject Consistency | Background Consistency | Temporal Flickering | Motion Smoothness | Dynamic Degree | Image Quality | Aesthetic Quality |
SV4D [40] | 1502.4 | 200.74 | 0.4002 | 86.46% | 92.71% | 92.04% | 97.78% | 55.00% | 97.78% | 50.36% |
GCD [29] | 1584.4 | 238.17 | 0.4947 | 91.73% | 95.55% | 98.41% | 98.86% | 20.00% | 44.55% | 42.67% |
Ours | 896.29 | 135.08 | 0.3997 | 94.18% | 94.53% | 97.35% | 98.13% | 55.77% | 50.08% | 62.09% |
Baseline Models
We selected publicly available methods as baselines. SV4D [40] is an image-to-video (I2V) model that generates multiple novel-view videos from a single input, constructing a 4D image matrix of the object. It synthesizes target-view videos by applying consistent azimuthal displacement, producing 21-frame outputs at resolution. GCD [29] also takes a single video as input, allowing the selection of specific azimuth and elevation angles to generate novel views of dynamic 4D scenes. It produces 25-frame videos at resolution. All experiments were conducted on an RTX 4090 GPU (24GB VRAM). While both SV4D and GCD require extensive training, which is impractical on a single GPU, our approach eliminates the need for training, enabling efficient execution on a single RTX 4090.

Multi-View Video
To validate our method, we conducted experiments on 10 DAVIS [24] and 20 Pexel scenes. Novel views were synthesized by adjusting the target camera’s azimuth in 15-degree increments, following the evaluation protocol in [29]. For fair comparison, videos from our method and SV4D were resized to . To assess video quality, we utilized VBench [13], which evaluates seven key aspects of video generation, including subject identity retention, motion coherence, and temporal consistency. It incorporates advanced feature extractors such as DINO [6] for consistency and MUSIQ [16] for image quality, offering a detailed performance analysis. We further compared the ability of each baseline model to generate novel view videos given a single input video. As shown in Figure 4, our method demonstrates robust novel view synthesis, whereas SV4D and GCD exhibit reduced object consistency and distorted motion. This observation is further supported by the quantitative results in Table 1, where our method outperforms others in FVD, FID, LPIPS, and Subject Consistency, demonstrating superior overall video generation quality.
Bullet-Time and Camera Control
We evaluate the generated multi-view video by controlling the camera at fixed time points within the video. To further assess 3D consistency along the camera axis at a fixed time, we utilize MEt3R [1], a metric that measures multi-view 3D consistency from unposed images using DUSt3R [35] as its backbone. Currently, no open-source multi-view video generation model explicitly supports bullet-time generation. Thus, we conducted an ablation study to analyze its impact. To comprehensively evaluate bullet-time video quality, we employed FVD, FID, and LPIPS. By incorporating both spatio-temporal bidirectional interpolation (STBI) and warping, we achieved well-synchronized multi-view videos across both camera and time axes. As shown in Table 3, our full method achieves the best FVD, FID, and LPIPS among the compared variants. When STBI was not used, each frame underwent novel view synthesis independently. While this preserved per-frame 3D consistency (MEt3R of 0.031 versus 0.039 for the full method), the absence of temporal coherence led to inferior FVD performance compared to our full method, which maintains global consistency across the entire video.
Method | FVD | FID | LPIPS | Subject Consistency | Background Consistency | Temporal Flickering | Motion Smoothness | Dynamic Degree | Image Quality | Aesthetic Quality |
Ours | 896.29 | 135.08 | 0.3997 | 94.18% | 94.53% | 97.35% | 98.13% | 55.77% | 50.08% | 62.09% |
w/o STBI | 1419.8 | 108.92 | 0.3797 | 92.57% | 93.59% | 94.49% | 94.49% | 100% | 52.93% | 67.57% |
w/o warp | 1830.5 | 205.08 | 0.4547 | 92.91% | 93.45% | 97.48% | 98.01% | 48.28% | 44.97% | 47.26% |
Method | FVD | FID | LPIPS | MEt3R |
Ours | 1470 | 106.31 | 0.3174 | 0.039 |
w/o STBI | 1607 | 108.21 | 0.3181 | 0.031 |
w/o warp | 1923 | 179.99 | 0.3617 | 0.082 |
Ablation
We conducted comparative experiments under the following conditions: (1) Without warped frame guidance (see (10)). In this setting, we excluded key frame generation and removed warped frames from the input during camera-time interpolation. As shown in Table 2, conditioning the diffusion model solely on keyframes results in less distinct and less detailed images. This effect is quantitatively evident in the table, where the absence of warped frame guidance leads to degraded performance across key metrics. (2) Without spatio-temporal bidirectional interpolation (STBI). We also examined cases where STBI was omitted, generating each novel view independently per frame. Without global aggregation along the time axis, this approach fails to maintain multi-view coherence. In contrast, our method incorporates key frame generation to capture global scene information, enabling bidirectional frame synthesis. As reflected in Table 3, this results in better synchronization and improved temporal consistency, highlighting the role of STBI in preserving video coherence.
User Study
Method | View Angle | General Quality | Smoothness | BG Quality |
Ours | 67% | 70% | 60% | 68% |
GCD | 22% | 21% | 27% | 22% |
SV4D | 11% | 9% | 13% | 10% |
To evaluate our approach, we conducted a user study comparing Ours, GCD, and SV4D across four key metrics: View Angle, General Quality, Smoothness, and Background Quality. Participants viewed generated videos and selected the most visually appealing results. As shown in Table 4, our method consistently achieved the highest user preference, particularly in General Quality (70%) and Background Quality (68%), highlighting superior fidelity and scene preservation. View Angle (67%) results confirm precise novel view synthesis, while Smoothness (60%) indicates seamless transitions with minimal distortion.
5 Conclusion
In this work, we introduced a novel training-free approach for synchronized multi-view video generation using an off-the-shelf video diffusion model. Our method generates high-quality 4D video grids through depth-based warping and spatio-temporal bidirectional interpolation, ensuring structural consistency across spatial and temporal domains. Unlike prior methods requiring extensive training on large-scale 4D datasets, our framework achieves competitive performance without any training. Extensive experiments show that our approach generates synchronized multi-view videos with superior subject consistency, motion smoothness, and temporal stability. It outperforms baselines in aesthetic quality and imaging fidelity while remaining computationally efficient. Our framework offers a practical solution for multi-view video generation, especially when large-scale 4D datasets and computational resources are scarce. Future work may explore dynamic scenes or adaptive interpolation strategies to enhance novel view synthesis.
References
- Asim et al. [2024] Mohammad Asim, Christopher Wewer, Thomas Wimmer, Bernt Schiele, and Jan Eric Lenssen. Met3r: Measuring multi-view consistency in generated images, 2024.
- Bahmani et al. [2024a] Sherwin Bahmani, Xian Liu, Wang Yifan, Ivan Skorokhodov, Victor Rong, Ziwei Liu, Xihui Liu, Jeong Joon Park, Sergey Tulyakov, Gordon Wetzstein, et al. Tc4d: Trajectory-conditioned text-to-4d generation. In European Conference on Computer Vision, pages 53–72. Springer, 2024a.
- Bahmani et al. [2024b] Sherwin Bahmani, Ivan Skorokhodov, Guocheng Qian, Aliaksandr Siarohin, Willi Menapace, Andrea Tagliasacchi, David B Lindell, and Sergey Tulyakov. Ac3d: Analyzing and improving 3d camera control in video diffusion transformers. arXiv preprint arXiv:2411.18673, 2024b.
- Bahmani et al. [2024c] Sherwin Bahmani, Ivan Skorokhodov, Victor Rong, Gordon Wetzstein, Leonidas Guibas, Peter Wonka, Sergey Tulyakov, Jeong Joon Park, Andrea Tagliasacchi, and David B Lindell. 4d-fy: Text-to-4d generation using hybrid score distillation sampling. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 7996–8006, 2024c.
- Blattmann et al. [2023] Andreas Blattmann, Tim Dockhorn, Sumith Kulal, Daniel Mendelevitch, Maciej Kilian, Dominik Lorenz, Yam Levi, Zion English, Vikram Voleti, Adam Letts, et al. Stable video diffusion: Scaling latent video diffusion models to large datasets. arXiv preprint arXiv:2311.15127, 2023.
- Caron et al. [2021] Mathilde Caron, Hugo Touvron, Ishan Misra, Hervé Jégou, Julien Mairal, Piotr Bojanowski, and Armand Joulin. Emerging properties in self-supervised vision transformers. In Proceedings of the IEEE/CVF international conference on computer vision, pages 9650–9660, 2021.
- Gao et al. [2024] Ruiqi Gao, Aleksander Holynski, Philipp Henzler, Arthur Brussee, Ricardo Martin-Brualla, Pratul Srinivasan, Jonathan T Barron, and Ben Poole. Cat3d: Create anything in 3d with multi-view diffusion models. arXiv preprint arXiv:2405.10314, 2024.
- He et al. [2024] Hao He, Yinghao Xu, Yuwei Guo, Gordon Wetzstein, Bo Dai, Hongsheng Li, and Ceyuan Yang. Cameractrl: Enabling camera control for text-to-video generation. arXiv preprint arXiv:2404.02101, 2024.
- Ho et al. [2020] Jonathan Ho, Ajay Jain, and P. Abbeel. Denoising diffusion probabilistic models. ArXiv, abs/2006.11239, 2020.
- Hong et al. [2022] Wenyi Hong, Ming Ding, Wendi Zheng, Xinghan Liu, and Jie Tang. Cogvideo: Large-scale pretraining for text-to-video generation via transformers. arXiv preprint arXiv:2205.15868, 2022.
- Hu et al. [2024] Wenbo Hu, Xiangjun Gao, Xiaoyu Li, Sijie Zhao, Xiaodong Cun, Yong Zhang, Long Quan, and Ying Shan. Depthcrafter: Generating consistent long depth sequences for open-world videos. arXiv preprint arXiv:2409.02095, 2024.
- Huang et al. [2023] Yi-Hua Huang, Yang-Tian Sun, Ziyi Yang, Xiaoyang Lyu, Yan-Pei Cao, and Xiaojuan Qi. Sc-gs: Sparse-controlled gaussian splatting for editable dynamic scenes. arXiv preprint arXiv:2312.14937, 2023.
- Huang et al. [2024] Ziqi Huang, Yinan He, Jiashuo Yu, Fan Zhang, Chenyang Si, Yuming Jiang, Yuanhan Zhang, Tianxing Wu, Qingyang Jin, Nattapol Chanpaisit, et al. Vbench: Comprehensive benchmark suite for video generative models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 21807–21818, 2024.
- Kant et al. [2024] Yash Kant, Aliaksandr Siarohin, Ziyi Wu, Michael Vasilkovsky, Guocheng Qian, Jian Ren, Riza Alp Guler, Bernard Ghanem, Sergey Tulyakov, and Igor Gilitschenski. Spad: Spatially aware multi-view diffusers. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 10026–10038, 2024.
- Karras et al. [2022] Tero Karras, Miika Aittala, Timo Aila, and Samuli Laine. Elucidating the design space of diffusion-based generative models. Advances in neural information processing systems, 35:26565–26577, 2022.
- Ke et al. [2021] Junjie Ke, Qifei Wang, Yilin Wang, Peyman Milanfar, and Feng Yang. Musiq: Multi-scale image quality transformer. In Proceedings of the IEEE/CVF international conference on computer vision, pages 5148–5157, 2021.
- Kerbl et al. [2023] Bernhard Kerbl, Georgios Kopanas, Thomas Leimkühler, and George Drettakis. 3d gaussian splatting for real-time radiance field rendering. ACM Trans. Graph., 42(4):139–1, 2023.
- Li et al. [2022] Tianye Li, Mira Slavcheva, Michael Zollhoefer, Simon Green, Christoph Lassner, Changil Kim, Tanner Schmidt, Steven Lovegrove, Michael Goesele, Richard Newcombe, et al. Neural 3d video synthesis from multi-view video. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 5521–5531, 2022.
- Liu et al. [2024] Kunhao Liu, Ling Shao, and Shijian Lu. Novel view extrapolation with video diffusion priors. arXiv preprint arXiv:2411.14208, 2024.
- Liu et al. [2023] Ruoshi Liu, Rundi Wu, Basile Van Hoorick, Pavel Tokmakov, Sergey Zakharov, and Carl Vondrick. Zero-1-to-3: Zero-shot one image to 3d object. In Proceedings of the IEEE/CVF international conference on computer vision, pages 9298–9309, 2023.
- Lugmayr et al. [2022] Andreas Lugmayr, Martin Danelljan, Andres Romero, Fisher Yu, Radu Timofte, and Luc Van Gool. Repaint: Inpainting using denoising diffusion probabilistic models. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 11461–11471, 2022.
- Melas-Kyriazi et al. [2024] Luke Melas-Kyriazi, Iro Laina, Christian Rupprecht, Natalia Neverova, Andrea Vedaldi, Oran Gafni, and Filippos Kokkinos. Im-3d: Iterative multiview diffusion and reconstruction for high-quality 3d generation. arXiv preprint arXiv:2402.08682, 2024.
- Piccinelli et al. [2024] Luigi Piccinelli, Yung-Hsu Yang, Christos Sakaridis, Mattia Segu, Siyuan Li, Luc Van Gool, and Fisher Yu. Unidepth: Universal monocular metric depth estimation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 10106–10116, 2024.
- Pont-Tuset et al. [2017] Jordi Pont-Tuset, Federico Perazzi, Sergi Caelles, Pablo Arbeláez, Alex Sorkine-Hornung, and Luc Van Gool. The 2017 davis challenge on video object segmentation. arXiv preprint arXiv:1704.00675, 2017.
- Rombach et al. [2021] Robin Rombach, A. Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. High-resolution image synthesis with latent diffusion models. 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 10674–10685, 2021.
- Shi et al. [2023] Yichun Shi, Peng Wang, Jianglong Ye, Mai Long, Kejie Li, and Xiao Yang. Mvdream: Multi-view diffusion for 3d generation. arXiv preprint arXiv:2308.16512, 2023.
- Singer et al. [2023] Uriel Singer, Shelly Sheynin, Adam Polyak, Oron Ashual, Iurii Makarov, Filippos Kokkinos, Naman Goyal, Andrea Vedaldi, Devi Parikh, Justin Johnson, et al. Text-to-4d dynamic scene generation. arXiv preprint arXiv:2301.11280, 2023.
- Sun et al. [2024] Wenqiang Sun, Shuo Chen, Fangfu Liu, Zilong Chen, Yueqi Duan, Jun Zhang, and Yikai Wang. Dimensionx: Create any 3d and 4d scenes from a single image with controllable video diffusion. arXiv preprint arXiv:2411.04928, 2024.
- Van Hoorick et al. [2024a] Basile Van Hoorick, Rundi Wu, Ege Ozguroglu, Kyle Sargent, Ruoshi Liu, Pavel Tokmakov, Achal Dave, Changxi Zheng, and Carl Vondrick. Generative camera dolly: Extreme monocular dynamic novel view synthesis. In European Conference on Computer Vision, pages 313–331. Springer, 2024a.
- Van Hoorick et al. [2024b] Basile Van Hoorick, Rundi Wu, Ege Ozguroglu, Kyle Sargent, Ruoshi Liu, Pavel Tokmakov, Achal Dave, Changxi Zheng, and Carl Vondrick. Generative camera dolly: Extreme monocular dynamic novel view synthesis. In European Conference on Computer Vision, pages 313–331. Springer, 2024b.
- Wang et al. [2024a] Chaoyang Wang, Peiye Zhuang, Tuan Duc Ngo, Willi Menapace, Aliaksandr Siarohin, Michael Vasilkovsky, Ivan Skorokhodov, Sergey Tulyakov, Peter Wonka, and Hsin-Ying Lee. 4real-video: Learning generalizable photo-realistic 4d video diffusion. arXiv preprint arXiv:2412.04462, 2024a.
- Wang and Shi [2023] Peng Wang and Yichun Shi. Imagedream: Image-prompt multi-view diffusion for 3d generation. arXiv preprint arXiv:2312.02201, 2023.
- Wang et al. [2024b] Qianqian Wang, Vickie Ye, Hang Gao, Jake Austin, Zhengqi Li, and Angjoo Kanazawa. Shape of motion: 4d reconstruction from a single video. arXiv preprint arXiv:2407.13764, 2024b.
- Wang et al. [2025] Qianqian Wang, Yifei Zhang, Aleksander Holynski, Alexei A. Efros, and Angjoo Kanazawa. Continuous 3d perception model with persistent state, 2025.
- Wang et al. [2024c] Shuzhe Wang, Vincent Leroy, Yohann Cabon, Boris Chidlovskii, and Jerome Revaud. Dust3r: Geometric 3d vision made easy. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 20697–20709, 2024c.
- Wang et al. [2024d] Zhouxia Wang, Ziyang Yuan, Xintao Wang, Yaowei Li, Tianshui Chen, Menghan Xia, Ping Luo, and Ying Shan. Motionctrl: A unified and flexible motion controller for video generation. In ACM SIGGRAPH 2024 Conference Papers, pages 1–11, 2024d.
- Watson et al. [2024] Daniel Watson, Saurabh Saxena, Lala Li, Andrea Tagliasacchi, and David J Fleet. Controlling space and time with diffusion models. arXiv preprint arXiv:2407.07860, 2024.
- Wu et al. [2024a] Guanjun Wu, Taoran Yi, Jiemin Fang, Lingxi Xie, Xiaopeng Zhang, Wei Wei, Wenyu Liu, Qi Tian, and Xinggang Wang. 4d gaussian splatting for real-time dynamic scene rendering. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 20310–20320, 2024a.
- Wu et al. [2024b] Rundi Wu, Ruiqi Gao, Ben Poole, Alex Trevithick, Changxi Zheng, Jonathan T Barron, and Aleksander Holynski. Cat4d: Create anything in 4d with multi-view video diffusion models. arXiv preprint arXiv:2411.18613, 2024b.
- Xie et al. [2024] Yiming Xie, Chun-Han Yao, Vikram Voleti, Huaizu Jiang, and Varun Jampani. Sv4d: Dynamic 3d content generation with multi-frame and multi-view consistency. arXiv preprint arXiv:2407.17470, 2024.
- Xu et al. [2024] Dejia Xu, Weili Nie, Chao Liu, Sifei Liu, Jan Kautz, Zhangyang Wang, and Arash Vahdat. Camco: Camera-controllable 3d-consistent image-to-video generation. arXiv preprint arXiv:2406.02509, 2024.
- Yang et al. [2024] Serin Yang, Taesung Kwon, and Jong Chul Ye. Vibidsampler: Enhancing video interpolation using bidirectional diffusion sampler. arXiv preprint arXiv:2410.05651, 2024.
- Yang et al. [2023a] Ziyi Yang, Xinyu Gao, Wen Zhou, Shaohui Jiao, Yuqing Zhang, and Xiaogang Jin. Deformable 3d gaussians for high-fidelity monocular dynamic scene reconstruction. arXiv preprint arXiv:2309.13101, 2023a.
- Yang et al. [2023b] Zeyu Yang, Hongye Yang, Zijie Pan, Xiatian Zhu, and Li Zhang. Real-time photorealistic dynamic scene representation and rendering with 4d gaussian splatting. ArXiv, abs/2310.10642, 2023b.
- You et al. [2024] Meng You, Zhiyu Zhu, Hui Liu, and Junhui Hou. Nvs-solver: Video diffusion model as zero-shot novel view synthesizer. arXiv preprint arXiv:2405.15364, 2024.
- Yu et al. [2024] Heng Yu, Chaoyang Wang, Peiye Zhuang, Willi Menapace, Aliaksandr Siarohin, Junli Cao, Laszlo A Jeni, Sergey Tulyakov, and Hsin-Ying Lee. 4real: Towards photorealistic 4d scene generation via video diffusion models. arXiv preprint arXiv:2406.07472, 2024.
- Zeng et al. [2024] Yifei Zeng, Yanqin Jiang, Siyu Zhu, Yuanxun Lu, Youtian Lin, Hao Zhu, Weiming Hu, Xun Cao, and Yao Yao. Stag4d: Spatial-temporal anchored generative 4d gaussians. In European Conference on Computer Vision, pages 163–179. Springer, 2024.
- Zhang et al. [2024a] David Junhao Zhang, Roni Paiss, Shiran Zada, Nikhil Karnad, David E Jacobs, Yael Pritch, Inbar Mosseri, Mike Zheng Shou, Neal Wadhwa, and Nataniel Ruiz. Recapture: Generative video camera controls for user-provided videos using masked video fine-tuning. arXiv preprint arXiv:2411.05003, 2024a.
- Zhang et al. [2024b] Junyi Zhang, Charles Herrmann, Junhwa Hur, Varun Jampani, Trevor Darrell, Forrester Cole, Deqing Sun, and Ming-Hsuan Yang. Monst3r: A simple approach for estimating geometry in the presence of motion. arXiv preprint arXiv:2410.03825, 2024b.
- Zhao et al. [2023] Yuyang Zhao, Zhiwen Yan, Enze Xie, Lanqing Hong, Zhenguo Li, and Gim Hee Lee. Animate124: Animating one image to 4d dynamic scene. arXiv preprint arXiv:2311.14603, 2023.
Supplementary Material
6 Pseudocode implementation of Zero4D
The following algorithm outlines the process of generating multi-view videos from a single monocular video. We begin by applying novel view synthesis to the first frame of the input video, using an I2V diffusion model [5] to generate the novel-view frame. To guide this process, we incorporate warping-based priors from the original video, which enables inpainting-based synthesis. Using an off-the-shelf depth estimation model [23], we warp the original frame to novel viewpoints, as illustrated in Figure 6. As shown in the figure, regions occluded by the warp operation appear black, which allows us to extract an opacity mask. Inspired by [21, 45, 19], we adopt a mask inpainting approach in which inpainting is performed on the estimated noisy frame. Rather than applying inpainting at every denoising step, as in [19], we use a re-noising process within the diffusion model's denoising step to refine the final synthesis by reducing artifacts and enhancing structural coherence. Once this frame is generated, we apply the same inpainting and re-noising strategy to synthesize the subsequent novel-view frame. Furthermore, during the bidirectional interpolation process, we incorporate an additional re-noising step to enhance temporal and spatial consistency.
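As an informal illustration of the mask inpainting and re-noising described above, the following NumPy sketch shows one possible form of the three operations: extracting an opacity mask from black (occluded) pixels, blending the warped frame into the model's current frame estimate, and re-noising the result. It is a minimal sketch under stated assumptions, not the implementation used in this work: the function names are hypothetical, the arrays live in pixel space rather than in a latent space, the frame estimate is a constant stand-in for a real denoiser output, and the re-noising uses a DDPM-style variance-preserving form that the actual video diffusion model may not share.

```python
import numpy as np


def opacity_mask_from_warp(warped, eps=1e-6):
    """Treat pixels left black by the warp as occluded (mask 0), others as visible (mask 1).

    Assumption: genuinely black scene content would also be masked out.
    """
    return (warped.sum(axis=-1, keepdims=True) > eps).astype(warped.dtype)


def inpaint_estimate(frame_est, warped, mask):
    """Keep warped pixels where they are visible; keep the model's estimate elsewhere."""
    return mask * warped + (1.0 - mask) * frame_est


def renoise(frame, alpha_bar_t, rng):
    """Add Gaussian noise back to a frame (DDPM-style form, assumed for illustration)."""
    noise = rng.standard_normal(frame.shape)
    return np.sqrt(alpha_bar_t) * frame + np.sqrt(1.0 - alpha_bar_t) * noise


rng = np.random.default_rng(0)

# Toy warped frame: left half carries warped content, right half is occluded (black).
warped = np.zeros((4, 4, 3))
warped[:, :2] = 0.5

mask = opacity_mask_from_warp(warped)

# Stand-in for the diffusion model's current frame estimate (hypothetical values).
frame_est = np.full((4, 4, 3), 0.9)

blended = inpaint_estimate(frame_est, warped, mask)
x_t = renoise(blended, alpha_bar_t=0.7, rng=rng)
```

In this toy setup the blended frame keeps the warped content in the visible half and falls back to the model's estimate in the occluded half; the re-noised result would then be passed back to the denoiser for further refinement.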
Input: input video, warped frames, opacity mask
Output: multi-view video
