Zero4D: Training-Free 4D Video Generation From Single Video Using Off-the-Shelf Video Diffusion Model

Jangho Park, Taesung Kwon, Jong Chul Ye
KAIST
{jhq1234, star.kwon, jong.ye}@kaist.ac.kr
Abstract

Recently, multi-view or 4D video generation has emerged as a significant research topic. Nonetheless, recent approaches to 4D generation still struggle with fundamental limitations: they rely either on combining multiple video diffusion models with additional training, or on compute-intensive training of a full 4D diffusion model with limited real-world 4D data. To address these challenges, we propose the first training-free 4D video generation method that leverages off-the-shelf video diffusion models to generate multi-view videos from a single input video. Our approach consists of two key steps: (1) By designating the edge frames in the spatio-temporal sampling grid as key frames, we first synthesize them using a video diffusion model, with a depth-based warping technique as guidance. This ensures structural consistency across the generated frames. (2) We then interpolate the remaining frames using a video diffusion model, constructing a fully populated sampling grid while preserving spatial and temporal consistency. Through this approach, we extend a single video into a multi-view video along novel camera trajectories while maintaining spatio-temporal consistency. Our method is training-free and fully utilizes an off-the-shelf video diffusion model, offering a practical and effective solution for multi-view video generation.

Figure 1: Zero4D is a training-free, multi-view synchronized video generation framework that takes a single monocular video and generates a grid of camera-time consistent frames. It first uses a depth estimation model to warp the input video into the target views (top-left), then leverages an image-to-video diffusion model to sample multi-view frames synchronized in both the camera and temporal dimensions (top-right). Using a single GPU and off-the-shelf video diffusion models without training, our approach can generate multi-view videos for both synthesized and real-world footage.

1 Introduction

Since the introduction of the diffusion and foundation models [9, 25, 40], 3D reconstruction has advanced significantly, leading to unprecedented progress in representing the real world in 3D models. Combined with generative models, this success drives a renaissance in 3D generation, enabling more diverse and realistic content creation. These advancements extend beyond static scene or object reconstruction and generation, evolving toward dynamic 3D reconstruction and generation that aims to capture the real world.

Prior works [4, 47, 27, 50, 2] leverage video diffusion models and Score Distillation Sampling (SDS) to enable dynamic 3D generation. However, most of these approaches focus on generating dynamic objects against blank backgrounds (e.g., text-to-4D generation), leaving open the challenge of reconstructing or generating real-world scenes from text conditions, reference images, or video. Moreover, unlike the abundance of high-quality 3D and video datasets, the extensive 4D datasets required for such models remain scarce. A fundamental challenge in training 4D generation models for real-world scenes is therefore the lack of large-scale, multi-view synchronized video datasets.

To overcome these limitations, recent works such as 4DiM [37] propose jointly training a diffusion model on 3D and video data together with scarce 4D data. CAT4D [39] trains multi-view video diffusion models on a curated, diverse collection of synthetic 4D data, 3D datasets, and monocular video sources. DimensionX [28] trains the spatial-temporal diffusion model independently with multiple LoRAs, achieving multi-view videos via an additional refinement process. Despite these efforts, the scarcity of high-quality 4D data makes it difficult to generalize to complex real-world scenes and poses fundamental challenges in training large multi-view video models.

To address these challenges, we introduce a novel zero-shot framework called Zero4D, short for Zero-shot 4D video generator, which generates multi-view synchronized 4D video from a single monocular video by leveraging an off-the-shelf video diffusion model [5] without any training. Building upon the prior observation [31, 39] that a 4D video is composed of multiple video frames arranged along the spatio-temporal sampling grid (i.e., the camera view and time axes), generating a 4D video can be regarded as populating the sampling grid with consistent spatio-temporal frames. Our approach achieves this through two key steps: (1) We first designate the edge frames in the sampling grid as key frames and synthesize them. Specifically, we employ a depth-based warping technique as guidance in a video diffusion model, ensuring that the generated frames adhere to the underlying scene structure. (2) Inspired by ViBiDSampler [42], we then leverage the interpolation capabilities of a video model to fill in the remaining frames through bidirectional diffusion sampling, yielding a fully populated and temporally coherent 4D grid. During these steps, our method imposes both spatial and temporal consistency across the grid. Our main contributions can be summarized as follows:

  • We propose a novel framework that can generate 4D video from a single video via an off-the-shelf video diffusion model without any training or large-scale datasets. To the best of our knowledge, our approach is the first training-free method to generate synchronized multi-view video.

  • We introduce a synchronization mechanism to ensure high-quality generation while preserving global spatio-temporal consistency. This is achieved through alternating bidirectional video interpolation sampling along both the camera and temporal axes, effectively aligning motion and appearance across frames.

  • Our approach is computationally efficient, enabling high-quality multi-view video generation with minimal memory consumption, making it significantly more accessible than previous methods.

2 Related Work

Figure 2: Reconstruction pipeline of Zero4D: (a) Key frame generation step: (a1) Given a fixed-viewpoint grayscale input video, we generate synchronized multi-view videos using the I2V diffusion model. (a2) We first synthesize key frames of the 4D grid through diffusion sampling, guided by warped views. (a3) Next, we generate the end-view video frames using a process similar to (a2), guided by warped views. (a4) Finally, we complete the rightmost column using diffusion-based interpolation sampling. (b) Spatio-temporal bidirectional interpolation step: Starting from the initial noise in (a4), we denoise the remaining frames through a camera-axis interpolation denoising step. At this stage, the noisy frame $x_t[:,i]$ is updated based on edge frames along the camera axis. Then, the time-axis interpolation step follows, where perturbed frames are denoised through interpolation along the time axis of the 4D grid, conditioned on edge frames in the time axis. The spatio-temporal bidirectional sampling alternates between camera-axis interpolation and time-axis interpolation, progressively refining noisy latents into clean frames. Through this process, we obtain globally coherent spatio-temporal 4D grid videos.

Dynamic 3D reconstruction. Numerous studies have explored methods for representing the dynamic nature of the real world in 3D models. DyNeRF [18] introduced 4D novel view synthesis using multi-view video inputs. Building on this foundation, subsequent research has focused on 4D reconstruction using posed multi-view inputs [38, 43, 12, 44]. However, all these methods require a large number of multi-view images with known camera parameters, which poses a significant challenge. With the success of DUSt3R [35], which learns 3D representations in a self-supervised manner from unposed images without requiring camera parameters, many subsequent models have attempted to reconstruct real-world dynamic scenes based on this foundational approach. MonST3R [49], built on DUSt3R, enables the reconstruction of both dynamic point clouds and camera poses from unposed single-frame video inputs. Shape-of-Motion [33] attempts 4D reconstruction by tracking points from dynamic motion in single-video inputs without camera parameters. CUT3R [34] reconstructs dynamic scenes from single-video inputs by leveraging state updates of visual tokens within a ViT encoder.

Video generation with camera control. Several studies try to train a multi-view diffusion model for spatially consistent image generation [26, 32, 20, 14, 7, 22]. CamCo [41] fine-tunes a pre-trained video diffusion model by injecting Plücker embedding vectors into specific layers of the model. ReCapture [48] trains a video diffusion model to produce novel camera trajectories from a single reference video while retaining the existing scene motion. It uses multiple trainable LoRA layers on a video model, with camera parameter labels, to regenerate the anchor video into a natural and temporally consistent novel-view video. AC3D [3] enhances Video DiT [10] with ControlNet-based camera conditioning, keeping the large VDiT backbone frozen for video synthesis while using lightweight modules to inject camera information. CameraCtrl [8] proposes a plug-and-play camera module for video diffusion models to control video generation with precise and smooth camera viewpoints.

4D generation. Recent advancements in text-to-4D generation have been driven by numerous pioneering works exploring various conditioning methods. Among these, several approaches have leveraged score distillation sampling in conjunction with video diffusion models or multi-view image diffusion models to generate 4D content from text prompts [4, 47, 27, 50, 2]. However, these approaches largely focus on generating dynamic objects in blank backgrounds. Generating full 4D scenes under given constraints has recently emerged as an important research direction with limited prior work. A notable example is [39], which synthesizes 4D videos conditioned on multiple input modalities using a multi-view video model. They extend an image diffusion model into a multi-view video diffusion framework, trained on a carefully curated spatio-temporal dataset. To ensure consistency across viewpoints, they employ an alternating sampling strategy. Similarly, [30] introduces a framework for novel view synthesis of dynamic 4D scenes from a single video. This method is trained on synthetic multi-view video data with corresponding camera poses, enabling high-fidelity 4D reconstructions. Concurrently, [46] propose text-to-4D scene generation pipelines that integrate video diffusion models with canonical 3D Gaussian Splatting (3DGS) [17], ensuring spatio-temporal consistency in the generated 4D outputs. Furthermore, [31] enhances video diffusion models by introducing a parallel camera-temporal token stream and a learnable synchronization layer, which effectively fuses independent tokens to maintain camera and temporal consistency across generated frames.

3 Zero4D

Let $x[i,j] \in \mathbb{R}^{H \times W}$, $i = 1, \cdots, N$, $j = 1, \cdots, F$, denote the image at the $i$-th camera viewpoint and the $j$-th temporal frame, where $H$ and $W$ denote the height and width of the image, respectively (see Fig. 2(a)). The input video captured from a single camera viewpoint $c$ is then denoted by $x[c,:]$, whereas the multi-view images at temporal frame $f$ are represented by $x[:,f]$. The goal of Zero4D is to populate the spatio-temporal sampling grid (or camera-time grid) $x[:,:]$ by generating frames across multiple camera poses.

As illustrated in Fig. 2, the overall reconstruction pipeline of Zero4D is composed of two steps: 1) key frame generation and 2) spatio-temporal bidirectional interpolation along the time and camera axes in an alternating manner. In this section, we describe each step in detail.
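
To make the grid notation concrete, the following is a minimal NumPy sketch of the camera-time grid and of which cells are treated as key frames. It is an illustration rather than the authors' code: the function name `key_frame_mask` and the toy grid size are hypothetical, and arrays are 0-indexed, so the paper's $x[1,:]$ corresponds to row 0.

```python
import numpy as np

def key_frame_mask(num_views: int, num_frames: int) -> np.ndarray:
    """Boolean (N, F) mask marking the grid cells treated as key frames.

    Rows index camera views and columns index time, so the key frames are
    the four edges of the grid.
    """
    mask = np.zeros((num_views, num_frames), dtype=bool)
    mask[0, :] = True    # x[1, :]  input video
    mask[:, 0] = True    # x[:, 1]  novel views at the first frame
    mask[-1, :] = True   # x[N, :]  end-view video
    mask[:, -1] = True   # x[:, F]  novel views at the last frame
    return mask

# Toy grid with N = 5 views and F = 6 frames (sizes chosen for illustration).
mask = key_frame_mask(num_views=5, num_frames=6)
interior = int((~mask).sum())  # cells left for bidirectional interpolation
```

In this toy grid, the 18 edge cells are produced by key frame generation and the remaining 12 interior cells are left for the interpolation step.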

3.1 Key Frame Generation

As shown in Fig. 2(a), key frame generation is achieved through three successive steps. Specifically, given a fixed-viewpoint grayscale input video denoted by $x[1,:]$, we first perform novel view synthesis, followed by end-view video frame generation. These two steps are achieved through diffusion sampling, guided by warped views. Finally, we complete the rightmost column using diffusion-based interpolation sampling.

Novel view synthesis (a2). In the first stage of key frame generation, we synthesize novel views $x[:,1]$ from the first frame $x[1,1]$ using the I2V diffusion model. We incorporate the warped frames $x_w[:,1]$ as guidance to ensure that the generated novel views align with the images obtained by warping the input video.

The warped frames $x_w[:,:]$ are computed as follows. Given an input video $x[1,:]$, we first estimate a per-frame depth map $D[1,:]$ using a monocular depth estimation model [23]. This depth information enables depth-based geometric warping, where each frame of the input video is unprojected into 3D space and reprojected into a target viewpoint $p(n) \in \mathcal{P}_N$, where $\mathcal{P}_N$ defines the desired set of camera views. This produces the warped frames:

$$x_w[n,:] = \mathcal{W}(x[1,:], D[1,:], p(n), K), \tag{1}$$

for $n = 1, \cdots, N$, where $K$ is the intrinsic camera matrix. The warping function $\mathcal{W}(\cdot)$ unprojects each pixel using its estimated depth and reprojects it into the target view. Formally, for each pixel location $r_i$ in the $i$-th view, the warped pixel location $r_j$ in the novel view at the $j$-th camera location is computed as:

$$r_j = K P_{i \to j} D_i(r_i) K^{-1} r_i, \tag{2}$$

where $P_{i \to j}$ is the transformation from the input view to the novel view, and $D_i(r_i)$ is the depth at $r_i$. Since $r_j$ may not align exactly with integer pixel locations, interpolation is applied to assign pixel values.
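
As a hedged illustration of Eq. (2), the sketch below maps every source pixel to its target-view coordinates with NumPy. It makes several assumptions that the text does not spell out: pixels are lifted to homogeneous coordinates, $P_{i \to j}$ is a 4×4 rigid transform, both views share the intrinsics $K$, and a perspective divide follows the final projection. The function name `reproject_pixels` and the toy camera values are hypothetical.

```python
import numpy as np

def reproject_pixels(depth, K, P_src_to_tgt):
    """Map every source pixel to target-view pixel coordinates (cf. Eq. 2).

    depth:        (H, W) depth map of the source view
    K:            (3, 3) camera intrinsics, assumed shared by both views
    P_src_to_tgt: (4, 4) rigid transform from source to target camera frame
    Returns an (H, W, 2) array of (u, v) coordinates in the target view.
    """
    H, W = depth.shape
    u, v = np.meshgrid(np.arange(W), np.arange(H))
    pix = np.stack([u, v, np.ones_like(u)], axis=-1).astype(np.float64)
    rays = pix @ np.linalg.inv(K).T               # K^{-1} r_i
    pts = rays * depth[..., None]                 # D_i(r_i) K^{-1} r_i
    pts_h = np.concatenate([pts, np.ones((H, W, 1))], axis=-1)
    pts_tgt = (pts_h @ P_src_to_tgt.T)[..., :3]   # apply P_{i -> j}
    proj = pts_tgt @ K.T                          # project with K
    return proj[..., :2] / proj[..., 2:3]         # perspective divide

# Toy setup: focal length 100, constant depth 2, small shift along x.
K = np.array([[100.0, 0.0, 2.0], [0.0, 100.0, 2.0], [0.0, 0.0, 1.0]])
depth = np.full((4, 4), 2.0)
shift = np.eye(4)
shift[0, 3] = 0.1
coords = reproject_pixels(depth, K, shift)
```

With these toy values, every pixel moves horizontally by $100 \cdot 0.1 / 2 = 5$ pixels and keeps its vertical coordinate. Assigning colors at the resulting non-integer locations, and detecting the holes that form the occlusion mask, are left out of this sketch.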

However, missing regions (e.g., occlusions from depth-based projection) often appear in $x_w$. To address this, we use a video diffusion model parameterized by $\theta$ to inpaint the missing regions and ensure consistency within the 4D video grid. This can be viewed as conditional sampling given the warped image, the occlusion mask, and the input video conditioning. For novel view synthesis at the temporal frame index $j = 1$, this corresponds to

$$x[:,1] \sim p_\theta(x[:,1] \mid x_w[:,1], m_w[:,1], c[1,1]), \tag{3}$$

where $p_\theta$ is the conditional distribution from the trained diffusion model, $m_w[:,:]$ is an occlusion mask that identifies missing pixels, and $c[1,1]$ provides the conditioning input from $x[1,1]$. The specific details of conditional video diffusion sampling are described in Section 3.3.

Algorithm 1 $I_\theta$: A sampling step of the extended ViBiDSampler for bidirectional interpolation.
function $I_\theta(x_t, \sigma_t, c_{start}, c_{end}, x_w)$
    $\hat{x}_{c_{start}} \leftarrow D_\theta(x_t; \sigma_t, c_{start})$  ▷ EDM denoising
    $\bar{x}_{c_{start}} \leftarrow \hat{x}_{c_{start}} \cdot m + x_w \cdot (1 - m)$
    $x_{t-1,c_{start}} \leftarrow \bar{x}_{c_{start}} + \frac{\sigma_{t-1}}{\sigma_t}(x_t - \hat{x}_{\emptyset})$
    $x_{t,c_{start}} \leftarrow x_{t-1,c_{start}} + \sqrt{\sigma_t^2 - \sigma_{t-1}^2}\,\epsilon$  ▷ Re-noise
    $x'_{t,c_{start}} \leftarrow \text{flip}(x_{t,c_{start}})$  ▷ Time reverse
    $\hat{x}'_{c_{end}} \leftarrow D_\theta(x'_{t,c_{start}}; \sigma_t, c_{end})$  ▷ EDM denoising
    $\bar{x}'_{c_{end}} \leftarrow \hat{x}'_{c_{end}} \cdot m + x_w \cdot (1 - m)$
    $x'_{t-1} \leftarrow \bar{x}'_{c_{end}} + \frac{\sigma_{t-1}}{\sigma_t}(x'_{t,c_{start}} - \hat{x}'_{\emptyset})$
    $x_{t-1} \leftarrow \text{flip}(x'_{t-1})$  ▷ Time reverse
    return $x_{t-1}$
end function
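
The steps of Algorithm 1 can be sketched in NumPy as follows. This is an illustrative sketch under stated assumptions, not the authors' implementation: `denoise` is a hypothetical stand-in for $D_\theta$ (called with `cond=None` to obtain the unconditional estimate $\hat{x}_{\emptyset}$), the mask `m` is passed explicitly, and the warped frames and mask are flipped in the backward pass so that they stay aligned with the reversed sequence, which the pseudocode does not state.

```python
import numpy as np

def bidirectional_step(x_t, sigma_t, sigma_prev, c_start, c_end, x_w, m,
                       denoise, rng):
    """One step of Algorithm 1 on a frame sequence x_t of shape (L, H, W).

    denoise(x, sigma, cond) stands in for D_theta; cond=None requests the
    unconditional estimate. m is 1 where the warp has holes (keep the model
    output) and 0 where the warped frame x_w is valid.
    """
    def guided_euler(x, cond, warp, mask):
        x_hat = denoise(x, sigma_t, cond)            # EDM denoising
        x_bar = x_hat * mask + warp * (1.0 - mask)   # inject warped pixels
        x_null = denoise(x, sigma_t, None)           # unconditional estimate
        return x_bar + (sigma_prev / sigma_t) * (x - x_null)

    # Forward pass conditioned on the start frame.
    x_prev = guided_euler(x_t, c_start, x_w, m)
    # Re-noise from sigma_prev back to sigma_t.
    noise = rng.standard_normal(x_t.shape)
    x_renoised = x_prev + np.sqrt(sigma_t**2 - sigma_prev**2) * noise
    # Reverse the sequence so the end frame comes first.
    x_flipped = x_renoised[::-1]
    # Backward pass conditioned on the end frame (guidance flipped to match).
    x_prev_flipped = guided_euler(x_flipped, c_end, x_w[::-1], m[::-1])
    return x_prev_flipped[::-1]

# Toy usage with a fake denoiser that returns a constant field.
rng = np.random.default_rng(0)
toy_denoise = lambda x, sigma, cond: (
    np.zeros_like(x) if cond is None else np.full_like(x, cond))
x_w = rng.standard_normal((4, 2, 2))
out = bidirectional_step(rng.standard_normal((4, 2, 2)), 1.0, 0.0,
                         0.2, 0.8, x_w, np.zeros((4, 2, 2)), toy_denoise, rng)
```

In the toy call the mask is zero everywhere and the target noise level is zero, so the step simply returns the warped frames. That gives a quick sanity check that the two flips cancel.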

End view video generation (a3). Similarly, we synthesize the end-view video $x[N,:]$ from the generated view $x[N,1]$ using warp-guided diffusion sampling:

$$x[N,:] \sim p_\theta(x[N,:] \mid x_w[N,:], m_w[N,:], c[N,1]). \tag{4}$$

This process follows the same video sampling approach as first-frame novel view synthesis; however, it differs in that it synthesizes the video from the final camera position.

End frame novel view synthesis (a4). Finally, we generate the novel views at the end frame, $x[:,F]$, which constitute the rightmost column of the 4D grid in Fig. 2(a). Given that we already have $x[1,F]$ from the input video and the synthesized end-view frame $x[N,F]$ derived from $x[N,:]$, we incorporate both images to enhance consistency. To this end, we employ the idea of ViBiDSampler [42], a state-of-the-art video interpolation method that allows simultaneous conditioning on both $c[1,F]$ and $c[N,F]$. As described in Algorithm 1, our main extension over ViBiDSampler is that we additionally incorporate the conditions from the warped image and its mask. Accordingly, we synthesize the last column $x[:,F]$ using bidirectional video diffusion sampling:

$$x_{t-1}[:,F] = I_\theta\big(x_t[:,F], \sigma_t, c[1,F], c[N,F], x_w[:,F]\big) \quad \text{for} \quad t = T \to 0, \tag{5}$$

where $I_\theta$ denotes the one-step bidirectional video interpolation using the extended version of ViBiDSampler. The final novel-view frames $x[:,F]$ are obtained by iteratively applying $I_\theta$ over the diffusion time steps $t = T \to 0$.

3.2 Spatio-Temporal Bidirectional Interpolation

As shown in Fig. 2(b), once the key frames are generated, the remaining task is to fill in the missing cells at the center of the sampling grid so that the resulting 4D video remains consistent across both the camera and time axes. Accordingly, it is essential to perform conditioned sampling using the key frames and adjacent frames from the camera and temporal axes. However, a naive image-to-video diffusion model can only condition on a single frame or on two end frames.

To address this challenge, inspired by ViBiDSampler [42], we propose a novel Spatio-Temporal Bidirectional Interpolation (STBI), which enables simultaneous conditioning on both the camera and time axes. The key idea is alternating the aforementioned bidirectional sampling along both camera and time axes so that the overall diffusion sampling trajectory is driven to satisfy multiple conditions from the key frames. More details follow.

Camera axis interpolation. Starting from the initial noise $x_T[:,:] \sim \mathcal{N}(0, I)$ (line 9 in Algorithm 2), we select a specific frame in the 4D grid (a column) $x_t[:,i]$ and perform an interpolation denoising step using the edge-frame conditions $c[1,i]$ and $c[N,i]$:

$$x_{t-1}[:,i] \leftarrow I_\theta(x_t[:,i], \sigma_t, c[1,i], c[N,i], x_w[:,i]). \tag{6}$$

In this process, the image condition $c[1,i]$ is applied first, along with the warped view, to guide the diffusion denoising step. The video is then perturbed with noise again, flipped along the camera axis, and subjected to another diffusion denoising step using $c[N,i]$ as the condition. Through these two conditioning steps, $x_t[:,i]$ integrates information from both $c[1,i]$ and $c[N,i]$, enabling interpolation-based denoising that preserves consistency across the camera axis. Before proceeding with time axis interpolation, we apply a re-noising step to ensure smooth transitions across generated frames.

Time axis interpolation. After ensuring spatial consistency across the camera axis, we interpolate frames along the time axis to maintain temporal coherence. For each row $x_t[j,:]$ in the 4D grid, we perform an interpolation denoising step using the start and end frame conditions $c[j,1]$ and $c[j,F]$:

$$x_{t-1}[j,:] \leftarrow I_\theta\bigl(x_t[j,:], \sigma_t, c[j,1], c[j,F], x_w[j,:]\bigr). \tag{7}$$

Initially, $c[j,1]$ is applied along with the warped view to guide the diffusion denoising step. The frame sequence is then perturbed with noise, flipped along the time axis, and another diffusion denoising step is performed using $c[j,F]$ as the condition. Through this bidirectional conditioning process, $x_{t}[j,:]$ effectively integrates information from both $c[j,1]$ and $c[j,F]$, facilitating interpolation-based denoising that ensures smooth transitions along the time axis.

Throughout the reverse diffusion process, we perform denoising by alternating interpolation along the camera axis and the time axis. This approach maintains global coherence while ensuring consistency in multi-view video generation. The overall sampling procedure of Zero4D is summarized in Algorithm 2 in Appendix Section 6.
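The alternation between the two axes can be illustrated on a small grid. The sketch below is only a structural illustration: `toy_interp` is a hypothetical stand-in for $I_{\theta}$ that nudges a sequence toward a linear blend of its two end conditions, the re-noising between passes is omitted, indices are zero-based, and the border rows and columns of `keys` play the role of the key-frame conditions.

```python
import numpy as np

def toy_interp(seq, c_start, c_end, strength):
    """Hypothetical stand-in for I_theta: nudge a sequence toward the
    linear blend of its two end conditions."""
    n = seq.shape[0]
    w = np.linspace(0.0, 1.0, n).reshape(-1, *([1] * (seq.ndim - 1)))
    target = (1.0 - w) * c_start + w * c_end
    return (1.0 - strength) * seq + strength * target

def alternate_axes(x, keys, num_steps, interp=toy_interp):
    """x: float grid of shape (N, F, ...), updated in place.
    keys: same-shape array whose border rows/columns hold the key frames."""
    N, F = x.shape[:2]
    for step in range(num_steps):
        strength = (step + 1) / num_steps
        # Camera-axis pass: each column x[:, i] is a bullet-time sequence.
        for i in range(F):
            x[:, i] = interp(x[:, i], keys[0, i], keys[N - 1, i], strength)
        # Time-axis pass: each row x[j, :] is a fixed-view video.
        for j in range(N):
            x[j, :] = interp(x[j, :], keys[j, 0], keys[j, F - 1], strength)
    return x
```

Because every interior cell belongs to exactly one column and one row, each step exposes it to conditions from all four edges of the grid, which is how information propagates globally.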

3.3 Details of Conditional Video Diffusion

In this work, we build upon Stable Video Diffusion (SVD) [5], an image-to-video diffusion model that follows the principles of the EDM framework [15]. SVD utilizes an iterative denoising approach based on an Euler step method, which progressively transforms a Gaussian noise sample $x_{T}$ into a clean signal $x_{0}$:

$$x_{t-1}(x_{t};\sigma_{t},c) := \hat{x}_{c}(x_{t}) + \frac{\sigma_{t-1}}{\sigma_{t}}\,\bigl(x_{t}-\hat{x}_{c}(x_{t})\bigr), \tag{8}$$

where the initial noise is $x_{T}\sim\mathcal{N}(0,I)$, $\hat{x}_{c}(x_{t})$ is the denoised estimate obtained via Tweedie's formula from the score function learned by the neural network with parameters $\theta$, and $\sigma_{t}$ is the discretized noise level at timestep $t\in[0,T]$.
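As a minimal sketch of the update in (8), the following code runs the Euler step over a decreasing noise schedule. The `denoiser` argument is a hypothetical stand-in for $\hat{x}_{c}(\cdot)$ (in practice, the conditioned SVD network), and the schedule values in the usage note are illustrative only.

```python
import numpy as np

def euler_step(x_t, x_hat, sigma_t, sigma_prev):
    """Eq. (8): move from noise level sigma_t to sigma_prev (< sigma_t)."""
    return x_hat + (sigma_prev / sigma_t) * (x_t - x_hat)

def sample(denoiser, x_T, sigmas):
    """Run the Euler sampler over a decreasing noise schedule `sigmas`."""
    x = x_T
    for sigma_t, sigma_prev in zip(sigmas[:-1], sigmas[1:]):
        x = euler_step(x, denoiser(x, sigma_t), sigma_t, sigma_prev)
    return x
```

For example, with a toy denoiser that always returns a fixed `target` and a schedule such as `[10.0, 5.0, 1.0, 0.0]`, the last step has $\sigma_{t-1}=0$, so the sampler returns `target` exactly: the update is a linear interpolation between the current sample and the denoised estimate, weighted by the ratio of noise levels.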

Figure 3: Results from Zero4D. Our method generates high-quality multi-view videos from a single input video. By rotating the viewpoint in 15-degree increments, we produce novel-view videos based on the given input. As shown in the figure, the generated videos maintain consistency across multiple views and frames, effectively synthesizing perspectives that were not visible in the original video.

Now, we describe how to modify SVD to enable conditional sampling given the warped image $x_{w}$, the occlusion mask $m$, and the conditioning input $c$. For convenience, we refer to $x_{t}[:,:]$ as $x_{t}$. Starting from the reverse diffusion sampling process in (8), the sampling can be modulated by conditioning on a known scene prior $x_{\text{known}}$, as proposed in [21]:

$$\bar{x}_{c}(x_{t}) = \hat{x}_{c}(x_{t})\cdot m + x_{\text{known}}\cdot(1-m), \tag{9}$$

where $m$ is a mask that determines which parts of the scene are known, guiding the denoising process by preserving the known regions while allowing the diffusion model to inpaint the missing areas. In our approach, rather than relying on an externally defined scene prior $x_{\text{known}}$, we leverage the warped frames $x_{w}$ obtained from depth-based warping as the conditional guidance. Specifically, we redefine the denoising process by replacing $x_{\text{known}}$ with $x_{w}$ and substituting $m$ with the occlusion mask $m_{w}$:

$$\bar{x}_{c}(x_{t}) = \hat{x}_{c}(x_{t})\cdot m_{w} + x_{w}\cdot(1-m_{w}). \tag{10}$$

Here, the occlusion mask $m_{w}$ ensures that the visible regions in $x_{w}$ directly guide the denoising process, while the unseen parts are inpainted using the learned prior. By incorporating this modified formulation into the reverse diffusion process, we obtain the following sampling update:

$$x_{t-1}(x_{t};\sigma_{t},c) \leftarrow \bar{x}_{c}(x_{t}) + \frac{\sigma_{t-1}}{\sigma_{t}}\,\bigl(x_{t}-\hat{x}_{c}(x_{t})\bigr), \tag{11}$$

where the target camera viewpoints influence the generated frames through the depth-warped observations $x_{w}$, ensuring geometric consistency during video synthesis. Throughout the reverse diffusion sampling process, we iteratively apply this procedure. Additionally, following the approach described in [21, 19], we incorporate resampling annealing to further enhance the output quality.
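The masked update in (10) and (11) can be written in a few lines. The sketch below is illustrative and assumes the mask convention read directly from (10): $m_{w}=1$ marks pixels left to the model, and $m_{w}=0$ marks pixels copied from the warped frame. The function name and argument layout are hypothetical, and resampling annealing is not included.

```python
import numpy as np

def guided_euler_step(x_t, x_hat, x_w, m_w, sigma_t, sigma_prev):
    """Eqs. (10)-(11) with the mask convention of Eq. (10):
    m_w == 1 -> keep the model's denoised estimate (region to inpaint),
    m_w == 0 -> use the depth-warped frame x_w (visible region)."""
    x_bar = x_hat * m_w + x_w * (1.0 - m_w)                 # Eq. (10)
    return x_bar + (sigma_prev / sigma_t) * (x_t - x_hat)   # Eq. (11)
```

Two limiting cases help as a sanity check. If $m_{w}=1$ everywhere, then $\bar{x}_{c}=\hat{x}_{c}$ and the step reduces to the unguided update (8). At the final step, where $\sigma_{t-1}=0$, the output equals $x_{w}$ wherever $m_{w}=0$ and the model's estimate elsewhere.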

4 Experiments

We used SVD [5] as the I2V model for all experiments without any training. The image resolution was fixed at 576×1024, with 25 cameras and a sequence length of 25 frames, for a total of $25^{2}=625$ multi-view video frames. All frames were generated to form a multi-view video following the target camera trajectory. For depth-based warping, we utilized off-the-shelf depth models [23, 11] with a view variation of 15 degrees.
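To make the size of the sampling grid concrete, the short calculation below counts the frames and the raw pixel-space storage for one grid. The three-channel, 8-bit storage format is an assumption made only for this illustration; SVD is a latent video diffusion model, so the tensors handled during sampling are smaller than this pixel-space figure.

```python
num_cameras, num_frames = 25, 25
height, width, channels = 576, 1024, 3   # channels/uint8 storage are assumptions

total_frames = num_cameras * num_frames
raw_bytes = total_frames * height * width * channels  # uncompressed uint8 RGB

print(total_frames)        # 625
print(raw_bytes / 2**30)   # roughly 1.03 (GiB)
```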

Figure 4: Baseline comparison. We compare our approach with existing multi-view video generation methods, including SV4D [40] and GCD [29]. These methods take a single-view video as input and generate a novel-view video. Since SV4D and GCD rely on training-based models, their performance degrades when processing videos that deviate from the distribution of their training data, often resulting in blurry or distorted outputs. In contrast, our method operates without additional training, effectively preserving spatio-temporal consistency and producing high-quality multi-view videos.

| Method | FVD ↓ | FID ↓ | LPIPS ↓ | Subject Consistency ↑ | Background Consistency ↑ | Temporal Flickering ↑ | Motion Smoothness ↑ | Dynamic Degree ↑ | Image Quality ↑ | Aesthetic Quality ↑ |
|---|---|---|---|---|---|---|---|---|---|---|
| SV4D [40] | 1502.4 | 200.74 | 0.4002 | 86.46% | 92.71% | 92.04% | 97.78% | 55.00% | 97.78% | 50.36% |
| GCD [29] | 1584.4 | 238.17 | 0.4947 | 91.73% | 95.55% | 98.41% | 98.86% | 20.00% | 44.55% | 42.67% |
| Ours | 896.29 | 135.08 | 0.3997 | 94.18% | 94.53% | 97.35% | 98.13% | 55.77% | 50.08% | 62.09% |

Table 1: Quantitative result. We evaluate our method against SV4D and GCD on FVD, FID, LPIPS, and VBench [13], comparing multi-view video results based on novel view generation from a fixed camera viewpoint. Our method achieves the best performance in both frame consistency across videos and image quality of individual frames.

Baseline Models

We selected publicly available methods as baselines. SV4D [40] is an image-to-video (I2V) model that generates multiple novel-view videos from a single input, constructing a 4D image matrix of the object. It synthesizes target-view videos by applying consistent azimuthal displacement, producing 21-frame outputs at $512\times 512$ resolution. GCD [29] also takes a single video as input, allowing the selection of specific azimuth and elevation angles to generate novel views of dynamic 4D scenes. It produces 25-frame videos at $384\times 256$ resolution. All experiments were conducted on an RTX 4090 GPU (24GB VRAM). While both SV4D and GCD require extensive training, which is impractical on a single GPU, our approach eliminates the need for training, enabling efficient execution on a single RTX 4090.

Figure 5: Ablation study. Given a single input video, our model generates high-quality multi-view videos. When multi-view video generation is performed without Spatio-Temporal Bidirectional Interpolation (STBI), each frame is synthesized independently, leading to inconsistencies across frames (see the red box). In contrast, STBI aggregates global information, enabling the generation of multi-view videos with improved frame consistency.

Multi-View Video

To validate our method, we conducted experiments on 10 DAVIS [24] and 20 Pexels scenes. Novel views were synthesized by adjusting the target camera's azimuth in 15-degree increments, following the evaluation protocol in [29]. For fair comparison, videos from our method and SV4D were resized to $384\times 256$. To assess video quality, we utilized VBench [13], which evaluates seven key aspects of video generation, including subject identity retention, motion coherence, and temporal consistency. It incorporates feature extractors such as DINO [6] for consistency and MUSIQ [16] for image quality, offering a detailed performance analysis. We further compared the ability of each baseline model to generate novel view videos given a single input video. As shown in Figure 4, our method demonstrates robust novel view synthesis, whereas SV4D and GCD exhibit reduced object consistency and distorted motion. This observation is further supported by the quantitative results in Table 1, where our method outperforms the baselines in FVD, FID, LPIPS, and Subject Consistency, demonstrating superior overall video generation quality.

Bullet-Time and Camera Control

We evaluate the generated multi-view video by controlling the camera at fixed time points within the video. To further assess 3D consistency along the camera axis at a fixed time, we utilize MEt3R [1], a multi-view consistency metric that operates on unposed images and uses DUSt3R [35] as its backbone. Because no open-source multi-view video generation model currently supports bullet-time generation explicitly, we evaluate this setting through an ablation study. To comprehensively evaluate bullet-time video quality, we also employed FVD, FID, and LPIPS. By incorporating both spatio-temporal bidirectional interpolation (STBI) and warping, we achieved well-synchronized multi-view videos across both camera and time axes. As shown in Table 3, our full method achieves the best FVD, FID, and LPIPS. When STBI was not used, each frame underwent novel view synthesis independently. While this preserved per-frame 3D consistency (the MEt3R score without STBI is slightly better than that of the full method), the absence of temporal coherence led to inferior FVD performance compared to our full method, which maintains global consistency across the entire video.

| Method | FVD ↓ | FID ↓ | LPIPS ↓ | Subject Consistency ↑ | Background Consistency ↑ | Temporal Flickering ↑ | Motion Smoothness ↑ | Dynamic Degree ↑ | Image Quality ↑ | Aesthetic Quality ↑ |
|---|---|---|---|---|---|---|---|---|---|---|
| Ours | 896.29 | 135.08 | 0.3997 | 94.18% | 94.53% | 97.35% | 98.13% | 55.77% | 50.08% | 62.09% |
| w/o STBI | 1419.8 | 108.92 | 0.3797 | 92.57% | 93.59% | 94.49% | 94.49% | 100% | 52.93% | 67.57% |
| w/o warp | 1830.5 | 205.08 | 0.4547 | 92.91% | 93.45% | 97.48% | 98.01% | 48.28% | 44.97% | 47.26% |

Table 2: Fixed-view video quantitative ablation. We performed ablation studies on videos synthesized from fixed novel viewpoints, showing that our method achieves optimal performance when all components are incorporated.

| Method | FVD ↓ | FID ↓ | LPIPS ↓ | MEt3R ↓ |
|---|---|---|---|---|
| Ours | 1470 | 106.31 | 0.3174 | 0.039 |
| w/o STBI | 1607 | 108.21 | 0.3181 | 0.031 |
| w/o warp | 1923 | 179.99 | 0.3617 | 0.082 |

Table 3: Bullet-time video quantitative ablation. We conducted ablation experiments to evaluate how well videos generated at fixed time points maintain quality. The results confirm that our method achieves the best performance when fully utilized.

Ablation

We conducted comparative experiments under the following conditions: (1) Without warped frame guidance (see (10)). In this setting, we excluded key frame generation and removed warped frames from the input during camera-time interpolation. As shown in Table 2, conditioning the diffusion model solely on keyframes results in less distinct and less detailed images. This effect is quantitatively evident in the table, where the absence of warped frame guidance leads to degraded performance across key metrics. (2) Without spatio-temporal bidirectional interpolation (STBI). We also examined cases where STBI was omitted, generating each novel view independently per frame. Without global aggregation along the time axis, this approach fails to maintain multi-view coherence. In contrast, our method incorporates key frame generation to capture global scene information, enabling bidirectional frame synthesis. As reflected in Table 3, this results in better synchronization and improved temporal consistency, highlighting the role of STBI in preserving video coherence.

User Study

| Method | View Angle | General Quality | Smoothness | BG Quality |
|---|---|---|---|---|
| Ours | 67% | 70% | 60% | 68% |
| GCD | 22% | 21% | 27% | 22% |
| SV4D | 11% | 9% | 13% | 10% |

Table 4: User study. Winning rates across four key evaluation metrics. Our method consistently outperforms the baselines, particularly in General Quality and Background Quality.

To evaluate our approach, we conducted a user study comparing our method with GCD and SV4D across four key metrics: View Angle, General Quality, Smoothness, and Background Quality. Participants viewed the generated videos and selected the most visually appealing results. As shown in Table 4, our method consistently achieved the highest user preference, particularly in General Quality (70%) and Background Quality (68%), highlighting superior fidelity and scene preservation. The View Angle result (67%) indicates accurate novel view synthesis, while the Smoothness result (60%) indicates seamless transitions with minimal distortion.

5 Conclusion

In this work, we introduced a novel training-free approach for synchronized multi-view video generation using an off-the-shelf video diffusion model. Our method generates high-quality 4D video grids through depth-based warping and spatio-temporal bidirectional interpolation, ensuring structural consistency across spatial and temporal domains. Unlike prior methods requiring extensive training on large-scale 4D datasets, our framework achieves competitive performance without any training. Extensive experiments show that our approach generates synchronized multi-view videos with superior subject consistency, motion smoothness, and temporal stability. It outperforms baselines in aesthetic quality and imaging fidelity while remaining computationally efficient. Our framework offers a practical solution for multi-view video generation, especially when large-scale 4D datasets and computational resources are scarce. Future work may explore dynamic scenes or adaptive interpolation strategies to enhance novel view synthesis.

References

  • Asim et al. [2024] Mohammad Asim, Christopher Wewer, Thomas Wimmer, Bernt Schiele, and Jan Eric Lenssen. Met3r: Measuring multi-view consistency in generated images, 2024.
  • Bahmani et al. [2024a] Sherwin Bahmani, Xian Liu, Wang Yifan, Ivan Skorokhodov, Victor Rong, Ziwei Liu, Xihui Liu, Jeong Joon Park, Sergey Tulyakov, Gordon Wetzstein, et al. Tc4d: Trajectory-conditioned text-to-4d generation. In European Conference on Computer Vision, pages 53–72. Springer, 2024a.
  • Bahmani et al. [2024b] Sherwin Bahmani, Ivan Skorokhodov, Guocheng Qian, Aliaksandr Siarohin, Willi Menapace, Andrea Tagliasacchi, David B Lindell, and Sergey Tulyakov. Ac3d: Analyzing and improving 3d camera control in video diffusion transformers. arXiv preprint arXiv:2411.18673, 2024b.
  • Bahmani et al. [2024c] Sherwin Bahmani, Ivan Skorokhodov, Victor Rong, Gordon Wetzstein, Leonidas Guibas, Peter Wonka, Sergey Tulyakov, Jeong Joon Park, Andrea Tagliasacchi, and David B Lindell. 4d-fy: Text-to-4d generation using hybrid score distillation sampling. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 7996–8006, 2024c.
  • Blattmann et al. [2023] Andreas Blattmann, Tim Dockhorn, Sumith Kulal, Daniel Mendelevitch, Maciej Kilian, Dominik Lorenz, Yam Levi, Zion English, Vikram Voleti, Adam Letts, et al. Stable video diffusion: Scaling latent video diffusion models to large datasets. arXiv preprint arXiv:2311.15127, 2023.
  • Caron et al. [2021] Mathilde Caron, Hugo Touvron, Ishan Misra, Hervé Jégou, Julien Mairal, Piotr Bojanowski, and Armand Joulin. Emerging properties in self-supervised vision transformers. In Proceedings of the IEEE/CVF international conference on computer vision, pages 9650–9660, 2021.
  • Gao et al. [2024] Ruiqi Gao, Aleksander Holynski, Philipp Henzler, Arthur Brussee, Ricardo Martin-Brualla, Pratul Srinivasan, Jonathan T Barron, and Ben Poole. Cat3d: Create anything in 3d with multi-view diffusion models. arXiv preprint arXiv:2405.10314, 2024.
  • He et al. [2024] Hao He, Yinghao Xu, Yuwei Guo, Gordon Wetzstein, Bo Dai, Hongsheng Li, and Ceyuan Yang. Cameractrl: Enabling camera control for text-to-video generation. arXiv preprint arXiv:2404.02101, 2024.
  • Ho et al. [2020] Jonathan Ho, Ajay Jain, and P. Abbeel. Denoising diffusion probabilistic models. ArXiv, abs/2006.11239, 2020.
  • Hong et al. [2022] Wenyi Hong, Ming Ding, Wendi Zheng, Xinghan Liu, and Jie Tang. Cogvideo: Large-scale pretraining for text-to-video generation via transformers. arXiv preprint arXiv:2205.15868, 2022.
  • Hu et al. [2024] Wenbo Hu, Xiangjun Gao, Xiaoyu Li, Sijie Zhao, Xiaodong Cun, Yong Zhang, Long Quan, and Ying Shan. Depthcrafter: Generating consistent long depth sequences for open-world videos. arXiv preprint arXiv:2409.02095, 2024.
  • Huang et al. [2023] Yi-Hua Huang, Yang-Tian Sun, Ziyi Yang, Xiaoyang Lyu, Yan-Pei Cao, and Xiaojuan Qi. Sc-gs: Sparse-controlled gaussian splatting for editable dynamic scenes. arXiv preprint arXiv:2312.14937, 2023.
  • Huang et al. [2024] Ziqi Huang, Yinan He, Jiashuo Yu, Fan Zhang, Chenyang Si, Yuming Jiang, Yuanhan Zhang, Tianxing Wu, Qingyang Jin, Nattapol Chanpaisit, et al. Vbench: Comprehensive benchmark suite for video generative models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 21807–21818, 2024.
  • Kant et al. [2024] Yash Kant, Aliaksandr Siarohin, Ziyi Wu, Michael Vasilkovsky, Guocheng Qian, Jian Ren, Riza Alp Guler, Bernard Ghanem, Sergey Tulyakov, and Igor Gilitschenski. Spad: Spatially aware multi-view diffusers. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 10026–10038, 2024.
  • Karras et al. [2022] Tero Karras, Miika Aittala, Timo Aila, and Samuli Laine. Elucidating the design space of diffusion-based generative models. Advances in neural information processing systems, 35:26565–26577, 2022.
  • Ke et al. [2021] Junjie Ke, Qifei Wang, Yilin Wang, Peyman Milanfar, and Feng Yang. Musiq: Multi-scale image quality transformer. In Proceedings of the IEEE/CVF international conference on computer vision, pages 5148–5157, 2021.
  • Kerbl et al. [2023] Bernhard Kerbl, Georgios Kopanas, Thomas Leimkühler, and George Drettakis. 3d gaussian splatting for real-time radiance field rendering. ACM Trans. Graph., 42(4):139–1, 2023.
  • Li et al. [2022] Tianye Li, Mira Slavcheva, Michael Zollhoefer, Simon Green, Christoph Lassner, Changil Kim, Tanner Schmidt, Steven Lovegrove, Michael Goesele, Richard Newcombe, et al. Neural 3d video synthesis from multi-view video. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 5521–5531, 2022.
  • Liu et al. [2024] Kunhao Liu, Ling Shao, and Shijian Lu. Novel view extrapolation with video diffusion priors. arXiv preprint arXiv:2411.14208, 2024.
  • Liu et al. [2023] Ruoshi Liu, Rundi Wu, Basile Van Hoorick, Pavel Tokmakov, Sergey Zakharov, and Carl Vondrick. Zero-1-to-3: Zero-shot one image to 3d object. In Proceedings of the IEEE/CVF international conference on computer vision, pages 9298–9309, 2023.
  • Lugmayr et al. [2022] Andreas Lugmayr, Martin Danelljan, Andres Romero, Fisher Yu, Radu Timofte, and Luc Van Gool. Repaint: Inpainting using denoising diffusion probabilistic models. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 11461–11471, 2022.
  • Melas-Kyriazi et al. [2024] Luke Melas-Kyriazi, Iro Laina, Christian Rupprecht, Natalia Neverova, Andrea Vedaldi, Oran Gafni, and Filippos Kokkinos. Im-3d: Iterative multiview diffusion and reconstruction for high-quality 3d generation. arXiv preprint arXiv:2402.08682, 2024.
  • Piccinelli et al. [2024] Luigi Piccinelli, Yung-Hsu Yang, Christos Sakaridis, Mattia Segu, Siyuan Li, Luc Van Gool, and Fisher Yu. Unidepth: Universal monocular metric depth estimation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 10106–10116, 2024.
  • Pont-Tuset et al. [2017] Jordi Pont-Tuset, Federico Perazzi, Sergi Caelles, Pablo Arbeláez, Alex Sorkine-Hornung, and Luc Van Gool. The 2017 davis challenge on video object segmentation. arXiv preprint arXiv:1704.00675, 2017.
  • Rombach et al. [2021] Robin Rombach, A. Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. High-resolution image synthesis with latent diffusion models. 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 10674–10685, 2021.
  • Shi et al. [2023] Yichun Shi, Peng Wang, Jianglong Ye, Mai Long, Kejie Li, and Xiao Yang. Mvdream: Multi-view diffusion for 3d generation. arXiv preprint arXiv:2308.16512, 2023.
  • Singer et al. [2023] Uriel Singer, Shelly Sheynin, Adam Polyak, Oron Ashual, Iurii Makarov, Filippos Kokkinos, Naman Goyal, Andrea Vedaldi, Devi Parikh, Justin Johnson, et al. Text-to-4d dynamic scene generation. arXiv preprint arXiv:2301.11280, 2023.
  • Sun et al. [2024] Wenqiang Sun, Shuo Chen, Fangfu Liu, Zilong Chen, Yueqi Duan, Jun Zhang, and Yikai Wang. Dimensionx: Create any 3d and 4d scenes from a single image with controllable video diffusion. arXiv preprint arXiv:2411.04928, 2024.
  • Van Hoorick et al. [2024a] Basile Van Hoorick, Rundi Wu, Ege Ozguroglu, Kyle Sargent, Ruoshi Liu, Pavel Tokmakov, Achal Dave, Changxi Zheng, and Carl Vondrick. Generative camera dolly: Extreme monocular dynamic novel view synthesis. In European Conference on Computer Vision, pages 313–331. Springer, 2024a.
  • Van Hoorick et al. [2024b] Basile Van Hoorick, Rundi Wu, Ege Ozguroglu, Kyle Sargent, Ruoshi Liu, Pavel Tokmakov, Achal Dave, Changxi Zheng, and Carl Vondrick. Generative camera dolly: Extreme monocular dynamic novel view synthesis. In European Conference on Computer Vision, pages 313–331. Springer, 2024b.
  • Wang et al. [2024a] Chaoyang Wang, Peiye Zhuang, Tuan Duc Ngo, Willi Menapace, Aliaksandr Siarohin, Michael Vasilkovsky, Ivan Skorokhodov, Sergey Tulyakov, Peter Wonka, and Hsin-Ying Lee. 4real-video: Learning generalizable photo-realistic 4d video diffusion. arXiv preprint arXiv:2412.04462, 2024a.
  • Wang and Shi [2023] Peng Wang and Yichun Shi. Imagedream: Image-prompt multi-view diffusion for 3d generation. arXiv preprint arXiv:2312.02201, 2023.
  • Wang et al. [2024b] Qianqian Wang, Vickie Ye, Hang Gao, Jake Austin, Zhengqi Li, and Angjoo Kanazawa. Shape of motion: 4d reconstruction from a single video. arXiv preprint arXiv:2407.13764, 2024b.
  • Wang et al. [2025] Qianqian Wang, Yifei Zhang, Aleksander Holynski, Alexei A. Efros, and Angjoo Kanazawa. Continuous 3d perception model with persistent state, 2025.
  • Wang et al. [2024c] Shuzhe Wang, Vincent Leroy, Yohann Cabon, Boris Chidlovskii, and Jerome Revaud. Dust3r: Geometric 3d vision made easy. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 20697–20709, 2024c.
  • Wang et al. [2024d] Zhouxia Wang, Ziyang Yuan, Xintao Wang, Yaowei Li, Tianshui Chen, Menghan Xia, Ping Luo, and Ying Shan. Motionctrl: A unified and flexible motion controller for video generation. In ACM SIGGRAPH 2024 Conference Papers, pages 1–11, 2024d.
  • Watson et al. [2024] Daniel Watson, Saurabh Saxena, Lala Li, Andrea Tagliasacchi, and David J Fleet. Controlling space and time with diffusion models. arXiv preprint arXiv:2407.07860, 2024.
  • Wu et al. [2024a] Guanjun Wu, Taoran Yi, Jiemin Fang, Lingxi Xie, Xiaopeng Zhang, Wei Wei, Wenyu Liu, Qi Tian, and Xinggang Wang. 4d gaussian splatting for real-time dynamic scene rendering. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 20310–20320, 2024a.
  • Wu et al. [2024b] Rundi Wu, Ruiqi Gao, Ben Poole, Alex Trevithick, Changxi Zheng, Jonathan T Barron, and Aleksander Holynski. Cat4d: Create anything in 4d with multi-view video diffusion models. arXiv preprint arXiv:2411.18613, 2024b.
  • Xie et al. [2024] Yiming Xie, Chun-Han Yao, Vikram Voleti, Huaizu Jiang, and Varun Jampani. Sv4d: Dynamic 3d content generation with multi-frame and multi-view consistency. arXiv preprint arXiv:2407.17470, 2024.
  • Xu et al. [2024] Dejia Xu, Weili Nie, Chao Liu, Sifei Liu, Jan Kautz, Zhangyang Wang, and Arash Vahdat. Camco: Camera-controllable 3d-consistent image-to-video generation. arXiv preprint arXiv:2406.02509, 2024.
  • Yang et al. [2024] Serin Yang, Taesung Kwon, and Jong Chul Ye. Vibidsampler: Enhancing video interpolation using bidirectional diffusion sampler. arXiv preprint arXiv:2410.05651, 2024.
  • Yang et al. [2023a] Ziyi Yang, Xinyu Gao, Wen Zhou, Shaohui Jiao, Yuqing Zhang, and Xiaogang Jin. Deformable 3d gaussians for high-fidelity monocular dynamic scene reconstruction. arXiv preprint arXiv:2309.13101, 2023a.
  • Yang et al. [2023b] Zeyu Yang, Hongye Yang, Zijie Pan, Xiatian Zhu, and Li Zhang. Real-time photorealistic dynamic scene representation and rendering with 4d gaussian splatting. ArXiv, abs/2310.10642, 2023b.
  • You et al. [2024] Meng You, Zhiyu Zhu, Hui Liu, and Junhui Hou. Nvs-solver: Video diffusion model as zero-shot novel view synthesizer. arXiv preprint arXiv:2405.15364, 2024.
  • Yu et al. [2024] Heng Yu, Chaoyang Wang, Peiye Zhuang, Willi Menapace, Aliaksandr Siarohin, Junli Cao, Laszlo A Jeni, Sergey Tulyakov, and Hsin-Ying Lee. 4real: Towards photorealistic 4d scene generation via video diffusion models. arXiv preprint arXiv:2406.07472, 2024.
  • Zeng et al. [2024] Yifei Zeng, Yanqin Jiang, Siyu Zhu, Yuanxun Lu, Youtian Lin, Hao Zhu, Weiming Hu, Xun Cao, and Yao Yao. Stag4d: Spatial-temporal anchored generative 4d gaussians. In European Conference on Computer Vision, pages 163–179. Springer, 2024.
  • Zhang et al. [2024a] David Junhao Zhang, Roni Paiss, Shiran Zada, Nikhil Karnad, David E Jacobs, Yael Pritch, Inbar Mosseri, Mike Zheng Shou, Neal Wadhwa, and Nataniel Ruiz. Recapture: Generative video camera controls for user-provided videos using masked video fine-tuning. arXiv preprint arXiv:2411.05003, 2024a.
  • Zhang et al. [2024b] Junyi Zhang, Charles Herrmann, Junhwa Hur, Varun Jampani, Trevor Darrell, Forrester Cole, Deqing Sun, and Ming-Hsuan Yang. Monst3r: A simple approach for estimating geometry in the presence of motion. arXiv preprint arXiv:2410.03825, 2024b.
  • Zhao et al. [2023] Yuyang Zhao, Zhiwen Yan, Enze Xie, Lanqing Hong, Zhenguo Li, and Gim Hee Lee. Animate124: Animating one image to 4d dynamic scene. arXiv preprint arXiv:2311.14603, 2023.

Supplementary Material

6 Pseudocode implementation of Zero4D

Algorithm 2 Zero4D overall pipeline
1: $x[1,:],\; x_w[:,:],\; m_w[:,:]$
2: $x[:,1] \sim p_\theta\left(x[:,1] \mid x_w[:,1],\, m_w[:,1],\, c[1,1]\right)$
3: $x[N,:] \sim p_\theta\left(x[N,:] \mid x_w[N,:],\, m_w[N,:],\, c[N,1]\right)$
4: $x_t[:,F] \sim \mathcal{N}(0, I)$
5: for $t = T$ to $1$ do
6:     $x_t[:,F] \leftarrow I_\theta\left(x_t[:,F],\, \sigma_t,\, c[N,1],\, c[N,F],\, x_w[:,F]\right)$
7: end for ▷ Finishing keyframe generation
8: $c[1,:],\, c[:,1] \leftarrow \text{Encode}\left(x[1,:],\, x[:,1]\right)$
9: $c[:,F],\, c[N,:] \leftarrow \text{Encode}\left(x[:,F],\, x[N,:]\right)$
10: $x_T[:,:] \sim \mathcal{N}(0, I)$
11: for $t = T$ to $1$ do ▷ Camera-time BiDi interpolation
12:     for $i = 1$ to $n$ do ▷ Camera axis interpolation
13:         $x_{t-1}[:,i] \leftarrow I_\theta\left(x_t[:,i],\, \sigma_t,\, c[1,i],\, c[N,i],\, x_w[:,i]\right)$
14:         $x_t[:,i] \leftarrow x_{t-1}[:,i] + \sqrt{\sigma_t^2 - \sigma_{t-1}^2}\,\epsilon$ ▷ Renoise
15:     end for
16:     for $j = 1$ to $m$ do ▷ Time axis interpolation
17:         $x_t[j,:] \leftarrow I_\theta\left(x_t[j,:],\, \sigma_t,\, c[j,1],\, c[j,F],\, x_w[j,:]\right)$
18:     end for
19: end for
20: return $x_0[:,:]$
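The bidirectional interpolation loop (steps 10 to 20 of Algorithm 2) can be sketched in NumPy as follows. This is only an illustration under stated assumptions, not the authors' implementation: `interp_denoise` is a hypothetical stand-in for one step of the interpolation model $I_\theta$ (it simply pulls a sequence toward a linear blend of its two end frames and ignores the warped frames), the key frames are used directly in place of the encoded conditions $c$, and the loop bounds $n$ and $m$ are assumed to equal $F$ and $N$.

```python
import numpy as np

def interp_denoise(x, sigma_t, sigma_next, c_start, c_end, x_w):
    """Hypothetical stand-in for one step of I_theta (not a real model).

    Predicts a linear blend of the two end frames as the clean sequence,
    then moves x from noise level sigma_t to sigma_next. x_w is unused here.
    """
    n = x.shape[0]
    w = np.linspace(0.0, 1.0, n).reshape(-1, *([1] * (x.ndim - 1)))
    x0_hat = (1.0 - w) * c_start + w * c_end
    return x0_hat + (sigma_next / sigma_t) * (x - x0_hat)

def bidirectional_interpolation(keys, x_w, sigmas, rng):
    """keys: (N, F, D) grid whose border rows and columns hold key frames.
    sigmas: decreasing noise levels, sigma_T first, ending at 0."""
    N, F, D = keys.shape
    x = sigmas[0] * rng.standard_normal((N, F, D))
    for k in range(len(sigmas) - 1):
        s_t, s_next = sigmas[k], sigmas[k + 1]
        for i in range(F):  # camera axis: fix time index i, sweep cameras
            x[:, i] = interp_denoise(x[:, i], s_t, s_next,
                                     keys[0, i], keys[N - 1, i], x_w[:, i])
            # Renoise from sigma_{t-1} back up to sigma_t.
            x[:, i] += np.sqrt(s_t**2 - s_next**2) * rng.standard_normal((N, D))
        for j in range(N):  # time axis: fix camera index j, sweep frames
            x[j, :] = interp_denoise(x[j, :], s_t, s_next,
                                     keys[j, 0], keys[j, F - 1], x_w[j, :])
    return x
```

The renoise step returns each camera-axis result to noise level $\sigma_t$, so the subsequent time-axis pass denoises from the same level and the two passes share one diffusion step.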

Algorithm 2 outlines the process of generating multi-view videos from a single monocular video. We begin by applying novel view synthesis to the first frame of the input video, using an I2V diffusion model [5] to generate the novel-view frames $x[:,1]$. To guide this process, we incorporate warping-based priors from the original video, which enables inpainting-based synthesis. Using an off-the-shelf depth estimation model [23], we warp the original frame to novel viewpoints, as illustrated in Figure 6. As the figure shows, regions occluded by the warp operation appear black, which allows us to extract an opacity mask. Inspired by [21, 45, 19], we adopt a mask inpainting approach in which inpainting is performed on the denoised estimate $\hat{x}_0[:,1]$ of the noisy frame. Rather than applying inpainting at every denoising step, as in [19], we use a re-noising process within the diffusion model's denoising step to refine the final synthesis by reducing artifacts and enhancing structural coherence. Once $x[:,1]$ is generated, we apply the same inpainting and re-noising strategy to synthesize the novel-view frames $x[N,:]$. Furthermore, during the bidirectional interpolation process, we incorporate an additional re-noising step to enhance temporal and spatial consistency.
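As a small illustration of the mask extraction and inpainting blend described above, the sketch below derives a mask from the black pixels of a warped frame and merges a denoised estimate with the warped content. The function names and the threshold `eps` are hypothetical. The mask convention is an assumption chosen to match the blend $\hat{x}_0 \cdot m + x_w \cdot (1 - m)$: $m = 1$ marks occluded pixels that the model must fill in. A valid pixel that happens to be pure black would be misclassified as occluded by this simple test.

```python
import numpy as np

def occlusion_mask(warped, eps=1e-6):
    """warped: (H, W, 3) float image. Returns an (H, W, 1) float mask that is
    1.0 where every channel is (near) zero, i.e. where the warp left a hole."""
    is_black = np.abs(warped).max(axis=-1, keepdims=True) <= eps
    return is_black.astype(warped.dtype)

def blend_with_warp(x0_hat, warped, m):
    """Keep warped content where it is valid (m = 0) and the model's
    denoised estimate where the warp left a hole (m = 1)."""
    return x0_hat * m + warped * (1.0 - m)
```

Because the blend acts on the denoised estimate rather than on the noisy sample, the warped pixels enter the sampler at the clean-image level and are re-noised together with the generated content.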

Algorithm 3 Novel view synthesis method from [19]

Input: warped frames $x_w$, opacity mask $\mathbf{m}$
Output: synthesized video $x_0$

1: $x_T \sim \mathcal{N}(0, 1)$
2: for $t = T, \dots, 1$ do
3:     if $t > T - T^{\text{guide}}$ then
4:         for $r = 1, \dots, R$ do
5:             $\hat{x}_0 = \text{Predict}(x_t)$
6:             if $r \leq R^{\text{guide}}$ then
7:                 $\hat{x}_0 \leftarrow D_\theta(x_t; \sigma_t, c_{x_0})$
8:                 $\bar{x}_0 \leftarrow \hat{x}_0 \cdot m + x_w \cdot (1 - m)$
9:             else
10:                 $\bar{x}_0 = \hat{x}_0$
11:             end if
12:             $x_{t-1} \leftarrow \bar{x}_0 + \frac{\sigma_{t-1}}{\sigma_t}(x_t - \hat{x}_0)$
13:             if $r < R$ then
14:                 $x_t \sim \mathcal{N}(\bar{x}_0, \sigma_t)$
15:             end if
16:         end for
17:     else
18:         $\hat{x}_{t-1} \leftarrow D_\theta(x_t; \sigma_t, c_{x_0})$
19:         $x_{t-1} \leftarrow \bar{x}_0 + \frac{\sigma_{t-1}}{\sigma_t}(x_t - \hat{x}_0)$
20:     end if
21: end for
22: return $x_0$
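The control flow of Algorithm 3 can be sketched in NumPy as below. This is an illustrative reading, not the implementation of [19]: `denoise` is a caller-supplied stand-in for $D_\theta$, the conditioning $c_{x_0}$ is omitted, the initial sample is scaled by $\sigma_T$ (an assumption, since the listing writes $\mathcal{N}(0, 1)$), and the unguided branch is taken to use $\bar{x}_0 = \hat{x}_0$. The argument names `t_guide`, `r_total`, and `r_guide` correspond to $T^{\text{guide}}$, $R$, and $R^{\text{guide}}$.

```python
import numpy as np

def guided_sample(denoise, x_w, m, sigmas, t_guide, r_total, r_guide, rng):
    """sigmas: decreasing noise levels of length T + 1, sigma_T first, ending at 0.
    m: mask broadcastable to x_w, with 1 where the model should fill in."""
    T = len(sigmas) - 1
    x = sigmas[0] * rng.standard_normal(x_w.shape)
    for k in range(T):
        t = T - k
        s_t, s_prev = sigmas[k], sigmas[k + 1]
        guided = t > T - t_guide
        repeats = r_total if guided else 1
        for r in range(1, repeats + 1):
            x0_hat = denoise(x, s_t)              # denoised estimate
            if guided and r <= r_guide:
                # Replace visible pixels with the warped frame.
                x0_bar = x0_hat * m + x_w * (1.0 - m)
            else:
                x0_bar = x0_hat
            x_next = x0_bar + (s_prev / s_t) * (x - x0_hat)
            if r < repeats:
                # Re-noise the blended estimate back to level sigma_t.
                x = x0_bar + s_t * rng.standard_normal(x.shape)
        x = x_next
    return x
```

For a quick check, any function of the form `denoise(x, sigma)` that returns an array of the same shape can be passed in, for example `lambda x, s: x / (1.0 + s**2)`.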
Figure 6: Input Video Warping. Given a single video, we utilize an off-the-shelf depth estimation model to generate warped frames from novel viewpoints.
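A minimal sketch of depth-based warping is given below, assuming a pinhole camera with intrinsics `K` and a relative pose `(R, t)` that maps points from the source camera frame to the target camera frame. The function name, the nearest-pixel splatting, and the z-buffer rule are illustrative choices, not details taken from the paper. Pixels that receive no source point stay black and are reported in the returned hole mask.

```python
import numpy as np

def warp_to_novel_view(image, depth, K, R, t):
    """image: (H, W, 3) float, depth: (H, W) positive depths,
    K: (3, 3) intrinsics, R: (3, 3) rotation, t: (3,) translation.
    Returns (warped image, boolean hole mask of shape (H, W))."""
    H, W = depth.shape
    v, u = np.mgrid[0:H, 0:W]
    pix = np.stack([u, v, np.ones_like(u)], axis=-1).reshape(-1, 3).astype(np.float64)
    pts = (np.linalg.inv(K) @ pix.T) * depth.reshape(1, -1)  # unproject to 3D
    pts_new = R @ pts + t.reshape(3, 1)                      # move to target frame
    colors = image.reshape(-1, 3)

    front = pts_new[2] > 1e-6                                # drop points behind camera
    proj = (K @ pts_new)[:, front]
    z = proj[2]
    u2 = np.round(proj[0] / z).astype(int)
    v2 = np.round(proj[1] / z).astype(int)
    colors = colors[front]

    inside = (u2 >= 0) & (u2 < W) & (v2 >= 0) & (v2 < H)
    u2, v2, z, colors = u2[inside], v2[inside], z[inside], colors[inside]

    # Z-buffer: for each target pixel keep only the nearest source point.
    flat = v2 * W + u2
    order = np.lexsort((z, flat))            # sort by pixel, then by depth
    first = np.ones(order.size, dtype=bool)
    first[1:] = flat[order][1:] != flat[order][:-1]
    sel = order[first]

    warped = np.zeros_like(image)
    hole = np.ones((H, W), dtype=bool)
    warped[v2[sel], u2[sel]] = colors[sel]
    hole[v2[sel], u2[sel]] = False
    return warped, hole
```

With an identity pose the function returns the input frame unchanged. A sideways translation shifts the content and leaves a black band of disoccluded pixels along one edge, which is the kind of region the hole mask marks for inpainting.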