When Video Coding Meets Multimodal Large Language Models: A Unified Paradigm for Video Coding

Pingping Zhang1, Jinlong Li2, Kecheng Chen1, Meng Wang3, Long Xu4, Haoliang Li1, Nicu Sebe2, Sam Kwong3, Shiqi Wang1
Abstract

Traditional video compression methods perform well at high bitrates but struggle to preserve fine-grained semantic information at low bitrates. With the recent rise of Multimodal Large Language Models (MLLMs), cross-modal compression techniques offer a promising way to improve video compression under low-bitrate conditions. In this paper, we propose a unified Cross-Modality Video Coding (CMVC) framework that integrates multimodal representations and video generative models. The encoder disentangles a video into spatial and temporal components, which are mapped to compact cross-modal representations using MLLMs. During decoding, different encoding-decoding modes are employed to meet different reconstruction requirements, including Text-Text-to-Video (TT2V) for semantic preservation and Image-Text-to-Video (IT2V) for perceptual consistency. Additionally, we design an efficient frame interpolation model using Low-Rank Adaptation (LoRA) to improve perceptual quality. Experiments show that TT2V achieves effective semantic reconstruction, while IT2V ensures competitive perceptual consistency. These findings suggest the potential of leveraging multimodal priors to improve video compression and point to promising directions for future research.

Introduction

Traditional video compression (Bhaskaran and Konstantinides 1997; Bross et al. 2021; Ma et al. 2019) has primarily focused on signal-level reconstruction, optimizing components such as pixel intensities and motion vectors during encoding and decoding. This approach has yielded substantial advances in high-bitrate scenarios, exemplified by codecs such as VVC (Bross et al. 2021) and DCVC (Li, Li, and Lu 2021), which deliver remarkable performance. However, these conventional methods often encounter limitations in low-bitrate scenarios, particularly in preserving fine-grained semantic information, which is crucial for applications such as disaster response and remote surveillance. In such contexts, key features in the reconstructed video are prone to loss or blurring.

To address these shortcomings at low bitrates, cross-modal compression (CMC) methods have emerged (Li et al. 2021; Zhang et al. 2023; Gao et al. 2024), which harness the complementary strengths of multiple modalities to enhance image representations under constrained bitrate conditions. Notable approaches, such as VR-CMC (Li et al. 2021), SCMC (Zhang et al. 2023), MCM (Gao et al. 2024), and CMC-Bench (Li et al. 2024), transform images into text (I2T) and then generate semantically similar content from it. This transformation not only preserves crucial semantic information but also significantly reduces storage requirements, resulting in higher compression ratios than conventional image representations. However, the potential of cross-modal representations for video compression, particularly in maintaining both semantic and perceptual quality at low bitrates, remains relatively underexplored.

Unlike image compression, which is concerned primarily with spatial information, video compression additionally involves the representation of temporal dynamics, which significantly increases its complexity. In this regard, Multimodal Large Language Models (MLLMs) (Han et al. 2022; Ruan et al. 2023) offer distinct advantages due to their inherent capacity to analyze and interpret temporal relationships within video content. Recent studies (Lin et al. 2023; Zhang, Li, and Bing 2023; Fei et al. 2024) have demonstrated that MLLMs excel at processing sequential data and capturing dependencies across temporal events. By leveraging these capabilities, MLLMs can generate compact and semantically rich textual representations of video content, facilitating efficient compression while preserving high-quality reconstruction, even in low-bitrate settings. As such, MLLMs present a promising paradigm for advancing video compression, offering an effective balance between encoding efficiency and semantic preservation.

Figure 1: The framework of the proposed CMVC scheme. The framework first segments the video into distinct clips using a keyframe selection strategy (a), allowing the extraction of both spatial and temporal components from each video segment. Subsequently, MLLMs are employed to generate multimodal representations of these components. For instance, spatial information can be represented through text or images, while temporal dynamics may be encoded using text or audio modalities. These multimodal representations are then encoded via their respective encoders, resulting in compressed bitstreams for each component. The bitstreams corresponding to different components are then combined and transmitted to the decoder. In the decoder, we provide two exemplary modes for video generation, the TT2V (c) and IT2V (d) modes. The framework integrates various SoTA models and mode conversions while maintaining semantic and perceptual quality at relatively high compression ratios.

Motivated by the advantages of MLLMs, we propose a new Cross-Modality Video Coding (CMVC) scheme that optimizes the representation of both spatial and temporal components, thereby enabling high-fidelity semantic and perceptual reconstruction at low bitrates. Capitalizing on the rich potential of multimodal representations, this framework supports diverse encoding-decoding modes tailored to specific reconstruction requirements. We propose two exemplary modes: the TT2V (Text-Text-to-Video) mode for semantic reconstruction at ultra-low bitrate (ULB) and the IT2V (Image-Text-to-Video) mode for perceptual reconstruction at extremely low bitrate (ELB). In the TT2V mode, inspired by the workflow of Cross-Modality Image Compression (CMIC) (Li et al. 2021; Zhang et al. 2023; Li et al. 2024) and the strong generation capability of existing TT2V models, we first extract representative text using our selection strategy, encoding video content as the spatial component and motion as the temporal component. A video generation model then reconstructs the corresponding video from the text inputs. The rationale is that compact yet effective text representations from the encoder encapsulate the semantic details that the decoder needs for high-quality semantic reconstruction. In contrast, the IT2V mode is designed to enhance perceptual reconstruction, since images provide richer visual context than text and thus benefit perceptual consistency. This is achieved by sending the decoder text representations similar to those of the TT2V mode together with additional keyframes selected at the encoder.
To further improve perceptual smoothness across consecutive frames, we tailor an efficient frame-interpolation model, tuned via Low-Rank Adaptation (LoRA), that exploits the semantic cues and visual context of both the input text and the keyframes to achieve high perceptual consistency in the reconstructed video. This paradigm accommodates diverse modality representations within video coding by tapping into foundational MLLMs and video generation models, which may inform future work on video coding. The contributions of our work are as follows:

  • To the best of our knowledge, the proposed unified paradigm for CMVC is the first to leverage foundational MLLMs and video generation models for video coding.

  • We design multiple encoding-decoding modes that trade off reconstruction quality according to the specified decoding requirements, including the TT2V mode, which ensures high-quality semantic information, and the IT2V mode, which achieves strong perceptual consistency.

  • Extensive experiments demonstrate that the proposed CMVC pipeline obtains competitive video reconstructions on HEVC Class B, C, D, E, UVG and MCL-JCV benchmarks while maintaining high compression ratios.

Related works

Video Generation Models

Recently, video generation has emerged as an increasingly active topic, with numerous studies (Ho et al. 2022; Singer et al. 2023; Ge et al. 2023; Blattmann et al. 2023; Chen et al. 2024a; Wang et al. 2023a) showing promising advances that enable generative models to simulate real-world principles. These include various approaches such as text-to-video (T2V) (Wang et al. 2023a, b), image-to-video (I2V) (Chen et al. 2024b; Esser et al. 2023; Yin et al. 2023), and IT2V (Zhang et al. 2024a), among others. T2V technology is designed to convert descriptive text into corresponding videos (Lin et al. 2023; Wang et al. 2023a; Zhang, Li, and Bing 2023). One of the primary challenges in this field is to understand the intricate semantics of the input text and translate it into dynamic visual content that follows real-world physics. To achieve high quality in the generated videos, these models are trained on large-scale text-video corpora for better alignment. I2V generative models typically include methods such as video interpolation and image-driven video diffusion. Image-driven video diffusion models (Voleti et al. 2024; Chai et al. 2023; Ouyang et al. 2024) require a reference image to steer the generative model toward the corresponding video. Compared to image-driven video diffusion models, video interpolation techniques (Huang et al. 2022; Li et al. 2023) better maintain consistency in both resolution and the motion of moving objects across consecutive frames, which suggests great potential for image-driven video generation. In contrast to I2V models, IT2V generative models incorporate textual guidance to enhance video generation. For instance, DiffMorpher (Zhang et al. 2024a) adds visual descriptions for images and adopts latent interpolation adaptation training to produce smooth transformations. Building upon this concept, we can also enhance video generation by incorporating temporal descriptions.

Cross-Modality Compression

Multimodal generation has been effectively applied in the field of compression (Lu et al. 2022). To preserve semantic communication at ELB, CMC (Li et al. 2021) integrates the I2T translation model with the T2I (Text-to-Image) generation model. Based on this foundation, SCMC (Zhang et al. 2023) introduces a scalable cross-modality compression paradigm that hierarchically represents images across different modalities, thereby enhancing both semantic and signal-level fidelity. Subsequently, VR-CMC (Gao et al. 2024), a variable-rate cross-modal compression technique, employs variable-rate prompts to capture data at varying levels of granularity. Additionally, a CMC benchmark has been established for image compression (Li et al. 2024). These models demonstrate that the integration of I2T and T2I methodologies has outperformed the most advanced visual signal codecs. Despite these advancements, there remains limited research focused on cross-modality video coding.

CMVC Scheme

Overview

We propose a CMVC paradigm for efficient video compression with high semantic and perceptual quality, especially at low bitrates, as illustrated in Fig. 1. Given a video $V = \{v_i\}$, where $v_i$ denotes the video frames and $i \in \{1, \dots, N\}$, which consists of spatial (keyframe) and temporal (motion) components, the goal is to compress these components into compact multimodal representations. We leverage MLLMs, specifically V2T models, to map both keyframes and motion into textual representations. These multimodal representations are then compressed using dedicated encoders, yielding compressed representations of the keyframes and motion. The video is reconstructed by a decoder operating in one of two modes: TT2V, which prioritizes semantic consistency, and IT2V, which focuses on perceptual quality. This approach enables high compression ratios while preserving both semantic information and perceptual quality.

CMVC Encoder

Keyframe selection strategy. Keyframes divide a full-length video sequence into clips. Let $n$ denote the number of keyframes, which allows us to extract $n-1$ clips from the video, with the first and last frames initially designated as keyframes. The first frame is encoded using the CLIP encoder (Radford et al. 2021) to extract a high-level feature vector $c_k$ containing concise semantic information. We calculate the cosine similarity between the first frame and subsequent frames as follows:

$$\mathcal{D}_c = \frac{c_k \cdot c_{k+i}}{\left\|c_k\right\| \cdot \left\|c_{k+i}\right\|} = \frac{\sum_{j=1}^{m} c_{k,j} \cdot c_{k+i,j}}{\sqrt{\sum_{j=1}^{m} \left(c_{k,j}\right)^2 \cdot \sum_{j=1}^{m} \left(c_{k+i,j}\right)^2}}, \tag{1}$$

where $c_{k+i}$ is the feature vector extracted from a subsequent frame and $m$ is the number of components of the vectors $c_k$ and $c_{k+i}$. Within a uniform interval, we select the frame with the smallest similarity to the previous keyframe to form a set of keyframes that better captures significant motion. Then, $c_k$ is replaced by the features of the newly selected keyframe, so that the reference is updated dynamically. This process is repeated for subsequent clips, systematically identifying representative keyframes.
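The selection rule built on Eq. (1) can be sketched in Python with NumPy. This is a minimal sketch under stated assumptions, not the implementation used in the paper: the per-frame CLIP features are assumed to be already extracted into an array `features` of shape (N, m), the windowing scheme is one plausible reading of "uniform interval", and the names `cosine_similarity`, `select_keyframes`, and `interval` are hypothetical.

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine similarity of Eq. (1) between two feature vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def select_keyframes(features, interval):
    """In each window of `interval` frames, pick the frame least similar
    to the most recently selected keyframe (assumed windowing scheme)."""
    num_frames = len(features)
    keyframes = [0]                        # the first frame is a keyframe
    reference = features[0]
    for start in range(1, num_frames - 1, interval):
        stop = min(start + interval, num_frames - 1)
        sims = [cosine_similarity(reference, features[i])
                for i in range(start, stop)]
        best = start + int(np.argmin(sims))
        keyframes.append(best)
        reference = features[best]         # dynamic update of c_k
    if keyframes[-1] != num_frames - 1:
        keyframes.append(num_frames - 1)   # the last frame is a keyframe
    return keyframes
```

With this reading, a larger `interval` yields fewer keyframes and therefore fewer clips, which is one of the ways the bitrate can be adjusted.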

Multimodality representation. In our proposed scheme, we focus on efficiently representing spatial and temporal information of videos through keyframes and motion. Specifically, let $V$ represent the original video. The keyframes are denoted as $K = \{k_1, k_2, \dots, k_n\}$, where each $k_i$ is a keyframe. The motion information $m_j$ between two consecutive keyframes is represented by $M = \{m_1, m_2, \dots, m_{n-1}\}$. Keyframes and motion are transformed into multimodality representations as follows: $T_{k,i} = f(k_i)$ and $T_{m,j} = g(m_j)$. Here, $f(*)$ and $g(*)$ denote the process of cross-modality representation for keyframes and motion, respectively. As illustrated in Fig. 1, keyframes and motion can be transformed into textual and visual representations. Thus, the total bitrate is given by:

$$R_{total} = \sum_{i=1}^{n} R_k(T_{k,i}) + \sum_{j=1}^{n-1} R_m(T_{m,j}), \tag{2}$$

where $R_k$ and $R_m$ are the entropy coding modules for keyframes and motion, respectively. The bitrates can be adjusted by $n$ and the compression ratio of keyframes and motion.
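As a worked illustration of Eq. (2), the following Python sketch sums the coded size of $n$ keyframe descriptions and $n-1$ motion descriptions. It is only a sketch: `zlib` stands in for the entropy coding modules $R_k$ and $R_m$, which the text does not specify, and the helper names `coded_bits`, `total_bits`, and `bits_per_pixel` are hypothetical.

```python
import zlib

def coded_bits(text):
    """Bits spent on one description by a stand-in coder (zlib is an assumption)."""
    return 8 * len(zlib.compress(text.encode("utf-8"), 9))

def total_bits(keyframe_texts, motion_texts):
    """R_total of Eq. (2): n keyframe terms plus n-1 motion terms."""
    if len(motion_texts) != len(keyframe_texts) - 1:
        raise ValueError("expected one motion description per pair "
                         "of consecutive keyframes")
    return (sum(coded_bits(t) for t in keyframe_texts)
            + sum(coded_bits(t) for t in motion_texts))

def bits_per_pixel(bits, width, height, num_frames):
    """Normalize a bit count by the number of pixels in the clip."""
    return bits / (width * height * num_frames)
```

For example, `total_bits(["a dog on grass", "a dog near a tree"], ["the dog runs to the right"])` covers one clip bounded by two keyframes. Longer descriptions or more keyframes increase the total, which matches the two adjustment knobs named above.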

CMVC Decoder

In the decoder, we utilize the decoded keyframes $\hat{K}$ and motion $\hat{M}$ to achieve video generation, as follows:

$$\hat{V} = \mathcal{G}(\hat{K}, \hat{M}), \tag{3}$$

where $\mathcal{G}(*)$ is a video generation model and $\hat{V}$ is the reconstructed video. Based on different modality representations for keyframes and motion, we design two modes, the TT2V mode and the IT2V mode, for ULB and ELB coding, respectively.

In the TT2V mode, we utilize state-of-the-art (SoTA) video generation models to generate videos from decoded keyframes and motion descriptions. Leveraging advancements in these models, we optimize semantic reconstruction, with our results showing that more detailed descriptions yield higher bitrates and improved semantic quality. In the IT2V mode, keyframe images and motion descriptions are integrated to enhance perceptual quality. In addition to employing existing IT2V models, we propose a generative model utilizing LoRA tuning to ensure superior perceptual consistency at ELB.

The IT2V generative model.

Figure 2: The workflow of the IT2V generative model. Two LoRAs are trained to fit the two keyframe images ($I_0$ and $I_1$), respectively. To generate the $w$-th frame between $I_0$ and $I_1$, we interpolate $I'_w$ and the LoRA parameters according to the weights $w_i$ and $w_l$.

The IT2V mode is designed to obtain a reconstructed video from keyframe images and the text describing the motion. Thus, we propose an IT2V generative model, which generates a video clip from two keyframe images ($I_0$ and $I_1$) and the description of the motion in this clip. Specifically, we adopt a Stable Diffusion (SD) model with LoRA, which fine-tunes the model parameters $\theta$ by training a low-rank residual component $\Delta\theta$. This residual can be decomposed into a product of low-rank matrices. LoRA is efficient at generating various samples while maintaining a consistent semantic identity across different latent noise traversals. The proposed IT2V model is shown in Fig. 2. We first train two LoRAs ($\Delta\theta_0$ and $\Delta\theta_1$) on the SD UNet $\epsilon_\theta$, one for each of the two images $I_0$ and $I_1$. The learning objective of $\Delta\theta_i$ ($i = 0, 1$) is:

$$\mathcal{L}(\Delta\theta_i) = \mathbb{E}_{\epsilon,t}\left[\left\|\epsilon - \epsilon_{\theta+\Delta\theta_i}\left(\mathbf{z}_{t,i}, t, \mathbf{c}_i\right)\right\|^2\right], \tag{4}$$

where $\mathbf{z}_{t,i} = \sqrt{\bar{\alpha}_t}\,\hat{\mathbf{z}}_i + \sqrt{1-\bar{\alpha}_t}\,\epsilon$ is the noised latent embedding at diffusion step $t$, $\hat{\mathbf{z}}_i$ is the VAE-encoded latent of the image $I_i$, $\epsilon \sim \mathcal{N}(\mathbf{0}, \mathbf{I})$ is randomly sampled Gaussian noise, $\mathbf{c}_i$ is the motion embedding encoded from the motion prompt, and $\epsilon_{\theta+\Delta\theta_i}$ represents the LoRA-integrated UNet. The fine-tuning objective is optimized separately via gradient descent for $\Delta\theta_0$ and $\Delta\theta_1$.
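To make the objective in Eq. (4) concrete, the sketch below draws one Monte Carlo sample of it with NumPy. It is illustrative only: `predict_noise` is a hypothetical placeholder for the LoRA-integrated UNet $\epsilon_{\theta+\Delta\theta_i}$, and the latent, the value of $\bar{\alpha}_t$, and the conditioning are supplied by the caller rather than by a real VAE, noise schedule, or text encoder.

```python
import numpy as np

def noised_latent(z_hat, alpha_bar_t, eps):
    """z_t = sqrt(alpha_bar_t) * z_hat + sqrt(1 - alpha_bar_t) * eps."""
    return np.sqrt(alpha_bar_t) * z_hat + np.sqrt(1.0 - alpha_bar_t) * eps

def lora_loss_sample(predict_noise, z_hat, alpha_bar_t, t, cond, rng):
    """One sample of the Eq. (4) objective for a single keyframe latent.

    predict_noise(z_t, t, cond) is a stand-in for the LoRA-integrated UNet.
    """
    eps = rng.standard_normal(z_hat.shape)         # eps ~ N(0, I)
    z_t = noised_latent(z_hat, alpha_bar_t, eps)   # noised latent at step t
    residual = eps - predict_noise(z_t, t, cond)
    return float(np.sum(residual ** 2))            # squared L2 norm
```

Averaging this quantity over many draws of the noise and the diffusion step approximates the expectation in Eq. (4). In the scheme described above, only the low-rank residual would receive gradient updates while the base parameters $\theta$ stay fixed.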

Figure 3: Left: Comparison of combinations of different V2T models (VideoLLaVA and VideoLLaMA) and TT2V models (VideoCrafter1, VideoCrafter2, ModelScope, OpenSora and AnimateDiff). Right: Visual quality comparison of the TT2V mode and VTM. At ULB, our proposed TT2V mode successfully preserves the semantic quality of the videos. In contrast, VTM introduces significant blocking artifacts, which impede the effective conveyance of semantic information in videos.
Figure 4: Visual quality comparison. The values represent the BPP (1e-2) and the DISTS value. A lower DISTS value indicates better perceptual quality.

Frame and model interpolation. To generate $I'_w$, we first interpolate the keyframes to form the input as follows:

$$I'_w = w_i \times I_0 + \left(1 - w_i\right) \times I_1. \tag{5}$$

Building upon DiffMorpher (Zhang et al. 2024a), we then interpolate the model weights $\Delta\theta_l$ from $\Delta\theta_0$ and $\Delta\theta_1$:

$$\Delta\theta_l = w_l \times \Delta\theta_0 + \left(1 - w_l\right) \times \Delta\theta_1. \tag{6}$$

$\Delta\theta_l$ denotes the LoRA parameters, which are integrated into the UNet to give $\epsilon_{\theta+\Delta\theta_l}$.
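The two interpolations in Eqs. (5) and (6) are plain convex combinations, as the following NumPy sketch shows. It assumes, for illustration only, that keyframes are arrays of equal shape and that each LoRA is stored as a dictionary mapping a parameter name to an array; the function names are hypothetical.

```python
import numpy as np

def interpolate_keyframes(I0, I1, w_i):
    """Eq. (5): pixel-wise blend of the two keyframes."""
    return w_i * I0 + (1.0 - w_i) * I1

def interpolate_lora(delta0, delta1, w_l):
    """Eq. (6): blend two sets of LoRA parameters with matching names."""
    return {name: w_l * delta0[name] + (1.0 - w_l) * delta1[name]
            for name in delta0}
```

Note the direction of the weights: a weight of 1 reproduces $I_0$ (or $\Delta\theta_0$) and a weight of 0 reproduces $I_1$ (or $\Delta\theta_1$).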

A uniform linear interpolation schedule may result in an uneven transition. Thus, we conduct online training of $w_i$ and $w_l$ at the encoder side under the constraint $D(I_w, \hat{I}_w)$, where $D(*)$ is the $\mathcal{L}_2$ loss. We update only $w_i$ and $w_l$, as follows:

$$w_i^{t+1} = w_i^{t} - \alpha \nabla D(w_i^{t}), \tag{7}$$
$$w_l^{t+1} = w_l^{t} - \alpha \nabla D(w_l^{t}), \tag{8}$$

where $w_i^{t}$ and $w_l^{t}$ are the parameters $w_i$ and $w_l$ at training step $t$, respectively. $\alpha$ is the learning rate, set to 0.001, while $\nabla D(*)$ denotes the gradient of the loss function with respect to the parameters at training step $t$. After obtaining the optimal $w_i^{t}$ and $w_l^{t}$, we compress and transmit them to the decoder. The VAE decoder then reconstructs the denoised latent representation into the $w$-th frame, resulting in $\hat{I}_w$.
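The update rule of Eq. (7) can be illustrated on a deliberately simplified toy problem. In the sketch below, the decoded frame $\hat{I}_w$ is replaced by the pixel blend of Eq. (5), so that the gradient of a mean-squared-error loss with respect to $w_i$ has a closed form; in the actual scheme the loss is computed on the frame produced by the diffusion model, which this sketch does not attempt. The function name `fit_blend_weight` and the step count are assumptions.

```python
import numpy as np

def fit_blend_weight(I0, I1, target, w_init=0.5, alpha=1e-3, steps=20000):
    """Gradient descent on w for D = mean((w*I0 + (1-w)*I1 - target)**2)."""
    w = float(w_init)
    diff = I0 - I1                           # derivative of the blend w.r.t. w
    for _ in range(steps):
        residual = w * I0 + (1.0 - w) * I1 - target
        grad = 2.0 * float(np.mean(residual * diff))
        w -= alpha * grad                    # update rule of Eq. (7)
    return w
```

If `target` is itself a blend of `I0` and `I1`, the loop recovers the blending weight. With the learning rate of 0.001 stated above, this toy problem needs many steps to converge, which is why the default step count is large.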

Experiments

Experimental settings

Datasets. The datasets, including HEVC Class B, C, D, and E, as well as UVG and MCL-JCV, are extensively utilized for evaluating both traditional and neural video codecs. These datasets vary in resolution and content, providing a diverse range of scenarios for comprehensive assessment. To ensure compatibility with various video codecs, we resize videos to dimensions that are multiples of 64 for both width and height.

Figure 5: The R-D performance comparison in the IT2V mode. The comparisons are performed on the Class B, Class C, Class D, Class E, UVG, and MCL-JCV, respectively.
Table 1: BD-Rate (%) comparison of different video generation models across various datasets in terms of DISTS. The anchor is VTM with QP={52, 50, 47, 45}.
Dataset    RIFE     AMT      DiffMorpher   Ours
Class B     4.32    53.97    -45.94        -59.12
Class C    42.58    50.57    -31.06        -17.27
Class D    -4.89   -29.88    -49.06        -52.16
UVG        24.08    15.00     11.55        -21.18
MCL-JCV    83.08   151.20     22.16         -7.25

Comparison methods. Numerous SoTA foundation models are available for video understanding. We choose two prominent models, VideoLLaVA (Lin et al. 2023) and VideoLLaMA (Zhang, Li, and Bing 2023), to extract semantic information from videos. This process corresponds to the V2T stage depicted in Fig. 1, where the selected models extract semantic descriptions for keyframes and motion. In the TT2V mode, we employ advanced video generation models, including Open-Sora (Lab and etc. 2024), VideoCrafter1 (Chen et al. 2023), VideoCrafter2 (Chen et al. 2024a), and AnimateDiff (Guo et al. 2024), to generate videos from textual input. We also compare with the video codec VTM at an extremely low bitrate with QP=63. In the IT2V mode, we compare with traditional video codecs, namely x264, x265, and VTM (Bross et al. 2021), and with deep video codecs such as DCVC (Li, Li, and Lu 2021) and DCVC-DC (Li, Li, and Lu 2023), although the latter have difficulty reaching extremely low bitrates. In addition, we compare with the video generation technique DiffMorpher (Zhang et al. 2024a), which requires keyframe images and motion descriptions to control video generation. The video interpolation methods in our comparison rely solely on keyframes for control and do not incorporate motion descriptions. Note that the bit consumption associated with the motion text has not been calculated.

Figure 6: Visual quality comparison in a video. The numbers displayed beneath the images correspond to the frame index.
Figure 7: Visual quality comparison between the TT2V mode and the IT2V mode. The TT2V mode effectively preserves semantic consistency with the ground truth, while the IT2V mode is designed to keep perceptual consistency.

Experimental results

Comparison in the TT2V mode. We compare two V2T models, VideoLLaVA and VideoLLaMA, both of which are SOTA MLLMs. We then extend the comparison to five video generation models: VideoCrafter1, VideoCrafter2, ModelScope, Open-Sora, and AnimateDiff. Furthermore, we compare these models against the traditional video codec VTM at QP=63, which results in a higher bitrate than our proposed scheme. Our assessment covers five aspects: subject consistency, background consistency, temporal flickering, motion smoothness, and frame quality. The results, illustrated in Fig. 3, indicate that the TT2V generation models outperform VTM, with better frame-wise quality and better consistency in both background and subject representation. These results are averaged across all test datasets; detailed comparisons can be found in the supplementary material. The visual quality comparisons in Fig. 3 also show that VTM produces considerable blocking artifacts, which severely hinder its ability to convey semantic information.

Table 2: Ablation studies on the keyframe selection strategy.
Models                Settings           BD-Rate (%) ↓
Sampling strategies   Uniform sampling   -19.705
Sampling strategies   Random sampling    -9.859
Sampling strategies   MSE                -14.496
Sampling strategies   CS                 -24.206
Keyframe number       2                  -20.001
Keyframe number       3                  -5.104
Keyframe number       4                  12.665
Keyframe quality      low                -24.206
Keyframe quality      medium             -12.043
Keyframe quality      high               -2.617

Comparison in the IT2V mode. We compare our model with traditional codecs (x264, x265, and VTM) as well as deep video codecs (DCVC and DCVC-DC). As presented in Fig. 5, we use DISTS to evaluate perceptual quality; comparisons with other metrics, such as LPIPS, FID, and PSNR, are provided in the supplementary material. Note that the pretrained models released for the deep video codecs have difficulty reaching ELB. In addition, we compare our model with video generation models, including RIFE (Huang et al. 2022), AMT (Li et al. 2023), and DiffMorpher (Zhang et al. 2024a), as detailed in Table 1. The bitrate can be controlled by adjusting the number and quality of keyframe images. For our comparisons, we select the best-performing settings, which are listed in the supplementary material. Our model achieves the lowest BD-Rate on four of the five datasets in Table 1 (DiffMorpher is better on Class C) and is the only method with a negative BD-Rate on every dataset, indicating greater stability than the other video generation models. The visual quality is evaluated at similar bitrates, as shown in Fig. 4 and Fig. 6. Our model exhibits superior perceptual quality in both the spatial and temporal dimensions. Additionally, Fig. 7 shows frames sampled from the decoded videos generated by the TT2V mode and the IT2V mode. The TT2V mode effectively preserves semantic consistency with the ground truth, while the IT2V mode further ensures perceptual consistency.

Ablation studies

Keyframe. We perform ablation studies on keyframes, examining the keyframe selection method, the quality of keyframe images, and the number of keyframes. For keyframe selection, we evaluate uniform sampling and random sampling. Since these two strategies do not rely on a distance function, we also compare a selection strategy based on mean-square error (MSE) distance with ours, which uses cosine similarity (CS) distance. For keyframe quality, we test low, medium, and high quality levels, which correspond to the compression factors 64, 128, and 256, respectively. The results in Table 2 indicate that higher-quality decoded images consume more bitrate, so higher quality does not necessarily lead to a better BD-Rate. We also vary the number of keyframes according to the number of frames in the video and observe that fewer keyframes give a better balance between quality and bitrate consumption.
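
To make the role of the distance function concrete, the sketch below selects a fixed number of keyframes from per-frame feature vectors using either a cosine or an MSE distance. It is an illustrative assumption rather than the selection rule used in this paper: the greedy farthest-point procedure, the function names, and the toy feature vectors are all hypothetical, and in practice the features would come from an image encoder.

```python
import math

def cosine_distance(a, b):
    """1 - cosine similarity; insensitive to the magnitude of the vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    if na == 0.0 or nb == 0.0:
        return 1.0  # treat a zero vector as maximally dissimilar
    return 1.0 - dot / (na * nb)

def mse_distance(a, b):
    """Mean squared error between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)

def select_keyframes(features, k, distance=cosine_distance):
    """Greedy farthest-point selection of k keyframe indices (assumed rule).

    Starts from frame 0 and repeatedly adds the frame that is farthest
    from its nearest already-selected keyframe.
    """
    if not 1 <= k <= len(features):
        raise ValueError("k must be between 1 and the number of frames")
    selected = [0]
    while len(selected) < k:
        best_idx, best_gap = None, -1.0
        for i, feat in enumerate(features):
            if i in selected:
                continue
            gap = min(distance(feat, features[s]) for s in selected)
            if gap > best_gap:
                best_idx, best_gap = i, gap
        selected.append(best_idx)
    return sorted(selected)

# Toy 2-D features: frames 0-1 show one scene, frames 2-3 another.
toy_features = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
print(select_keyframes(toy_features, 2))
print(select_keyframes(toy_features, 2, distance=mse_distance))
```

The two distances agree on this toy input but can diverge in general: a feature vector that is only rescaled (for example by a brightness change) has zero cosine distance yet a nonzero MSE, which is one plausible reason a cosine-based criterion tracks content changes more closely.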

IT2V generative model. We perform ablation studies on several settings: the influence of the motion description, different codecs, updating strategies, the number of training steps, and the number of sampling steps. For the motion description, we compare against the model without it, as shown in Table 3. The results indicate that incorporating the motion description significantly improves video reconstruction quality. We also explore several codecs for keyframe images: Hyperprior (Ballé et al. 2018), NIC (Chen et al. 2021), and NVTC (Feng et al. 2023). Among these, NVTC delivers superior reconstruction quality at a lower coding rate. Our model requires updating $w_i$ and $w_l$ based on the input, so we further evaluate the effectiveness of the updating strategies in Table 3. Finally, we examine the effect of varying the training and sampling steps. Increasing either improves the results, but going from 100 to 150 training steps or from 50 to 100 sampling steps improves the BD-Rate by only about one percentage point. To balance performance and computational efficiency, we choose 100 training steps and 50 sampling steps for our final implementation.

Table 3: Ablation studies of the IT2V mode.
Models                     Settings     BD-Rate (%) ↓
Motion description         ×            8.516
Motion description         ✓            -24.206
Codecs                     Hyperprior   -12.261
Codecs                     NIC          -16.132
Codecs                     NVTC         -24.206
Updating $w_i$             ×            -12.402
Updating $w_l$             ×            -18.551
Updating $w_i$ and $w_l$   ×            -2.351
Updating $w_i$ and $w_l$   ✓            -24.206
Training step              50           -17.864
Training step              100          -24.206
Training step              150          -25.196
Sampling step              20           -4.035
Sampling step              50           -24.206
Sampling step              100          -25.207

Discussion

Application. The proposed method enables efficient transmission at low bitrates while preserving semantic content, making it ideal for bandwidth-limited or emergency alert scenarios. When bandwidth allows, keyframe data can be transmitted with textual descriptions, allowing the decoder to reconstruct an approximate video. This hybrid approach balances visual quality with data efficiency, suitable for situations where full video streaming is infeasible, such as disaster response and remote surveillance. Although still in the research phase, the method shows strong potential for real-world applications. With advances in computational resources and model optimization techniques like pruning and quantization, it is expected to become a practical solution for emergency communications and other bandwidth-constrained environments.

Higher bitrate. CMVC includes the TT2V mode and IT2V mode, but more modes can be further explored. For instance, motion representation can be realized through optical flow or trajectories. By integrating multiple modalities of keyframes and motion, we can cater to diverse reconstruction requirements. Furthermore, future efforts should prioritize enhancing CMVC at higher bitrates by integrating more control information to facilitate the reconstruction of the original video. This approach aims to achieve superior performance across all bitrates and dimensions when compared to traditional codecs.

Conclusion

We propose a CMVC paradigm that represents a promising advancement in video coding technology. This framework effectively tackles the challenges of preserving semantic integrity and perceptual consistency at ULB and ELB. By leveraging MLLMs and cross-modality representation techniques, the proposed CMVC framework disentangles videos into content and motion components, transforming them into different modalities for efficient compression and reconstruction. Through the TT2V and IT2V modes, CMVC achieves a balance between semantic information and perceptual quality, offering a comprehensive solution at high compression ratios.

References

  • Ballé et al. (2018) Ballé, J.; Minnen, D.; Singh, S.; Hwang, S. J.; and Johnston, N. 2018. Variational image compression with a scale hyperprior. In ICLR.
  • Bhaskaran and Konstantinides (1997) Bhaskaran, V.; and Konstantinides, K. 1997. Image and video compression standards: algorithms and architectures. Springer Science & Business Media.
  • Blattmann et al. (2023) Blattmann, A.; Rombach, R.; Ling, H.; Dockhorn, T.; Kim, S. W.; Fidler, S.; and Kreis, K. 2023. Align your latents: High-resolution video synthesis with latent diffusion models. In CVPR, 22563–22575.
  • Bross et al. (2021) Bross, B.; Wang, Y.-K.; Ye, Y.; Liu, S.; Chen, J.; Sullivan, G. J.; and Ohm, J.-R. 2021. Overview of the Versatile Video Coding (VVC) Standard and its Applications. T-CSVT, 31(10): 3736–3764.
  • Chai et al. (2023) Chai, W.; Guo, X.; Wang, G.; and Lu, Y. 2023. Stablevideo: Text-driven consistency-aware diffusion video editing. In ICCV, 23040–23050.
  • Chen et al. (2023) Chen, H.; Xia, M.; He, Y.; Zhang, Y.; Cun, X.; Yang, S.; Xing, J.; Liu, Y.; Chen, Q.; Wang, X.; Weng, C.; and Shan, Y. 2023. VideoCrafter1: Open Diffusion Models for High-Quality Video Generation. arXiv preprint arXiv:2310.19512.
  • Chen et al. (2024a) Chen, H.; Zhang, Y.; Cun, X.; Xia, M.; Wang, X.; Weng, C.; and Shan, Y. 2024a. Videocrafter2: Overcoming data limitations for high-quality video diffusion models. In CVPR, 7310–7320.
  • Chen et al. (2021) Chen, T.; Liu, H.; Ma, Z.; Shen, Q.; Cao, X.; and Wang, Y. 2021. End-to-End Learnt Image Compression via Non-Local Attention Optimization and Improved Context Modeling. TIP, 30: 3179–3191.
  • Chen et al. (2024b) Chen, X.; Wang, Y.; Zhang, L.; Zhuang, S.; Ma, X.; Yu, J.; Wang, Y.; Lin, D.; Qiao, Y.; and Liu, Z. 2024b. SEINE: Short-to-Long Video Diffusion Model for Generative Transition and Prediction. In ICLR.
  • Esser et al. (2023) Esser, P.; Chiu, J.; Atighehchian, P.; Granskog, J.; and Germanidis, A. 2023. Structure and content-guided video synthesis with diffusion models. In ICCV, 7346–7356.
  • Fei et al. (2024) Fei, H.; Wu, S.; Ji, W.; Zhang, H.; Zhang, M.; Lee, M.-L.; and Hsu, W. 2024. Video-of-thought: Step-by-step video reasoning from perception to cognition. In ICML.
  • Feng et al. (2023) Feng, R.; Guo, Z.; Li, W.; and Chen, Z. 2023. NVTC: Nonlinear Vector Transform Coding. In CVPR, 6101–6110.
  • Gao et al. (2024) Gao, J.; Li, J.; Jia, C.; Wang, S.; Ma, S.; and Gao, W. 2024. Cross Modal Compression With Variable Rate Prompt. TMM, 26: 3444–3456.
  • Ge et al. (2023) Ge, S.; Nah, S.; Liu, G.; Poon, T.; Tao, A.; Catanzaro, B.; Jacobs, D.; Huang, J.-B.; Liu, M.-Y.; and Balaji, Y. 2023. Preserve your own correlation: A noise prior for video diffusion models. In ICCV, 22930–22941.
  • Guo et al. (2024) Guo, Y.; Yang, C.; Rao, A.; Wang, Y.; Qiao, Y.; Lin, D.; and Dai, B. 2024. Animatediff: Animate your personalized text-to-image diffusion models without specific tuning. In ICLR.
  • Han et al. (2022) Han, L.; Ren, J.; Lee, H.-Y.; Barbieri, F.; Olszewski, K.; Minaee, S.; Metaxas, D.; and Tulyakov, S. 2022. Show me what and tell me how: Video synthesis via multimodal conditioning. In CVPR, 3615–3625.
  • Ho et al. (2022) Ho, J.; Chan, W.; Saharia, C.; Whang, J.; Gao, R.; Gritsenko, A.; Kingma, D. P.; Poole, B.; Norouzi, M.; Fleet, D. J.; et al. 2022. Imagen video: High definition video generation with diffusion models. arXiv preprint arXiv:2210.02303.
  • Huang et al. (2022) Huang, Z.; Zhang, T.; Heng, W.; Shi, B.; and Zhou, S. 2022. Real-Time Intermediate Flow Estimation for Video Frame Interpolation. In ECCV.
  • Lab and etc. (2024) PKU-Yuan Lab; and Tuzhan AI, etc. 2024. Open-Sora-Plan. GitHub.
  • Li et al. (2024) Li, C.; Wu, X.; Wu, H.; Feng, D.; Zhang, Z.; Lu, G.; Min, X.; Liu, X.; Zhai, G.; and Lin, W. 2024. CMC-Bench: Towards a New Paradigm of Visual Signal Compression. arXiv preprint arXiv:2406.09356.
  • Li et al. (2021) Li, J.; Jia, C.; Zhang, X.; Ma, S.; and Gao, W. 2021. Cross Modal Compression: Towards Human-comprehensible Semantic Compression. In ACM MM, 4230–4238.
  • Li, Li, and Lu (2021) Li, J.; Li, B.; and Lu, Y. 2021. Deep Contextual Video Compression. In NeurIPS, volume 34.
  • Li, Li, and Lu (2023) Li, J.; Li, B.; and Lu, Y. 2023. Neural Video Compression with Diverse Contexts. In CVPR.
  • Li et al. (2023) Li, Z.; Zhu, Z.-L.; Han, L.-H.; Hou, Q.; Guo, C.-L.; and Cheng, M.-M. 2023. AMT: All-Pairs Multi-Field Transforms for Efficient Frame Interpolation. In CVPR.
  • Lin et al. (2023) Lin, B.; Zhu, B.; Ye, Y.; Ning, M.; Jin, P.; and Yuan, L. 2023. Video-llava: Learning united visual representation by alignment before projection. arXiv preprint arXiv:2311.10122.
  • Lu et al. (2022) Lu, G.; Zhong, T.; Geng, J.; Hu, Q.; and Xu, D. 2022. Learning based multi-modality image and video compression. In CVPR, 6083–6092.
  • Ma et al. (2019) Ma, S.; Zhang, X.; Jia, C.; Zhao, Z.; Wang, S.; and Wang, S. 2019. Image and video compression with neural networks: A review. T-CSVT, 30(6): 1683–1698.
  • Ouyang et al. (2024) Ouyang, H.; Wang, Q.; Xiao, Y.; Bai, Q.; Zhang, J.; Zheng, K.; Zhou, X.; Chen, Q.; and Shen, Y. 2024. Codef: Content deformation fields for temporally consistent video processing. In CVPR, 8089–8099.
  • Radford et al. (2021) Radford, A.; Kim, J. W.; Hallacy, C.; Ramesh, A.; Goh, G.; Agarwal, S.; Sastry, G.; Askell, A.; Mishkin, P.; Clark, J.; Krueger, G.; and Sutskever, I. 2021. Learning Transferable Visual Models From Natural Language Supervision. In ICML.
  • Ruan et al. (2023) Ruan, L.; Ma, Y.; Yang, H.; He, H.; Liu, B.; Fu, J.; Yuan, N. J.; Jin, Q.; and Guo, B. 2023. Mm-diffusion: Learning multi-modal diffusion models for joint audio and video generation. In CVPR, 10219–10228.
  • Singer et al. (2023) Singer, U.; Polyak, A.; Hayes, T.; Yin, X.; An, J.; Zhang, S.; Hu, Q.; Yang, H.; Ashual, O.; Gafni, O.; et al. 2023. Make-a-video: Text-to-video generation without text-video data. In ICLR.
  • Voleti et al. (2024) Voleti, V.; Yao, C.-H.; Boss, M.; Letts, A.; Pankratz, D.; Tochilkin, D.; Laforte, C.; Rombach, R.; and Jampani, V. 2024. Sv3d: Novel multi-view synthesis and 3d generation from a single image using latent video diffusion. arXiv preprint arXiv:2403.12008.
  • Wang et al. (2023a) Wang, J.; Yuan, H.; Chen, D.; Zhang, Y.; Wang, X.; and Zhang, S. 2023a. Modelscope text-to-video technical report. arXiv preprint arXiv:2308.06571.
  • Wang et al. (2023b) Wang, Y.; Chen, X.; Ma, X.; Zhou, S.; Huang, Z.; Wang, Y.; Yang, C.; He, Y.; Yu, J.; Yang, P.; et al. 2023b. Lavie: High-quality video generation with cascaded latent diffusion models. arXiv preprint arXiv:2309.15103.
  • Yin et al. (2023) Yin, S.; Wu, C.; Liang, J.; Shi, J.; Li, H.; Ming, G.; and Duan, N. 2023. Dragnuwa: Fine-grained control in video generation by integrating text, image, and trajectory. arXiv preprint arXiv:2308.08089.
  • Zhang, Li, and Bing (2023) Zhang, H.; Li, X.; and Bing, L. 2023. Video-llama: An instruction-tuned audio-visual language model for video understanding. arXiv preprint arXiv:2306.02858.
  • Zhang et al. (2024a) Zhang, K.; Zhou, Y.; Xu, X.; Dai, B.; and Pan, X. 2024a. DiffMorpher: Unleashing the Capability of Diffusion Models for Image Morphing. In CVPR, 7912–7921.
  • Zhang et al. (2024b) Zhang, K.; Zhou, Y.; Xu, X.; Dai, B.; and Pan, X. 2024b. DiffMorpher: Unleashing the Capability of Diffusion Models for Image Morphing. In CVPR, 7912–7921.
  • Zhang et al. (2023) Zhang, P.; Wang, S.; Wang, M.; Li, J.; Wang, X.; and Kwong, S. 2023. Rethinking Semantic Image Compression: Scalable Representation with Cross-modality Transfer. T-CSVT.