When Video Coding Meets Multimodal Large Language Models: A Unified Paradigm for Video Coding

Pingping Zhang¹, Jinlong Li², Kecheng Chen¹, Meng Wang³, Long Xu⁴, Haoliang Li¹, Nicu Sebe², Sam Kwong³, Shiqi Wang¹

Abstract

Traditional video compression methods perform well at high bitrates but struggle to preserve fine-grained semantic information at low bitrates. With the recent rise of Multimodal Large Language Models (MLLMs), cross-modal compression techniques offer a promising route to improving video compression under low-bitrate conditions. In this paper, we propose a unified Cross-Modality Video Coding (CMVC) framework that integrates multimodal representations and video generative models. The encoder disentangles video into spatial and temporal components, which are mapped to compact cross-modal representations using MLLMs. During decoding, different encoding-decoding modes are employed to meet different reconstruction requirements, including Text-Text-to-Video (TT2V) for semantic preservation and Image-Text-to-Video (IT2V) for perceptual consistency. Additionally, we develop an efficient frame interpolation model using Low-Rank Adaptation (LoRA) to improve perceptual quality. Experiments show that TT2V achieves effective semantic reconstruction, while IT2V ensures competitive perceptual consistency. These findings suggest the potential of leveraging multimodal priors to improve video compression, offering promising future research directions.

Introduction

Traditional video compression (Bhaskaran and Konstantinides 1997; Bross et al. 2021; Ma et al. 2019) has primarily focused on signal-level reconstruction, optimizing components such as pixel intensities and motion vectors during encoding and decoding. This approach has yielded substantial advancements in high-bitrate scenarios, exemplified by codecs like VVC (Bross et al. 2021) and DCVC (Li, Li, and Lu 2021), which deliver remarkable performance. However, these conventional methods often encounter limitations in low-bitrate scenarios, particularly in preserving fine-grained semantic information, which is crucial for applications such as disaster response and remote surveillance. In such contexts, key features of the reconstructed video are prone to being lost or blurred.

To address these shortcomings at low bitrates, cross-modal compression (CMC) methods have emerged (Li et al. 2021; Zhang et al. 2023; Gao et al. 2024), which harness the complementary strengths of multiple modalities to enhance image representations under constrained bitrate conditions. Notable approaches, such as CMC (Li et al. 2021), SCMC (Zhang et al. 2023), VR-CMC (Gao et al. 2024), and CMC-Bench (Li et al. 2024), exploit the transformation of images into textual formats (I2T) to generate semantically similar content. This transformation not only preserves crucial semantic information but also achieves significant reductions in storage requirements, resulting in higher compression ratios compared to conventional image data representations. However, the potential of cross-modal representations for video compression, particularly in maintaining both semantic and perceptual quality at low bitrates, remains relatively underexplored.

Unlike image compression, which is concerned primarily with spatial information, video compression additionally involves the representation of temporal dynamics, significantly increasing its complexity. In this regard, Multimodal Large Language Models (MLLMs) (Han et al. 2022; Ruan et al. 2023) offer distinct advantages due to their inherent capacity to analyze and interpret temporal relationships within video content. Recent studies (Lin et al. 2023; Zhang, Li, and Bing 2023; Fei et al. 2024) have demonstrated that MLLMs excel at processing sequential data and capturing dependencies across temporal events. By leveraging these capabilities, MLLMs can generate compact and semantically rich textual representations of video content, facilitating efficient compression while preserving high-quality reconstruction, even in low-bitrate settings. As such, MLLMs present a promising paradigm for advancing video compression, offering an effective balance between encoding efficiency and semantic preservation.

Figure 1: The framework of the proposed CMVC scheme. This framework operates by first segmenting the video into distinct clips using a keyframe selection strategy (a), allowing for the extraction of both spatial and temporal components from each video segment. Subsequently, MLLMs are employed to generate multimodal representations of these components. For instance, spatial information can be represented through text or images, while temporal dynamics may be encoded using text or audio modalities. These multimodal representations are then encoded via their respective encoders, resulting in compressed bitstreams for each component. The bitstreams corresponding to different components are then combined and transmitted to the decoder. In the decoder, we provide two exemplary modes, TT2V (c) and IT2V (d), for video generation. The framework integrates various SoTA models and mode conversions while maintaining semantic and perceptual quality at relatively high compression ratios.

Motivated by the advantages of MLLMs, we propose a new Cross-Modality Video Coding (CMVC) scheme aimed at optimizing the representation of both spatial and temporal components, thereby enabling high-fidelity semantic and perceptual reconstruction at low bitrates. Capitalizing on the rich potential of multimodal representations, this framework supports the development of diverse encoding-decoding modes tailored to specific reconstruction requirements. We present two exemplary modes: TT2V (Text-Text-to-Video) mode for semantic reconstruction at ultra-low bitrate (ULB) and IT2V (Image-Text-to-Video) mode for perceptual reconstruction at extremely low bitrate (ELB). In the TT2V mode, inspired by the workflow of Cross-Modality Image Compression (CMIC) (Li et al. 2021; Zhang et al. 2023; Li et al. 2024) and the strong generation capability of existing TT2V models for video reconstruction, we first extract representative text constructed with our selection strategy, encoding video content as the spatial component and motion as the temporal component. Then, the video generation model is utilized to reconstruct the corresponding video from the text inputs. The rationale behind this strategy is that compact yet effective text representations from the encoder encapsulate semantic details, enabling high-quality semantic reconstruction at the decoder. In contrast to TT2V, the IT2V mode is designed to enhance perceptual reconstruction, since images provide richer visual context than text, which benefits perceptual consistency. This is achieved by passing to the decoder text representations similar to those of the TT2V mode, together with selected keyframes from the encoder, for better perceptual video reconstruction.
To further improve perceptual smoothness across consecutive frames, we tailor an efficient frame-interpolation adaptation based on Low-Rank Adaptation (LoRA) tuning, which exploits the semantic cues and visual contexts from both input texts and keyframes to improve perceptual consistency in the reconstructed video. This paradigm accommodates diverse modality representations within video coding by tapping into foundational MLLMs and video generation models, which may inform future video coding work. The contributions of our work are as follows:

• To the best of our knowledge, the proposed unified paradigm for CMVC is the first to leverage foundational MLLMs and video generation models for video coding.

• We design multiple encoding-decoding modes that trade off video reconstruction quality against bitrate for specified decoding requirements, including the TT2V mode to preserve high-quality semantic information and the IT2V mode to achieve strong perceptual consistency.

• Extensive experiments demonstrate that the proposed CMVC pipeline obtains competitive video reconstructions on the HEVC Class B, C, D, and E, UVG, and MCL-JCV benchmarks while maintaining high compression ratios.

Related works

Video Generation Models

Recently, video generation has emerged as an increasingly promising topic, with numerous studies (Ho et al. 2022; Singer et al. 2023; Ge et al. 2023; Blattmann et al. 2023; Chen et al. 2024a; Wang et al. 2023a) showing that generative models can increasingly simulate real-world dynamics. These include various approaches such as text-to-video (T2V) (Wang et al. 2023a, b), image-to-video (I2V) (Chen et al. 2024b; Esser et al. 2023; Yin et al. 2023), and IT2V (Zhang et al. 2024a), among others. T2V technology is designed to convert descriptive text into corresponding videos (Lin et al. 2023; Wang et al. 2023a; Zhang, Li, and Bing 2023). One of the primary challenges in this field is to understand the intricate semantics of the input text and effectively translate it into dynamic visual content that follows real-world physics. To achieve optimal quality in the generated videos, these models are trained on large-scale text-video corpora for better alignment. I2V generative models typically include methods such as video interpolation and image-driven video diffusion. Image-driven video diffusion models (Voleti et al. 2024; Chai et al. 2023; Ouyang et al. 2024) require a reference image to steer the generative model toward the corresponding video. Compared to image-driven video diffusion models, video interpolation techniques (Huang et al. 2022; Li et al. 2023) can better maintain consistency in both resolution and the motion of moving objects across consecutive frames, which suggests considerable potential for image-driven video generation. In contrast to I2V models, IT2V generative models incorporate textual guidance to enhance video generation. For instance, DiffMorpher (Zhang et al. 2024a) adds visual descriptions for images and adopts latent interpolation with adaptation training to produce smooth transitions. Building upon this concept, video generation can also be enhanced by incorporating temporal descriptions.

Cross-Modality Compression

Multimodal generation has been effectively applied in the field of compression (Lu et al. 2022). To preserve semantic information at ELB, CMC (Li et al. 2021) integrates the I2T translation model with the T2I (Text-to-Image) generation model. Building on this foundation, SCMC (Zhang et al. 2023) introduces a scalable cross-modality compression paradigm that hierarchically represents images across different modalities, thereby enhancing both semantic and signal-level fidelity. Subsequently, VR-CMC (Gao et al. 2024), a variable-rate cross-modal compression technique, employs variable-rate prompts to capture data at varying levels of granularity. Additionally, a CMC benchmark has been established for image compression (Li et al. 2024). These works report that combining I2T and T2I models can outperform advanced visual signal codecs in such low-bitrate settings. Despite these advancements, there remains limited research focused on cross-modality video coding.

CMVC Scheme

Overview

We propose a CMVC paradigm for efficient video compression with high semantic and perceptual quality, especially at low bitrates, as illustrated in Fig. 1. Given a video $V=\{v_{i}\}$, where $v_{i}$ denotes a video frame and $i\in\{1,\dots,N\}$, consisting of spatial (keyframe) and temporal (motion) components, the goal is to compress these components into compact multimodal representations. We leverage MLLMs, specifically V2T models, to map keyframes and motion into textual representations. The resulting multimodal representations are then compressed using dedicated encoders, yielding compressed representations of keyframes and motion. The video is reconstructed by a decoder operating in one of two modes: TT2V, which prioritizes semantic consistency, and IT2V, which focuses on perceptual quality. This approach enables high compression ratios while preserving both semantic information and perceptual quality.

CMVC Encoder

Keyframe selection strategy. Keyframes divide a full-length video sequence into clips. Let $n$ denote the number of keyframes, which yields $n-1$ clips, with the first and last frames initially designated as keyframes. The first frame is encoded using the CLIP encoder (Radford et al. 2021) to extract a high-level feature vector $c_{k}$ containing concise semantic information. We calculate the cosine similarity between the first frame and subsequent frames as follows:

\mathcal{D}_{c}=\frac{c_{k}\cdot c_{k+i}}{\left\|c_{k}\right\|\cdot\left\|c_{k+i}\right\|}=\frac{\sum_{j=1}^{m}c_{k,j}\cdot c_{k+i,j}}{\sqrt{\sum_{j=1}^{m}\left(c_{k,j}\right)^{2}\cdot\sum_{j=1}^{m}\left(c_{k+i,j}\right)^{2}}},

(1)

where $c_{k+i}$ is the feature vector extracted from a subsequent frame and $m$ is the number of components of the vectors $c_{k}$ and $c_{k+i}$. Within a uniform interval, we select the frame with the smallest similarity to the previous keyframe, so that the set of keyframes better captures significant motion. Then $c_{k}$ is replaced by the features of the newly selected keyframe, which makes the reference dynamic. This iterative process is repeated for subsequent clips, systematically identifying representative keyframes.
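
The selection loop described above can be sketched as follows. This is a minimal illustration and not the authors' implementation: the feature vectors are assumed to be precomputed (for example by a CLIP image encoder), are assumed to be non-zero, and the names `select_keyframes` and `interval` are hypothetical. How the uniform interval is chosen is not specified in the text, so it is left as a parameter.

```python
import math

def cosine_similarity(a, b):
    """Eq. (1): dot product divided by the product of the L2 norms."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a) * sum(y * y for y in b))
    return dot / norm

def select_keyframes(features, interval):
    """Pick one keyframe per uniform interval (assumed reading of the text).

    features: list of per-frame feature vectors (stand-ins for CLIP features),
              with at least two frames.
    interval: hypothetical window length in frames.
    The first and last frames are always keyframes.
    """
    last = len(features) - 1
    keyframes = [0]
    start = 1
    while start < last:
        window = range(start, min(start + interval, last))
        ref = features[keyframes[-1]]  # c_k: features of the previous keyframe
        # Least similar frame in the window, i.e. the one showing the most change.
        best = min(window, key=lambda i: cosine_similarity(ref, features[i]))
        keyframes.append(best)
        start += interval
    keyframes.append(last)
    return keyframes
```

Because the reference vector is replaced after every selection, each window is compared against the most recent keyframe rather than against the first frame of the video.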

Multimodality representation. In our proposed scheme, we focus on efficiently representing the spatial and temporal information of videos through keyframes and motion. Specifically, let $V$ represent the original video. The keyframes are denoted as $K=\{k_{1},k_{2},\dots,k_{n}\}$, where each $k_{i}$ is a keyframe. The motion information between consecutive keyframes is represented by $M=\{m_{1},m_{2},\dots,m_{n-1}\}$, where $m_{j}$ is the motion between the $j$-th and $(j+1)$-th keyframes. Keyframes and motion are transformed into multimodality representations as follows: $T_{k,i}=f(k_{i})$ and $T_{m,j}=g(m_{j})$. Here, $f(\cdot)$ and $g(\cdot)$ denote the cross-modality representation processes for keyframes and motion, respectively. As illustrated in Fig. 1, keyframes and motion can be transformed into textual and visual representations. Thus, the total bitrate is given by:

R_{total}=\sum_{i=1}^{n}R_{k}(T_{k,i})+\sum_{j=1}^{n-1}R_{m}(T_{m,j}),

(2)

where $R_{k}$ and $R_{m}$ denote the bitrates produced by the entropy coding modules for keyframes and motion, respectively. The total bitrate can be adjusted through $n$ and the compression ratios of the keyframe and motion representations.
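
As a rough illustration of Eq. (2), the sketch below sums the coded size of $n$ keyframe payloads and $n-1$ motion payloads. It rests on assumptions: `zlib` is used purely as a stand-in for the entropy coders (the paper does not specify them here), and the function names are hypothetical.

```python
import zlib

def coded_bits(payload: bytes) -> int:
    """Bits after a generic lossless coder (zlib, a stand-in for R_k / R_m)."""
    return 8 * len(zlib.compress(payload, 9))

def total_bits(keyframe_reprs, motion_reprs):
    """Eq. (2): sum over n keyframe payloads plus n-1 motion payloads."""
    if len(motion_reprs) != len(keyframe_reprs) - 1:
        raise ValueError("expected n keyframes and n-1 motion descriptions")
    keyframe_bits = sum(coded_bits(k) for k in keyframe_reprs)
    motion_bits = sum(coded_bits(m) for m in motion_reprs)
    return keyframe_bits + motion_bits

# Example payloads: text descriptions in TT2V; in IT2V the keyframe payloads
# would instead be compressed image bitstreams.
keyframes = [b"a dog standing on grass", b"a dog next to a tree"]
motions = [b"the dog runs to the right"]
bits = total_bits(keyframes, motions)
```

Adding a keyframe adds one keyframe term and one motion term to the sum, which is why $n$ acts as a direct rate control.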

CMVC Decoder

In the decoder, we utilize the decoded keyframe $\hat{K}$ and motion $\hat{M}$ to achieve video generation, as follows:

\hat{V}=\mathcal{G}(\hat{K},\hat{M}),

(3)

where $\mathcal{G}(\cdot)$ is a video generation model and $\hat{V}$ is the reconstructed video. Based on the different modality representations for keyframes and motion, we design two modes, the TT2V mode and the IT2V mode, for ULB and ELB coding, respectively.

In the TT2V mode, we utilize state-of-the-art (SoTA) video generation models to generate videos from the decoded keyframe and motion descriptions. Leveraging advancements in these models, we optimize semantic reconstruction; our results show that more detailed descriptions cost more bits but improve semantic quality. In the IT2V mode, keyframe images and motion descriptions are integrated to enhance perceptual quality. In addition to employing existing IT2V models, we propose a generative model utilizing LoRA tuning to improve perceptual consistency at ELB.

The IT2V generative model.

The IT2V mode is designed to reconstruct a video from keyframe images and a textual description of motion. We therefore propose an IT2V generative model, which generates a video clip from two keyframe images ($I_{0}$ and $I_{1}$) and the description of the motion in this clip. Specifically, we adopt a Stable Diffusion (SD) model with LoRA, which fine-tunes the model parameters $\theta$ by training a low-rank residual component $\Delta\theta$. This residual can be decomposed into a product of low-rank matrices. LoRA is efficient at generating various samples while maintaining a consistent semantic identity across different latent noise traversals. The proposed IT2V model is shown in Fig. 2. We first train two LoRAs ($\Delta\theta_{0}$ and $\Delta\theta_{1}$) on the SD UNet $\epsilon_{\theta}$, one for each of the two images $I_{0}$ and $I_{1}$. The learning objective of $\Delta\theta_{i}$ $(i=0,1)$ is:

\mathcal{L}\left(\Delta\theta_{i}\right)=\mathbb{E}_{\epsilon,t}\left[\left\|\epsilon-\epsilon_{\theta+\Delta\theta_{i}}\left(\mathbf{z}_{t,i},t,\mathbf{c}_{i}\right)\right\|^{2}\right],

(4)

where $\mathbf{z}_{t,i}=\sqrt{\bar{\alpha}_{t}}\hat{\mathbf{z}}_{i}+\sqrt{1-\bar{\alpha}_{t}}\epsilon$ is the noised latent embedding at diffusion step $t$, $\hat{\mathbf{z}}_{i}$ is the VAE-encoded latent of image $I_{i}$, $\epsilon\sim\mathcal{N}(\mathbf{0},\mathbf{I})$ is randomly sampled Gaussian noise, $\mathbf{c}_{i}$ is the motion embedding encoded from the motion prompt, and $\epsilon_{\theta+\Delta\theta_{i}}$ represents the LoRA-integrated UNet. The fine-tuning objective is optimized separately via gradient descent with respect to $\Delta\theta_{0}$ and $\Delta\theta_{1}$.
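
To make the pieces of Eq. (4) concrete, the following sketch evaluates the forward noising step and the squared-error term on plain Python lists, and counts the parameters of a low-rank residual. It is illustrative only: the latents are flattened toy vectors, the noise predictor is not modelled, and the shapes `d`, `k` and rank `r` are hypothetical values rather than settings from the paper.

```python
import math

def lora_param_count(d, k, r):
    """Parameters of a rank-r residual B (d x r) times A (r x k),
    compared with a full d x k update. Returns (low_rank, full)."""
    return r * (d + k), d * k

def noised_latent(z_hat, alpha_bar_t, eps):
    """z_t = sqrt(alpha_bar_t) * z_hat + sqrt(1 - alpha_bar_t) * eps, element-wise."""
    a = math.sqrt(alpha_bar_t)
    b = math.sqrt(1.0 - alpha_bar_t)
    return [a * z + b * e for z, e in zip(z_hat, eps)]

def noise_prediction_loss(eps, eps_pred):
    """Squared L2 error between sampled and predicted noise (one sample of Eq. 4)."""
    return sum((e - p) ** 2 for e, p in zip(eps, eps_pred))
```

For a hypothetical 320 x 320 weight matrix and rank 4, the residual has 2,560 trainable parameters instead of 102,400, which is what makes training a separate LoRA per keyframe affordable.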

Frame and model interpolation. To generate $I^{\prime}_{w}$, we first interpolate the keyframes to form the input, as follows:

I_{w}^{\prime}=w_{i}\times I_{0}+\left(1-w_{i}\right)\times I_{1}.

(5)

Building upon DiffMorpher (Zhang et al. 2024a), we then interpolate the model weight $\Delta\theta_{l}$ according to $\Delta\theta_{0}$ and $\Delta\theta_{1}$ :

\Delta\theta_{l}=w_{l}\times\Delta\theta_{0}+\left(1-w_{l}\right)\times\Delta\theta_{1}.

(6)

$\Delta\theta_{l}$ denotes the interpolated LoRA parameters, which are integrated into the UNet as $\epsilon_{\theta+\Delta\theta_{l}}$.
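
Eqs. (5) and (6) apply the same convex combination, once to pixels and once to LoRA weights. A minimal sketch, assuming frames and weight tensors are flattened into Python lists and LoRA parameters are stored in a dictionary keyed by layer name (a hypothetical layout):

```python
def lerp(a, b, w):
    """w * a + (1 - w) * b applied element-wise, as in Eqs. (5) and (6)."""
    return [w * x + (1.0 - w) * y for x, y in zip(a, b)]

def interpolate_frame(i0, i1, w_i):
    """Eq. (5): blended keyframe used as input for the w-th frame."""
    return lerp(i0, i1, w_i)

def interpolate_lora(delta0, delta1, w_l):
    """Eq. (6): per-layer blend of two LoRA residuals with matching keys."""
    return {name: lerp(delta0[name], delta1[name], w_l) for name in delta0}
```

With this convention a weight of 1 reproduces $I_{0}$ (or $\Delta\theta_{0}$) and a weight of 0 reproduces $I_{1}$ (or $\Delta\theta_{1}$).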

A uniform linear interpolation schedule may result in an uneven transition. Thus, we train $w_{i}$ and $w_{l}$ online at the encoder side under the constraint $D(I_{w},\hat{I}_{w})$, where $D(\cdot)$ is the $\mathcal{L}_{2}$ loss between the original frame $I_{w}$ and its reconstruction $\hat{I}_{w}$. Only $w_{i}$ and $w_{l}$ are updated, as follows:

w^{t+1}_{i}=w^{t}_{i}-\alpha\nabla D(w^{t}_{i}),

(7)

w^{t+1}_{l}=w^{t}_{l}-\alpha\nabla D(w^{t}_{l}),

(8)

where $w^{t}_{i}$ and $w^{t}_{l}$ are the parameters $w_{i}$ and $w_{l}$ at training step $t$, respectively, $\alpha$ is the learning rate, set to 0.001, and $\nabla D(\cdot)$ denotes the gradient of the loss function with respect to the parameters at training step $t$. After obtaining the optimal $w_{i}$ and $w_{l}$, we compress and transmit them to the decoder. The VAE decoder then reconstructs the denoised latent representation into the $w$-th frame, resulting in $\hat{I}_{w}$.
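
The update rule in Eq. (7) can be illustrated on a deliberately simplified problem. In the paper, $\hat{I}_{w}$ is produced by the diffusion model; in the sketch below the prediction is assumed to be just the pixel blend of Eq. (5), so that the gradient of the $\mathcal{L}_{2}$ loss has a closed form. The function name and the toy data are hypothetical.

```python
def l2_loss(target, pred):
    """Sum of squared differences between two flattened frames."""
    return sum((t - p) ** 2 for t, p in zip(target, pred))

def fit_blend_weight(i0, i1, target, w=0.5, lr=0.001, steps=200):
    """Gradient descent on a single blend weight w (simplified stand-in for Eq. 7).

    The prediction is w * i0 + (1 - w) * i1, so
    dD/dw = -2 * sum((target - pred) * (i0 - i1)).
    """
    for _ in range(steps):
        pred = [w * a + (1.0 - w) * b for a, b in zip(i0, i1)]
        grad = -2.0 * sum((t - p) * (a - b)
                          for t, p, a, b in zip(target, pred, i0, i1))
        w -= lr * grad
    return w

# Toy example: the target frame is 80% of i0 and 20% of i1.
i0, i1 = [1.0, 0.0], [0.0, 1.0]
target = [0.8, 0.2]
w_fit = fit_blend_weight(i0, i1, target, lr=0.1)
```

The default learning rate matches the value of 0.001 stated above; the toy call uses a larger step only so that the two-pixel example converges within the default number of iterations.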

Experiments

Experimental settings

Datasets. The datasets, including HEVC Class B, C, D, and E, as well as UVG and MCL-JCV, are extensively utilized for evaluating both traditional and neural video codecs. These datasets vary in resolution and content, providing a diverse range of scenarios for comprehensive assessment. To ensure compatibility with various video codecs, we resize videos to dimensions that are multiples of 64 for both width and height.
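
The resizing step can be sketched as below. The text does not say whether dimensions are rounded up, down, or to the nearest multiple, so rounding down is an assumption here, and the helper names are hypothetical.

```python
def floor_to_multiple(size: int, base: int = 64) -> int:
    """Largest multiple of `base` not exceeding `size` (at least one block)."""
    return max(base, (size // base) * base)

def target_resolution(width: int, height: int, base: int = 64):
    """Width and height snapped down to multiples of `base` (assumed direction)."""
    return floor_to_multiple(width, base), floor_to_multiple(height, base)
```

Under this assumption a 1920x1080 sequence would be resized to 1920x1024 and a 416x240 sequence to 384x192.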

Table 1: BD-Rate (%) comparison of different video generation models across various datasets in terms of DISTS. The anchor is VTM with QP={52, 50, 47, 45}.

Dataset	RIFE	AMT	DiffMorpher	Ours
Class B	4.32	53.97	-45.94	-59.12
Class C	42.58	50.57	-31.06	-17.27
Class D	-4.89	-29.88	-49.06	-52.16
UVG	24.08	15.00	11.55	-21.18
MCL-JCV	83.08	151.20	22.16	-7.25
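
For readers unfamiliar with the metric, the standard Bjøntegaard delta rate compares two rate-quality curves over their common quality range. The expression below is the usual definition, given here as background; the paper does not state the fitting procedure used for its curves, so the interpolation of $r_A$ and $r_T$ is left unspecified.

```latex
% r_A(D), r_T(D): fitted curves of \log_{10}(\text{bitrate}) as a function of the
%                 quality score D, for the anchor (VTM) and the tested method
% [D_L, D_H]:     overlap of the quality ranges covered by both methods
\Delta R_{\mathrm{BD}} =
  \left( 10^{\frac{1}{D_H - D_L}\int_{D_L}^{D_H}\left( r_T(D) - r_A(D) \right)\, dD} - 1 \right) \times 100\%
```

A negative value means the tested method needs fewer bits than the anchor at the same DISTS score, so lower entries in Table 1 are better.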

Comparison methods. There are numerous SoTA foundation models available for video understanding. We choose two prominent models, namely VideoLLaVA (Lin et al. 2023) and VideoLLaMA (Zhang, Li, and Bing 2023), to extract semantic information from videos. This process corresponds to the V2T stage depicted in Fig. 1, where the selected models extract semantic descriptions for keyframes and motion. In the TT2V mode, numerous video generation models are available. We employ advanced video generation models, including Open-Sora (Lab and etc. 2024), VideoCrafter1 (Chen et al. 2023), VideoCrafter2 (Chen et al. 2024a), and AnimateDiff (Guo et al. 2024), to generate videos from textual input. In addition, we compare with the video codec VTM at an extremely low bitrate with QP=63. In the IT2V mode, we compare with traditional video codecs, namely x264, x265, and VTM (Bross et al. 2021). We also evaluate our method against deep video codecs such as DCVC (Li, Li, and Lu 2021) and DCVC-DC (Li, Li, and Lu 2023), although these codecs have difficulty reaching extremely low bitrates. In addition, we compare with the video generation technique DiffMorpher (Zhang et al. 2024a), which requires keyframe images and motion descriptions to control video generation. The video interpolation methods we compare with rely solely on keyframes for control and do not incorporate motion descriptions. Furthermore, it should be emphasized that the bit consumption associated with motion text has not been calculated.

Experimental results

Comparison in the TT2V mode. We conduct a comparative analysis of two V2T models, VideoLLaVA and VideoLLaMA, both of which are SoTA MLLMs. Subsequently, we extend our comparison to include five video generation models: VideoCrafter1, VideoCrafter2, ModelScope, Open-Sora, and AnimateDiff. Furthermore, we compare these models against the traditional video codec VTM at QP=63, which results in a higher bitrate than our proposed scheme. Our assessment focuses on five aspects: subject consistency, background consistency, temporal flickering, motion smoothness, and frame quality. The results, illustrated in Fig. 3, indicate that the TT2V generation models outperform VTM, showing better frame-wise quality and better consistency in both background and subject representation. These results reflect the average performance across all testing datasets, and detailed comparison results can be found in the supplementary material. The visual quality comparisons illustrated in Fig. 3 indicate that VTM displays considerable blocking artifacts, which severely hinder its ability to convey semantic information.

Table 2: Ablation studies on the keyframe selection strategy.

Models	Settings	BD-Rate(%) $\downarrow$
Sampling strategies	Uniform sampling	-19.705
	Random sampling	-9.859
	MSE	-14.496
	CS	-24.206
Keyframe number	2	-20.001
	3	-5.104
	4	12.665
Keyframe quality	low	-24.206
	medium	-12.043
	high	-2.617

Comparison in the IT2V mode. We compare our model with traditional codecs (x264, x265, and VTM) as well as deep video codecs (DCVC and DCVC-DC). As presented in Fig. 5, we use DISTS to evaluate perceptual quality. Additional comparisons with other evaluation metrics, such as LPIPS, FID, and PSNR, are provided in the supplementary material. However, the pretrained models provided by deep video codecs have limitations in achieving ELB. In addition, we compare our proposed model with various video generation models, including RIFE (Huang et al. 2022), AMT (Li et al. 2023), and DiffMorpher (Zhang et al. 2024a), as detailed in Table 1. By adjusting the number and quality of keyframe images, we can effectively control the bitrate. For these comparisons, we report the best results of each method; the corresponding settings can be found in the supplementary material. Our model achieves the best BD-Rate on most datasets and is more stable than the other video generation models. The visual quality is evaluated at similar bitrates, as shown in Fig. 4 and Fig. 6. Our proposed model exhibits superior perceptual quality in both spatial and temporal dimensions. Additionally, we showcase frames sampled from the decoded videos generated by the TT2V mode and the IT2V mode. The TT2V mode effectively preserves semantic consistency with the ground truth, while the IT2V mode further ensures perceptual consistency.

Ablation studies

Keyframe. We perform ablation studies focused on keyframes, examining keyframe selection methods, the quality of keyframe images, and the number of keyframe images. For keyframe selection, we evaluate several sampling strategies, including uniform sampling and random sampling. Since these techniques do not rely on a distance function, we also compare a sampling strategy based on mean-square error (MSE) distance with ours, which is based on cosine similarity (CS). Regarding the quality of keyframe images, we vary the quality level among low, medium, and high, which correspond to the compression factors 64, 128, and 256, respectively. The results presented in Table 2 indicate that higher-quality decoded images increase bitrate consumption, so higher quality does not necessarily lead to a better BD-Rate. When adjusting the number of keyframes relative to the number of frames in the video, we observe that a lower number of keyframes gives a better balance between quality and bitrate consumption.

IT2V generative model. We perform ablation studies on different settings, including the influence of the motion description, different codecs, updating strategies, training steps, and sampling steps. For the motion description, we compare against the model without motion description, as shown in Table 3. The results indicate that incorporating the motion description significantly enhances video reconstruction quality. Additionally, we explore a range of codecs for keyframe images, namely Hyperprior (Ballé et al. 2018), NIC (Chen et al. 2021), and NVTC (Feng et al. 2023). Among these, NVTC stands out by demonstrating superior reconstructed quality while maintaining a lower coding rate. Our model requires updating $w_{i}$ and $w_{l}$ based on the input, so we further evaluate the effectiveness of the updating strategies, as reported in Table 3. Moreover, we examine the effect of varying the training and sampling steps. Increasing the number of training or sampling steps improves results, with small gains beyond 100 training steps and 50 sampling steps. To strike a balance between performance and computational efficiency, we choose 100 training steps and 50 sampling steps for our final implementation.

Table 3: Ablation studies of the IT2V mode.

Models	Settings	BD-Rate(%) $\downarrow$
Motion description	$\times$	8.516
Motion description	$\checkmark$	-24.206
Codecs	Hyperprior	-12.261
	NIC	-16.132
	NVTC	-24.206
Updating $w_{i}$	$\times$	-12.402
Updating $w_{l}$	$\times$	-18.551
Updating $w_{i}$ and $w_{l}$	$\times$	-2.351
Updating $w_{i}$ and $w_{l}$	$\checkmark$	-24.206
Training step	50	-17.864
	100	-24.206
	150	-25.196
Sampling step	20	-4.035
	50	-24.206
	100	-25.207

Discussion

Application. The proposed method enables efficient transmission at low bitrates while preserving semantic content, making it well suited to bandwidth-limited or emergency alert scenarios. When bandwidth allows, keyframe data can be transmitted alongside the textual descriptions, allowing the decoder to reconstruct an approximate video. This hybrid approach balances visual quality with data efficiency and suits situations where full video streaming is infeasible, such as disaster response and remote surveillance. Although still in the research phase, the method shows potential for real-world applications. With advances in computational resources and model optimization techniques like pruning and quantization, it may become a practical solution for emergency communications and other bandwidth-constrained environments.

Higher bitrate. CMVC includes the TT2V mode and IT2V mode, but more modes can be further explored. For instance, motion representation can be realized through optical flow or trajectories. By integrating multiple modalities of keyframes and motion, we can cater to diverse reconstruction requirements. Furthermore, future efforts should prioritize enhancing CMVC at higher bitrates by integrating more control information to facilitate the reconstruction of the original video. This approach aims to achieve superior performance across all bitrates and dimensions when compared to traditional codecs.

Conclusion

We propose a CMVC paradigm that represents a promising advancement in video coding technology. This framework effectively tackles the challenges of preserving semantic integrity and perceptual consistency at ULB and ELB. By leveraging MLLMs and cross-modality representation techniques, the proposed CMVC framework disentangles videos into content and motion components, transforming them into different modalities for efficient compression and reconstruction. Through the TT2V and IT2V modes, CMVC achieves a balance between semantic information and perceptual quality, offering a comprehensive solution at high compression ratios.

References

Ballé et al. (2018) Ballé, J.; Minnen, D.; Singh, S.; Hwang, S. J.; and Johnston, N. 2018. Variational image compression with a scale hyperprior. In ICLR.
Bhaskaran and Konstantinides (1997) Bhaskaran, V.; and Konstantinides, K. 1997. Image and video compression standards: algorithms and architectures. Springer Science & Business Media.
Blattmann et al. (2023) Blattmann, A.; Rombach, R.; Ling, H.; Dockhorn, T.; Kim, S. W.; Fidler, S.; and Kreis, K. 2023. Align your latents: High-resolution video synthesis with latent diffusion models. In CVPR, 22563–22575.
Bross et al. (2021) Bross, B.; Wang, Y.-K.; Ye, Y.; Liu, S.; Chen, J.; Sullivan, G. J.; and Ohm, J.-R. 2021. Overview of the Versatile Video Coding (VVC) Standard and its Applications. T-CSVT, 31(10): 3736–3764.
Chai et al. (2023) Chai, W.; Guo, X.; Wang, G.; and Lu, Y. 2023. Stablevideo: Text-driven consistency-aware diffusion video editing. In ICCV, 23040–23050.
Chen et al. (2023) Chen, H.; Xia, M.; He, Y.; Zhang, Y.; Cun, X.; Yang, S.; Xing, J.; Liu, Y.; Chen, Q.; Wang, X.; Weng, C.; and Shan, Y. 2023. VideoCrafter1: Open Diffusion Models for High-Quality Video Generation. arXiv preprint arXiv:2310.19512.
Chen et al. (2024a) Chen, H.; Zhang, Y.; Cun, X.; Xia, M.; Wang, X.; Weng, C.; and Shan, Y. 2024a. Videocrafter2: Overcoming data limitations for high-quality video diffusion models. In CVPR, 7310–7320.
Chen et al. (2021) Chen, T.; Liu, H.; Ma, Z.; Shen, Q.; Cao, X.; and Wang, Y. 2021. End-to-End Learnt Image Compression via Non-Local Attention Optimization and Improved Context Modeling. TIP, 30: 3179–3191.
Chen et al. (2024b) Chen, X.; Wang, Y.; Zhang, L.; Zhuang, S.; Ma, X.; Yu, J.; Wang, Y.; Lin, D.; Qiao, Y.; and Liu, Z. 2024b. SEINE: Short-to-Long Video Diffusion Model for Generative Transition and Prediction. In ICLR.
Esser et al. (2023) Esser, P.; Chiu, J.; Atighehchian, P.; Granskog, J.; and Germanidis, A. 2023. Structure and content-guided video synthesis with diffusion models. In ICCV, 7346–7356.
Fei et al. (2024) Fei, H.; Wu, S.; Ji, W.; Zhang, H.; Zhang, M.; Lee, M.-L.; and Hsu, W. 2024. Video-of-thought: Step-by-step video reasoning from perception to cognition. In Forty-first International Conference on Machine Learning.
Feng et al. (2023) Feng, R.; Guo, Z.; Li, W.; and Chen, Z. 2023. NVTC: Nonlinear Vector Transform Coding. In CVPR, 6101–6110.
Gao et al. (2024) Gao, J.; Li, J.; Jia, C.; Wang, S.; Ma, S.; and Gao, W. 2024. Cross Modal Compression With Variable Rate Prompt. TMM, 26: 3444–3456.
Ge et al. (2023) Ge, S.; Nah, S.; Liu, G.; Poon, T.; Tao, A.; Catanzaro, B.; Jacobs, D.; Huang, J.-B.; Liu, M.-Y.; and Balaji, Y. 2023. Preserve your own correlation: A noise prior for video diffusion models. In ICCV, 22930–22941.
Guo et al. (2024) Guo, Y.; Yang, C.; Rao, A.; Wang, Y.; Qiao, Y.; Lin, D.; and Dai, B. 2024. AnimateDiff: Animate your personalized text-to-image diffusion models without specific tuning. In ICLR.
Han et al. (2022) Han, L.; Ren, J.; Lee, H.-Y.; Barbieri, F.; Olszewski, K.; Minaee, S.; Metaxas, D.; and Tulyakov, S. 2022. Show me what and tell me how: Video synthesis via multimodal conditioning. In CVPR, 3615–3625.
Ho et al. (2022) Ho, J.; Chan, W.; Saharia, C.; Whang, J.; Gao, R.; Gritsenko, A.; Kingma, D. P.; Poole, B.; Norouzi, M.; Fleet, D. J.; et al. 2022. Imagen Video: High definition video generation with diffusion models. arXiv preprint arXiv:2210.02303.
Huang et al. (2022) Huang, Z.; Zhang, T.; Heng, W.; Shi, B.; and Zhou, S. 2022. Real-Time Intermediate Flow Estimation for Video Frame Interpolation. In ECCV.
Lab and etc. (2024) PKU-Yuan Lab; and Tuzhan AI etc. 2024. Open-Sora-Plan. GitHub repository.
Li et al. (2024) Li, C.; Wu, X.; Wu, H.; Feng, D.; Zhang, Z.; Lu, G.; Min, X.; Liu, X.; Zhai, G.; and Lin, W. 2024. CMC-Bench: Towards a New Paradigm of Visual Signal Compression. arXiv preprint arXiv:2406.09356.
Li et al. (2021) Li, J.; Jia, C.; Zhang, X.; Ma, S.; and Gao, W. 2021. Cross Modal Compression: Towards Human-comprehensible Semantic Compression. In ACM MM, 4230–4238.
Li, Li, and Lu (2021) Li, J.; Li, B.; and Lu, Y. 2021. Deep Contextual Video Compression. In NeurIPS, volume 34.
Li, Li, and Lu (2023) Li, J.; Li, B.; and Lu, Y. 2023. Neural Video Compression with Diverse Contexts. In CVPR.
Li et al. (2023) Li, Z.; Zhu, Z.-L.; Han, L.-H.; Hou, Q.; Guo, C.-L.; and Cheng, M.-M. 2023. AMT: All-Pairs Multi-Field Transforms for Efficient Frame Interpolation. In CVPR.
Lin et al. (2023) Lin, B.; Zhu, B.; Ye, Y.; Ning, M.; Jin, P.; and Yuan, L. 2023. Video-LLaVA: Learning united visual representation by alignment before projection. arXiv preprint arXiv:2311.10122.
Lu et al. (2022) Lu, G.; Zhong, T.; Geng, J.; Hu, Q.; and Xu, D. 2022. Learning based multi-modality image and video compression. In CVPR, 6083–6092.
Ma et al. (2019) Ma, S.; Zhang, X.; Jia, C.; Zhao, Z.; Wang, S.; and Wang, S. 2019. Image and video compression with neural networks: A review. T-CSVT, 30(6): 1683–1698.
Ouyang et al. (2024) Ouyang, H.; Wang, Q.; Xiao, Y.; Bai, Q.; Zhang, J.; Zheng, K.; Zhou, X.; Chen, Q.; and Shen, Y. 2024. CoDeF: Content deformation fields for temporally consistent video processing. In CVPR, 8089–8099.
Radford et al. (2021) Radford, A.; Kim, J. W.; Hallacy, C.; Ramesh, A.; Goh, G.; Agarwal, S.; Sastry, G.; Askell, A.; Mishkin, P.; Clark, J.; Krueger, G.; and Sutskever, I. 2021. Learning Transferable Visual Models From Natural Language Supervision. In ICML.
Ruan et al. (2023) Ruan, L.; Ma, Y.; Yang, H.; He, H.; Liu, B.; Fu, J.; Yuan, N. J.; Jin, Q.; and Guo, B. 2023. MM-Diffusion: Learning multi-modal diffusion models for joint audio and video generation. In CVPR, 10219–10228.
Singer et al. (2023) Singer, U.; Polyak, A.; Hayes, T.; Yin, X.; An, J.; Zhang, S.; Hu, Q.; Yang, H.; Ashual, O.; Gafni, O.; et al. 2023. Make-A-Video: Text-to-video generation without text-video data. In ICLR.
Voleti et al. (2024) Voleti, V.; Yao, C.-H.; Boss, M.; Letts, A.; Pankratz, D.; Tochilkin, D.; Laforte, C.; Rombach, R.; and Jampani, V. 2024. SV3D: Novel multi-view synthesis and 3D generation from a single image using latent video diffusion. arXiv preprint arXiv:2403.12008.
Wang et al. (2023a) Wang, J.; Yuan, H.; Chen, D.; Zhang, Y.; Wang, X.; and Zhang, S. 2023a. ModelScope text-to-video technical report. arXiv preprint arXiv:2308.06571.
Wang et al. (2023b) Wang, Y.; Chen, X.; Ma, X.; Zhou, S.; Huang, Z.; Wang, Y.; Yang, C.; He, Y.; Yu, J.; Yang, P.; et al. 2023b. LaVie: High-quality video generation with cascaded latent diffusion models. arXiv preprint arXiv:2309.15103.
Yin et al. (2023) Yin, S.; Wu, C.; Liang, J.; Shi, J.; Li, H.; Ming, G.; and Duan, N. 2023. DragNUWA: Fine-grained control in video generation by integrating text, image, and trajectory. arXiv preprint arXiv:2308.08089.
Zhang, Li, and Bing (2023) Zhang, H.; Li, X.; and Bing, L. 2023. Video-LLaMA: An instruction-tuned audio-visual language model for video understanding. arXiv preprint arXiv:2306.02858.
Zhang et al. (2024a) Zhang, K.; Zhou, Y.; Xu, X.; Dai, B.; and Pan, X. 2024a. DiffMorpher: Unleashing the Capability of Diffusion Models for Image Morphing. In CVPR, 7912–7921.
Zhang et al. (2024b) Zhang, K.; Zhou, Y.; Xu, X.; Dai, B.; and Pan, X. 2024b. DiffMorpher: Unleashing the Capability of Diffusion Models for Image Morphing. In CVPR, 7912–7921.
Zhang et al. (2023) Zhang, P.; Wang, S.; Wang, M.; Li, J.; Wang, X.; and Kwong, S. 2023. Rethinking Semantic Image Compression: Scalable Representation with Cross-modality Transfer. T-CSVT.