Can video generation replace cinematographers? Research on the cinematic language of generated video

Xiaozhe Li¹*, Kai Wu²*, Siyi Yang¹, Yizhan Qu¹, Guohua Zhang¹, Zhiyu Chen¹,
Jiayao Li¹, Jiangchuan Mu¹, Xiaobin Hu³, Wen Fang¹,
Mingliang Xiong¹, Hao Deng¹†, Qingwen Liu¹†, Gang Li¹, Bin He¹
¹Tongji University ²ByteDance ³Technical University of Munich
{Lxxzzz, 2253110, 2251645, 2151368, 2354271, 2351405, 2411941}@tongji.edu.cn
{wen.fang, denghao1984, qliu, lig, hebin}@tongji.edu.cn
wukaiwork@outlook.com xiaobin.hu@tum.de xiongml@foxmail.com
*Equal contribution. †Corresponding authors.
Footnote 1: Codes, data, and model weights will be open-sourced after the paper is accepted.
Footnote 2: https://www.videvo.net

Abstract

Recent advancements in text-to-video (T2V) generation have leveraged diffusion models to enhance visual coherence in videos synthesized from textual descriptions. However, existing research primarily focuses on object motion, often overlooking cinematic language, which is crucial for conveying emotion and narrative pacing in cinematography. To address this, we propose a threefold approach to improve cinematic control in T2V models. First, we introduce a meticulously annotated cinematic language dataset with twenty subcategories, covering shot framing, shot angles, and camera movements, enabling models to learn diverse cinematic styles. Second, we present CameraDiff, which employs LoRA for precise and stable cinematic control, ensuring flexible shot generation. Third, we propose CameraCLIP, designed to evaluate cinematic alignment and guide multi-shot composition. Building on CameraCLIP, we introduce CLIPLoRA, a CLIP-guided dynamic LoRA composition method that adaptively fuses multiple pre-trained cinematic LoRAs, enabling smooth transitions and seamless style blending. Experimental results demonstrate that CameraDiff ensures stable and precise cinematic control, CameraCLIP achieves an R@1 score of 0.83, and CLIPLoRA significantly enhances multi-shot composition within a single video, bridging the gap between automated video generation and professional cinematography.¹

Figure 1: Cinematic language generation results of CameraDiff. CameraDiff generates twenty types of cinematic controls, including shot framing (a), shot angles (b), and camera movements (c) such as lens shifts (e.g., rack focus, zoom in/out) and camera body movements (e.g., tilt up/down). Additionally, it enables the flexible combination of multiple camera movements (d).

1 Introduction

Figure 2: Pipeline of our threefold approach. (a) Data processing: Stage 1, data collection and classification; Stage 2, human annotation; Stage 3, manual verification. (b) CameraCLIP training: The text encoder is trained on the last two layers, while the image encoder is trained on the last four layers. Each video is uniformly sampled into eight frames, encoded via the image encoder, and mean-pooled to obtain video features. These features, combined with text features, undergo contrastive learning in a joint space to enhance similarity. (c) CameraDiff: LoRA enables single-shot cinematic control, while CLIPLoRA facilitates multi-shot composition within a single video.

Recent advancements in text-to-video (T2V) generation have leveraged diffusion models to enhance the visual coherence of videos synthesized from textual descriptions [14, 24, 26, 4, 6, 15, 38]. However, most T2V research has primarily focused on object motion, prioritizing content alignment and visual consistency. Additionally, the high labor costs associated with collecting cinematic language data from real-world scenarios have hindered progress in text-controlled cinematic video generation, leaving it largely unexplored.

Current T2V models cannot replace skilled cinematographers due to their limited ability to generate complex cinematic language in videos. Previous works, such as AnimateDiff [12], introduced motion priors to enhance animation fluidity and utilized MotionLoRA for efficient fine-tuning of motion patterns. However, these methods restricted camera motion to basic operations like zooming and rotation. MotionCtrl [40] introduced modules for independent control of camera and object motion, while Direct-a-Video [43] aimed to replicate real-world video shooting by decoupling object and camera motion, enabling user control over translation, zoom, and directional movement. Despite these advancements, both MotionCtrl and Direct-a-Video struggle to generate a full range of cinematic styles, particularly in shot framing and shot angle control, and require additional camera parameter inputs to control camera movement, limiting flexible camera manipulation.

While these advancements have established a foundation for camera control in T2V models, significant limitations persist in generating diverse cinematic styles. The key challenges are: (1) the lack of a high-quality cinematic language dataset and (2) the absence of a method for multi-cinematic language composition to flexibly generate multi-shot videos. For example, producing a “long shot, eye-level, panning right, zoom-in, and tilt-up view of a bear walking in the snow near a lake” remains difficult. To address these challenges, we propose a threefold approach to enhance T2V models’ capability to generate high-quality, multi-shot composition videos.

Firstly, we introduce Cinematic2K, a comprehensive video cinematic language dataset comprising twenty subcategories and approximately 2,000 videos covering shot framing, shot angles, and camera movements. The dataset consists of video-text pairs representing diverse shot types captured from real-world scenarios. To ensure high quality, we perform meticulous human annotation and verification.

Secondly, we propose CameraDiff, which leverages Cinematic2K to enable precise control over cinematic language patterns, encompassing twenty cinematic types. This includes four shot framing styles, five shot angles, and eleven camera movements, such as lens shifts and camera body movements.

Finally, we introduce CameraCLIP, which enhances the CLIP [22] model’s ability to comprehend complex cinematic language in video content. Building on CameraCLIP, we propose CLIPLoRA, which employs CameraCLIP as an evaluator to dynamically guide LoRA composition. This approach adaptively selects the optimal pre-trained cinematic LoRAs for activation during the denoising stage, ensuring smooth transitions and seamless blending of cinematic styles within a single video.

Our specific contributions can be summarized as follows:

• We introduce Cinematic2K, a cinematic language video dataset comprising approximately 2,000 high-quality videos across twenty cinematic subcategories, including shot framing, shot angles, and camera movements. Additionally, we propose CameraDiff, which enables the generation of twenty distinct cinematic patterns.

• We introduce CameraCLIP, a model with a strong capability to align complex cinematic language with corresponding videos. Building on this, we propose CLIPLoRA, a CLIP-guided LoRA composition method that enables seamless multi-shot fusion within a single video.

• Experimental results show that CameraCLIP outperforms popular models in aligning videos with their cinematic language descriptions, achieving an R@1 score of 0.83. CameraDiff provides stable control over twenty shot types, while CLIPLoRA effectively composites multiple LoRAs.

2 Related Works

Image/Video CLIP Models. CLIP [22] is a multi-modal model based on contrastive learning, commonly used to assess image-text similarity. Trained on large-scale text-image pairs, CLIP excels in zero-shot generalization, enabling its application to tasks such as detection [11, 19], segmentation [42, 17], video understanding [20, 41], retrieval [36], and image generation [23, 9, 8, 34]. For video analysis, ViCLIP [39] incorporates spatio-temporal attention and partial random patch masking, while Long-CLIP [44] addresses embedding length limitations. VideoCLIP-XL [37] improves alignment for long-form video descriptions via text-similarity-guided matching. Despite these advancements, existing models focus on overall video semantics or long-text inputs and pay no specific attention to cinematic language such as shot framing, shot angle, and camera movement. To fill this gap, we introduce CameraCLIP, which enhances the model’s understanding of these cinematic elements.

Text-to-Video Generation for Camera Control. Early video generation research relied on GANs and VAEs [27, 31, 32, 35]. With the success of diffusion models in high-fidelity image generation [14, 24, 26], diffusion-based approaches have gained traction for video generation, improving visual coherence in text- or image-guided content [4, 6, 15, 38, 3]. Recent T2V work has focused on camera control. AnimateDiff [12] introduced motion priors for smoother animations, while CameraCtrl [13], MotionCtrl [40], and Direct-a-Video [43] offered more independent control over camera and object motion. However, AnimateDiff is limited to basic camera movements, and CameraCtrl, MotionCtrl, and Direct-a-Video require extra camera parameters, restricting flexible control.

Cinematic Dataset for T2V Model. Despite the availability of high-quality datasets for image/video understanding [5, 7, 21, 18] and generation [30, 2], comprehensive datasets for cinematic language understanding and generation remain scarce. RealEstate10K [46] captures only camera movement trajectories but includes numerous irregular motions, making precise control over basic operations like zooming or panning challenging. CineScale [28] and CineScale2 [29] focus on shot framing and angles, respectively, yet lack detailed camera movement information, as shown in Table 1, thereby limiting expressive cinematic video generation.

Multi-LoRA Composition. Low-Rank Adaptation (LoRA) [16] has shown great potential in text-to-image/video generation, enabling high-quality content with low computational cost. Building on this, LoRA Merge [25] was introduced to combine multiple LoRAs into a unified LoRA for diffusion models. Zhong et al. [45] expanded this with LoRA Switch and LoRA Composite, which focus on the diffusion denoising stage. LoRA Switch alternates LoRAs at specific denoising steps, while LoRA Composite averages guidance scores to improve composition quality and robustness. However, these methods are static and lack dynamic adaptation to specific tasks, limiting their ability to model complex cinematic language. To address this, we propose CLIPLoRA, a CLIP-guided composition approach that leverages CameraCLIP as an evaluator to optimize LoRA selection during diffusion, enhancing the realism of camera motion in video generation.

3 Method

Our threefold approach begins with data collection and processing to construct Cinematic2K (Section 3.1). Using cinematic captions and videos from Cinematic2K, we propose CameraCLIP to enhance cinematic text-video alignment (Section 3.2). Finally, we introduce CameraDiff, which first enables single-shot control using twenty classified cinematic videos and then utilizes CameraCLIP as an evaluator to guide multi-shot composition (Section 3.3).

3.1 Cinematic Language Dataset Construction

Dataset	M/F/A	Annotation
RealEstate10K [46]	✓ , ✗ , ✗	✗
CineScale [28]	✗ , ✓ , ✗	✓
CineScale2 [29]	✗ , ✗ , ✓	✓
Cinematic2K	✓ , ✓ , ✓	✓

Table 1: Comparison of video datasets in terms of cinematic language attributes, including camera movement (M), shot framing (F), and shot angle (A), as well as the availability of annotations. Cinematic2K provides the most comprehensive coverage, supporting all three attributes with full annotation, whereas other datasets lack complete coverage of these aspects.

As shown in Table 1, existing cinematic datasets are not systematically organized. They either do not contain sufficient cinematic language or lack annotations for specific cinematic types. It is therefore necessary to construct a cinematic language dataset to bridge this gap. We sourced videos from Pexels¹ and Videvo², two platforms providing free high-quality video data. The dataset construction process, as illustrated in Figure 2 (a), was divided into three steps.

Footnote 1: https://www.pexels.com

Step I: Cinematic Language Classification and Video Data Collection. We categorized cinematic language into three primary types: shot framing, shot angle, and camera movement. To efficiently build a high-quality cinematic language dataset on a large scale, we employed automated tools to extract content descriptions and cinematic language annotations for each video. In total, we gathered approximately 2,000 video samples. The detailed data distribution of the cinematic language dataset is shown in Figure 3.³

Footnote 3: Detailed descriptions of these 20 cinematic language types are provided in the Supplementary Material.

Step II: Fine-grained Annotation of Cinematic Language Intervals. Since the collected data originates from real-world scenarios, it may contain noise, such as inconsistent camera movements or transitions in shot types like ‘rack focus,’ which typically unfold over a few seconds. To enhance data quality, we manually annotated each video after Step I, precisely marking the start and end intervals of these fine-grained camera movement transitions.

Step III: Verification of Cinematic Language Annotations. Each video’s cinematic language comprises shot framing, shot angle, and camera movement, but automatically generated descriptions from online sources often contain inaccuracies or omit certain shot types. To address this, after completing the annotation process in Step II, we manually proofread and refine the dataset, supplementing missing shot types to ensure accuracy and completeness. Following human annotation, each cinematic entry is structured as: “shot angle” + “camera movement” + “shot framing.” Additionally, videos with distinct cinematic features are labeled as “Typical Video,” with a complete example provided in the Supplementary Material.
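To make the entry format concrete, a minimal sketch of one annotated record is given below. Only the "shot angle + camera movement + shot framing" ordering, the interval annotation from Step II, and the "Typical Video" label come from the text; the class name, field names, file path, and sample values are hypothetical and are not taken from the dataset itself.

```python
from dataclasses import dataclass


@dataclass
class CinematicEntry:
    """One annotated clip; every field name here is an illustrative assumption."""
    video_path: str
    content: str            # content description of the scene
    shot_angle: str         # e.g. "eye level"
    camera_movement: str    # e.g. "panning right"
    shot_framing: str       # e.g. "long shot"
    start_s: float          # start of the annotated interval, in seconds
    end_s: float            # end of the annotated interval, in seconds
    typical: bool = False   # True for clips labeled "Typical Video"

    def cinematic_caption(self) -> str:
        # Ordering follows the text: shot angle + camera movement + shot framing.
        return f"{self.shot_angle}, {self.camera_movement}, {self.shot_framing}"


entry = CinematicEntry(
    video_path="clips/example.mp4",  # hypothetical path
    content="a bear walking in the snow near a lake",
    shot_angle="eye level",
    camera_movement="panning right",
    shot_framing="long shot",
    start_s=1.5,
    end_s=4.0,
)
print(entry.cinematic_caption())
```

Keeping the three cinematic attributes as separate fields, rather than one free-text caption, would make it straightforward to group clips by category when training one LoRA per cinematic type.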

After the above data processing, we construct Cinematic2K, a dataset of approximately 2,000 high-quality cinematic language videos with meticulous human annotation and verification. Notably, real-world camera movement is complex, and excessive irrelevant motion can negatively impact data quality. Therefore, Cinematic2K strikes a balance between data quality and quantity to ensure reliable annotations while maintaining a diverse dataset.

[Figure 4 panels. Row 1 (shot framing): close up shot, medium shot, full shot, long shot. Row 2 (shot angles): low angle, eye level, high angle, dutch angle, bird angle. Row 3 (camera movements): dolly in, dolly out, panning left, panning right, rack focus, still. Row 4 (camera movements): tilt up, tilt down, tracking shot, zoom in, zoom out.]

Figure 4: Qualitative results of single-shot generation in CameraDiff. CameraDiff enables the generation of specific cinematic language for individual shot types. The first and second rows illustrate control over shot framing and shot angles, respectively, while the third and fourth rows demonstrate control over camera movements. Each cinematic type is annotated below the figure. Please open in Acrobat Reader and click the image to play the animation.

3.2 CameraCLIP

[Figure 5 panels, left to right: (1) Long shot, low angle, panning left and tilt down. (2) Long shot, high level, zoom in and panning right. (3) Long shot, eye level, zoom out and panning left. (4) Full shot, eye level, zoom in, panning right and tilt up. (5) Long shot, low angle, zoom out, panning left and tilt down.]

Figure 5: Qualitative results of multi-shot composition in CameraDiff. We combine single-shot LoRAs using CLIPLoRA to achieve blending of multiple shots within a single video, with cinematic language details provided below each figure. Please open in Acrobat Reader and click the image to play the animation.

Our CameraCLIP extends the capabilities of the pre-trained CLIP model [22] to the domain of cinematic language video-text alignment. Figure 2 (b) presents the architecture of our model, with its key components described in detail below.

Text Encoder. To adapt the CLIP model to our task while preserving its strong generalization capabilities, we fine-tune only the last two layers (layers 10 and 11) while freezing all earlier layers. This selective fine-tuning allows the text encoder to effectively capture cinematic language without overfitting, maintaining its broader linguistic representation.
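As a small illustration of this selective fine-tuning, the sketch below computes which transformer blocks remain trainable when only the last few are unfrozen. It assumes zero-based block indices, a 12-block text encoder, and a 24-block ViT; these depths are assumptions chosen to agree with the layer indices quoted in this section, and the helper names are hypothetical. No deep learning framework is involved.

```python
def trainable_blocks(num_blocks: int, num_last: int) -> list[int]:
    """Zero-based indices of the transformer blocks left unfrozen."""
    if not 0 <= num_last <= num_blocks:
        raise ValueError("num_last must lie in [0, num_blocks]")
    return list(range(num_blocks - num_last, num_blocks))


def is_trainable(block_index: int, num_blocks: int, num_last: int) -> bool:
    """True if the block at block_index would receive gradient updates."""
    return block_index >= num_blocks - num_last


# Assumed depths: 12 text blocks and 24 vision blocks.
text_blocks = trainable_blocks(12, 2)
vision_blocks = trainable_blocks(24, 4)
print(text_blocks)    # [10, 11]
print(vision_blocks)  # [20, 21, 22, 23]
```

In a real training loop, the same index test would typically decide which parameters have gradients enabled, with all earlier blocks left frozen.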

Vision Transformer (ViT) Encoder. To obtain a compact representation of the entire video, we sample 8 frames per video, resize and embed them, and then apply mean pooling to aggregate their features:

$$V=\frac{1}{N}\sum_{i=1}^{N}I_{i},$$

where $N$ denotes the number of sampled frames, and $I_{i}$ represents the feature of the $i$ -th frame. This approach balances information coverage and computational efficiency. To further adapt the ViT encoder to cinematic language tasks, we fine-tune its higher layers (layers 20 to 23), refining its visual representations while preserving the pre-trained model’s generalization ability.
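The frame sampling and pooling step can be sketched in plain Python as follows. The text only states that frames are sampled uniformly; taking the centre of each of $N$ equal segments is one possible reading and is an assumption here, as are the function names. Frame features are represented as plain lists of floats for illustration.

```python
def uniform_frame_indices(num_frames: int, num_samples: int = 8) -> list[int]:
    """Pick num_samples frame indices spread evenly over the clip."""
    if num_frames <= 0 or num_samples <= 0:
        raise ValueError("num_frames and num_samples must be positive")
    # Assumption: take the centre of each of num_samples equal-length segments.
    return [int((i + 0.5) * num_frames / num_samples) for i in range(num_samples)]


def mean_pool(frame_features: list[list[float]]) -> list[float]:
    """V = (1/N) * sum_i I_i, computed independently per feature dimension."""
    n = len(frame_features)
    if n == 0:
        raise ValueError("need at least one frame feature")
    dim = len(frame_features[0])
    return [sum(f[d] for f in frame_features) / n for d in range(dim)]


indices = uniform_frame_indices(80)              # an 80-frame clip, 8 samples
video_feature = mean_pool([[1.0, 2.0], [3.0, 4.0]])
print(indices)
print(video_feature)
```

Because mean pooling has no learnable parameters, it adds nothing that can overfit on a dataset of roughly 2,000 videos.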

Text-Video Similarity Computation. The similarity between the video feature vector $V$ and the text feature vector $T$ is computed using cosine similarity:

$$\text{cosine\_sim}(V,T)=\frac{V\cdot T}{\|V\|_{2}\|T\|_{2}}, \tag{1}$$

where:

• $V$ and $T$ are the video and text feature vectors, respectively.

• $\|V\|_{2}$ and $\|T\|_{2}$ denote their $L_{2}$-norms (Euclidean norms).

• $\cdot$ represents the dot product between the vectors.

The cosine similarity score ranges from $-1$ (vectors pointing in opposite directions) to $+1$ (vectors pointing in the same direction), with higher values indicating stronger text-video alignment. Dividing by the two norms ensures that the similarity measure depends only on the angle between the vectors rather than on their magnitudes.
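Equation (1) translates directly into a few lines of Python. The sketch below uses plain lists instead of tensors, and the function name simply mirrors the notation of the equation.

```python
import math


def cosine_sim(v: list[float], t: list[float]) -> float:
    """Cosine similarity between a video feature v and a text feature t."""
    if len(v) != len(t):
        raise ValueError("feature vectors must have the same dimension")
    dot = sum(a * b for a, b in zip(v, t))
    norm_v = math.sqrt(sum(a * a for a in v))
    norm_t = math.sqrt(sum(b * b for b in t))
    if norm_v == 0.0 or norm_t == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_v * norm_t)


print(cosine_sim([1.0, 0.0], [2.0, 0.0]))   # same direction
print(cosine_sim([1.0, 0.0], [0.0, 3.0]))   # orthogonal
print(cosine_sim([1.0, 0.0], [-4.0, 0.0]))  # opposite direction
```

Rescaling either input leaves the score unchanged, which is the magnitude invariance noted above.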

CameraCLIP Loss. To further align video and text modalities, we employ the InfoNCE loss, which maximizes the similarity of matching pairs while minimizing it for non-matching pairs (see Supplementary Material for details).
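Since the exact loss is deferred to the Supplementary Material, the following is only a sketch of the standard symmetric, CLIP-style InfoNCE objective over a batch, not necessarily the authors' formulation. The temperature value of 0.07 and the function name are assumptions. The input is a $B \times B$ similarity matrix whose diagonal holds the matching video-text pairs.

```python
import math


def info_nce(sim: list[list[float]], temperature: float = 0.07) -> float:
    """Symmetric InfoNCE over a B x B similarity matrix.

    sim[i][j] is the similarity between video i and text j; matching pairs
    sit on the diagonal. The temperature value is an assumption.
    """
    b = len(sim)
    if b == 0:
        raise ValueError("need at least one pair")

    def cross_entropy_rows(matrix: list[list[float]]) -> float:
        total = 0.0
        for i, row in enumerate(matrix):
            logits = [x / temperature for x in row]
            peak = max(logits)  # subtract the max for numerical stability
            log_z = peak + math.log(sum(math.exp(x - peak) for x in logits))
            total += log_z - logits[i]  # negative log-probability of the match
        return total / b

    transposed = [list(col) for col in zip(*sim)]
    # Average the video-to-text and text-to-video directions.
    return 0.5 * (cross_entropy_rows(sim) + cross_entropy_rows(transposed))


aligned = [[1.0, 0.0], [0.0, 1.0]]
shuffled = [[0.0, 1.0], [1.0, 0.0]]
print(info_nce(aligned) < info_nce(shuffled))
```

The loss is small when each video is most similar to its own caption and large when the highest similarities fall off the diagonal.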

3.3 CameraDiff and CLIPLoRA

As illustrated in Figure 2 (c), CameraDiff first takes classified videos from the 20 categories in Cinematic2K as input, using LoRA to train the T2V model on categorized cinematic pattern videos, enabling single-shot control. Each LoRA module encapsulates a distinct cinematic style, ranging from fundamental shot framing and angles (e.g., long shot and dutch angle) to complex camera movements such as rack focus. Generated results are presented in Figure 4.

Furthermore, we aim to enable multi-camera movement control within a single video. While previous multi-LoRA composition methods, including LoRA Merge, Switch, and Composite, have made notable progress in image generation, they rely on static, manually designed strategies. These approaches struggle to model the dynamic interactions between multiple LoRAs, leading to spatiotemporal inconsistencies and visual artifacts in video generation. To overcome these limitations, we propose CLIPLoRA, a method specifically designed for cinematic video generation that enables adaptive multi-LoRA composition.

We aim to determine the optimal LoRA to activate at each diffusion step during the denoising process. Given $k$ available LoRAs and $H$ diffusion steps, there are $k^{H}$ possible activation sequences, so exhaustive search is computationally infeasible. Inspired by LoRA Switch and genetic algorithms [10, 1], we represent each individual in the population as a sequence of LoRAs activated across the diffusion steps, sampled from the set $P$ of available LoRAs. This adaptive approach efficiently searches for a high-quality LoRA activation sequence, balancing computational complexity and generation quality. The algorithm proceeds as follows:

1. Initialization: Randomly sample an initial population of $N$ LoRA composition sequences $L_{i}$, $i\in\{1,2,\dots,N\}$, each representing a LoRA assignment across all diffusion steps.

2. Evaluation: Generate videos based on a set of test prompts $S$ using each sequence $L_{i}$, and compute the average CameraCLIP score as the fitness:

$$F(L_{i})=\frac{1}{|S|}\sum_{s\in S}\text{CameraCLIP}(L_{i},s). \tag{2}$$

3. Selection: Select the top $p\%$ of individuals with the highest fitness scores as elite candidates for the next generation.

4. Crossover: Perform crossover among selected individuals, exchanging segments of LoRA sequences between pairs of parents to generate diverse offspring.

5. Mutation: With mutation probability $p_{m}$, randomly alter individual genes (LoRA assignments at specific diffusion steps) to further explore the search space.

6. Iteration: Repeat steps 2–5 for $T$ generations until convergence or a predefined stopping criterion is satisfied.

This approach enables an efficient search for optimal LoRA configurations, ensuring high-quality video generation while significantly reducing the computational cost.
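The search procedure above can be sketched as a small genetic algorithm in plain Python. This is a toy illustration under stated assumptions, not the authors' implementation: the fitness function here is a hypothetical stand-in for the averaged CameraCLIP score of Equation (2) (which would require generating videos), and the population size, elite fraction, mutation probability, generation count, and single-point crossover are all assumed choices.

```python
import random


def genetic_search(fitness, num_loras, num_steps, pop_size=20,
                   elite_frac=0.25, mutation_prob=0.05, generations=30, seed=0):
    """Search for a per-step LoRA activation sequence that maximizes fitness."""
    if num_loras < 1 or num_steps < 2 or pop_size < 2:
        raise ValueError("need num_loras >= 1, num_steps >= 2, pop_size >= 2")
    rng = random.Random(seed)
    # 1. Initialization: a random LoRA index for every diffusion step.
    population = [[rng.randrange(num_loras) for _ in range(num_steps)]
                  for _ in range(pop_size)]
    for _ in range(generations):
        # 2. Evaluation and 3. Selection: keep the top fraction as elites.
        ranked = sorted(population, key=fitness, reverse=True)
        num_elite = min(pop_size, max(2, int(elite_frac * pop_size)))
        elites = ranked[:num_elite]
        children = []
        while len(elites) + len(children) < pop_size:
            # 4. Crossover: single-point exchange between two elite parents.
            mother, father = rng.sample(elites, 2)
            cut = rng.randrange(1, num_steps)
            child = mother[:cut] + father[cut:]
            # 5. Mutation: reassign each step with a small probability.
            child = [rng.randrange(num_loras) if rng.random() < mutation_prob else gene
                     for gene in child]
            children.append(child)
        # 6. Iteration: elites survive unchanged into the next generation.
        population = elites + children
    return max(population, key=fitness)


def toy_fitness(sequence):
    # Hypothetical stand-in for the averaged CameraCLIP score: rewards LoRA 0
    # in the first half of the steps and LoRA 1 in the second half.
    half = len(sequence) // 2
    return (sum(gene == 0 for gene in sequence[:half])
            + sum(gene == 1 for gene in sequence[half:]))


best = genetic_search(toy_fitness, num_loras=3, num_steps=10)
print(best, toy_fitness(best))
```

With 3 LoRAs and 10 steps the exhaustive space already has $3^{10} = 59{,}049$ sequences, whereas this sketch evaluates at most 20 individuals per generation. In practice each fitness call involves video generation, so caching scores for sequences that have already been evaluated would matter far more than it does in this toy.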

4 Experiment

In this section, we evaluate the performance of CameraCLIP and CameraDiff. We begin by outlining the basic experimental settings. Next, we assess CameraCLIP’s video-to-text retrieval accuracy on the validation dataset derived from Cinematic2K (see Section 4.2). We then present the qualitative and quantitative results of CameraDiff, including single-shot generation and multi-camera motion composition using CLIPLoRA (see Section 4.3). Finally, we conduct an ablation study to analyze our temporal modeling method (see Section 4.4.1) and CameraCLIP’s performance as the evaluator in CLIPLoRA (see Section 4.4.2).

4.1 Setup

Video CLIP Models. We evaluate four representative models for video-text alignment: CLIP4CLIP [20], ViCLIP [39], LongCLIP [44], and VideoCLIP-XL [37]. To assess their performance, we split Cinematic2K into a training set for fine-tuning CameraCLIP and a validation set for benchmarking various VideoCLIP models. Detailed dataset information is provided in the Supplementary Materials.

Evaluation Metrics. We evaluate the alignment between text and video using recall at 1 (R@1), the proportion of queries for which the correct answer is the top-1 returned result. The quality of generated videos is evaluated using Fréchet Video Distance (FVD) [33], which assesses visual fidelity and temporal coherence, and CLIPSIM [22], which assesses alignment with cinematic language. FVD is computed using reference videos from WebVid10M [2].
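For reference, R@1 can be computed from a query-by-candidate similarity matrix as in the sketch below. It assumes that each query has exactly one correct candidate and that the correct candidate for query $i$ sits at index $i$; both are assumptions about the evaluation layout, and the function name is hypothetical.

```python
def recall_at_1(sim: list[list[float]]) -> float:
    """Fraction of queries whose top-ranked candidate is the correct one.

    sim[i][j] is the similarity of query i to candidate j; the correct
    candidate for query i is assumed to be j == i.
    """
    if not sim:
        raise ValueError("need at least one query")
    hits = 0
    for i, row in enumerate(sim):
        best = max(range(len(row)), key=row.__getitem__)
        if best == i:
            hits += 1
    return hits / len(sim)


scores = [
    [0.9, 0.1, 0.0],  # query 0 ranks candidate 0 first (hit)
    [0.2, 0.8, 0.1],  # query 1 ranks candidate 1 first (hit)
    [0.7, 0.2, 0.3],  # query 2 ranks candidate 0 first (miss)
]
print(recall_at_1(scores))
```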

LoRA Composition Baselines. We compare our CLIPLoRA method with the direct generation method (Origin) and several LoRA composition baselines (LoRA Merge, Switch, and Composite) for multi-shot control within a single video, using AnimateDiff [12] as the backbone model. The detailed experimental settings for CameraDiff and the genetic algorithm used to search for the optimal LoRA composition are provided in the Supplementary Material.

4.2 CameraCLIP

In this section, we evaluate the performance of our proposed CameraCLIP on the R@1 metric through experiments with different representative video CLIP models and resolutions.

Models	Different Types	R@1 $\uparrow$
CLIP [22]	ViT-B/32-224px	0.63
	ViT-B/16-224px	0.64
	ViT-L/14-224px	0.68
	ViT-L/14-336px	0.71
CLIP4CLIP [20]	ViT-B/32 meanP	0.71
	ViT-B/32 seqLSTM	0.69
	ViT-B/32 seqTransf	0.66
	ViT-B/16 meanP	0.72
	ViT-B/16 seqLSTM	0.70
	ViT-B/16 seqTransf	0.72
ViCLIP [39]	ViCLIP-B-16	0.64
	ViCLIP-L-14	0.68
LongCLIP [44]	LongCLIP-B	0.55
LongCLIP [44]	LongCLIP-L	0.75
VideoCLIP-XL [37]	VideoCLIP-XL	0.77
VideoCLIP-XL [37]	VideoCLIP-XL-V2	0.76
CameraCLIP (Ours)	ViT-B/32-224px	0.69
	ViT-B/16-224px	0.73
	ViT-L/14-224px	0.77
	ViT-L/14-336px	0.83

Table 2: Quantitative comparison with other video CLIP models. CameraCLIP outperforms other models on the validation set derived from Cinematic2K, achieving the best performance. The best-performing result is highlighted in bold.

As shown in Table 2, traditional CLIP models with varying Vision Transformer (ViT) sizes and input resolutions exhibit moderate accuracy, with larger models and higher resolutions yielding better performance (e.g., ViT-B/32-224px achieves 0.63, while ViT-L/14-336px reaches 0.71). CLIP4CLIP, which adds sequence models on top of frame features, offers a slight improvement, peaking at 0.72, whereas ViCLIP reaches at most 0.68. LongCLIP benefits from extended input length, achieving an R@1 score of 0.75 with LongCLIP-L, while VideoCLIP-XL further improves to 0.77 using a text-similarity-guided primary component matching mechanism.

Our proposed CameraCLIP surpasses all baselines, achieving an R@1 score of 0.83 with the ViT-L/14-336px configuration, demonstrating superior understanding of cinematic language, including shot framing, shot angles, and camera movement. Furthermore, CameraCLIP consistently outperforms traditional CLIP models across all configurations, confirming that fine-tuning on a specialized cinematic dataset enhances video-specific linguistic alignment. These results validate CameraCLIP’s effectiveness in bridging cinematic language with video content.

4.3 CameraDiff

The qualitative results of CameraDiff for single-shot generation are presented in Figure 4. After obtaining LoRA weights for specific cinematic styles, we apply CLIPLoRA to compose multiple shots within a single video. The corresponding qualitative results are shown in Figure 5.

First, we use CLIPSIM to measure cinematic text-video consistency as the number of LoRAs increases, as shown in Figure 6. CLIPLoRA consistently outperforms all baselines while maintaining stable performance, whereas traditional static methods suffer from detail loss and disjointed transitions, particularly with a higher number of LoRAs. These results highlight the effectiveness of CLIPLoRA for multi-shot integration.

Additionally, we use FVD to evaluate the generated video’s visual fidelity and temporal coherence, as shown in Table 3. The results demonstrate that CLIPLoRA significantly outperforms existing methods in both FVD and CLIPSIM, confirming its superiority in generating high-quality composite multi-shot videos.

	Origin	Merge	Switch	Composite	CLIPLoRA
FVD $\downarrow$	2534	2358	2730	2404	1837
CLIPSIM $\uparrow$	0.2268	0.2375	0.2336	0.2394	0.2535

Table 3: Quantitative comparison with other LoRA composition methods. Our proposed CLIPLoRA surpasses existing approaches in both text-video similarity and generated quality of cinematic language videos, with the best score highlighted in bold.

4.4 Ablation Study

4.4.1 Temporal Modeling Method

To validate the effectiveness of the mean pooling approach for modeling video frames, we conducted an ablation study comparing various temporal modeling methods, including Transformer, LSTM, MLP, Multi-Head Attention, and Transformer+LSTM. As shown in Table 4, Transformer achieved the second-best R@1 score of 0.43, while mean pooling significantly outperformed it with an R@1 score of 0.83. This suggests that, on our small dataset, mean pooling provides a more robust frame representation, as more complex models tend to overfit and generalize poorly, consistent with the conclusion of prior work [20].

Temporal Modeling Methods	R@1 $\uparrow$
Transformer	0.43
LSTM	0.34
MLP	0.29
Multi-Head Attention	0.39
Transformer+LSTM	0.32
Mean Pooling (Ours)	0.83

Table 4: Quantitative comparison with other temporal modeling methods to train CameraCLIP. We evaluate their performance on the cinematic validation set using the R@1 score, with the highest accuracy highlighted in bold.

4.4.2 CameraCLIP as Evaluator

	CLIP	LongCLIP	VideoCLIP-XL	CLIPLoRA
FVD $\downarrow$	1958	1957	1921	1837
CLIPSIM $\uparrow$	0.2396	0.2417	0.2427	0.2535

Table 5: Quantitative comparison with other video CLIP models. Our proposed CameraCLIP serves as an evaluator to guide multi-LoRA selection during the denoising stage, outperforming other video CLIP models in both text-video similarity and cinematic video quality, with the best score highlighted in bold.

Table 5 presents a quantitative comparison of our proposed CameraCLIP with other video CLIP models used as evaluators in the search for optimal LoRA compositions. CameraCLIP outperforms the other models in both CLIPSIM and FVD, demonstrating that, by guiding multi-LoRA selection during the denoising stage, CameraCLIP effectively enhances both text-video similarity and cinematic video generation quality. Furthermore, as shown in Table 3, the dynamic LoRA composition method outperforms the traditional static composition approaches, suggesting that, in video generation, a carefully crafted LoRA composition strategy can significantly improve video quality. This also provides insight for researchers focusing on the LoRA composition stage to enhance customized video quality.

5 Conclusion

In this work, we focus on enhancing the capability of T2V models to generate cinematic language videos. To achieve this, we propose a threefold solution. First, we introduce Cinematic2K, a comprehensive dataset encompassing shot framing, shot angles, and camera movements. Second, we present CameraDiff, which leverages LoRA for precise cinematic control. Finally, to improve multi-shot composition, we propose CLIPLoRA, using CameraCLIP as the evaluator to guide dynamic LoRA composition, facilitating smooth transitions and realistic blending of cinematic styles within a single video.

Experimental results show that CameraCLIP achieves an R@1 score of 0.83, surpassing existing models in cinematic text-video alignment. CameraDiff ensures stable single-shot control, while CLIPLoRA integrates multiple LoRAs to generate complex cinematic compositions and provides insights for enhancing customized video quality. These contributions establish a foundation for controllable cinematic language generation, bridging the gap between automated video synthesis and expert cinematography.

References

[1] Tanweer Alam, Shamimul Qamar, Amit Dixit, and Mohamed Benaida. Genetic algorithm: Reviews, implementations, and applications. arXiv preprint arXiv:2007.12673, 2020.
[2] Max Bain, Arsha Nagrani, Gül Varol, and Andrew Zisserman. Frozen in time: A joint video and image encoder for end-to-end retrieval. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 1728–1738, 2021.
[3] Andreas Blattmann, Tim Dockhorn, Sumith Kulal, Daniel Mendelevitch, Maciej Kilian, Dominik Lorenz, Yam Levi, Zion English, Vikram Voleti, Adam Letts, et al. Stable video diffusion: Scaling latent video diffusion models to large datasets. arXiv preprint arXiv:2311.15127, 2023.
[4] Andreas Blattmann, Robin Rombach, Huan Ling, Tim Dockhorn, Seung Wook Kim, Sanja Fidler, and Karsten Kreis. Align your latents: High-resolution video synthesis with latent diffusion models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 22563–22575, 2023.
[5] David Chen and William B Dolan. Collecting highly parallel data for paraphrase evaluation. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, pages 190–200, 2011.
[6] Haoxin Chen, Menghan Xia, Yingqing He, Yong Zhang, Xiaodong Cun, Shaoshu Yang, Jinbo Xing, Yaofang Liu, Qifeng Chen, Xintao Wang, et al. VideoCrafter1: Open diffusion models for high-quality video generation. arXiv preprint arXiv:2310.19512, 2023.
[7] Lin Chen, Xilin Wei, Jinsong Li, Xiaoyi Dong, Pan Zhang, Yuhang Zang, Zehui Chen, Haodong Duan, Bin Lin, Zhenyu Tang, et al. ShareGPT4Video: Improving video understanding and generation with better captions. arXiv preprint arXiv:2406.04325, 2024.
[8] Katherine Crowson, Stella Biderman, Daniel Kornis, Dashiell Stander, Eric Hallahan, Louis Castricato, and Edward Raff. VQGAN-CLIP: Open domain image generation and editing with natural language guidance. In European Conference on Computer Vision, pages 88–105. Springer, 2022.
[9] Kevin Frans, Lisa Soros, and Olaf Witkowski. CLIPDraw: Exploring text-to-drawing synthesis through language-image encoders. In Advances in Neural Information Processing Systems, 2022.
[10] David E Goldberg. Genetic algorithms in search, optimization, and machine learning. Addison Wesley, 1989(102):36, 1989.
[11] Xiuye Gu, Tsung-Yi Lin, Weicheng Kuo, and Yin Cui. Open-vocabulary object detection via vision and language knowledge distillation. In International Conference on Learning Representations, 2021.
[12] Yuwei Guo, Ceyuan Yang, Anyi Rao, Zhengyang Liang, Yaohui Wang, Yu Qiao, Maneesh Agrawala, Dahua Lin, and Bo Dai. AnimateDiff: Animate your personalized text-to-image diffusion models without specific tuning. In The Twelfth International Conference on Learning Representations, 2024.
[13] Hao He, Yinghao Xu, Yuwei Guo, Gordon Wetzstein, Bo Dai, Hongsheng Li, and Ceyuan Yang. CameraCtrl: Enabling camera control for text-to-video generation. arXiv preprint arXiv:2404.02101, 2024.
Ho et al. [2020] Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising diffusion probabilistic models. Advances in neural information processing systems, 33:6840–6851, 2020.
Ho et al. [2022] Jonathan Ho, William Chan, Chitwan Saharia, Jay Whang, Ruiqi Gao, Alexey Gritsenko, Diederik P Kingma, Ben Poole, Mohammad Norouzi, David J Fleet, et al. Imagen video: High definition video generation with diffusion models. arXiv preprint arXiv:2210.02303, 2022.
Hu et al. [2021] Edward J Hu, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, Weizhu Chen, et al. Lora: Low-rank adaptation of large language models. In International Conference on Learning Representations, 2021.
Li et al. [2022a] Boyi Li, Kilian Q Weinberger, Serge Belongie, Vladlen Koltun, and Rene Ranftl. Language-driven semantic segmentation. In International Conference on Learning Representations, 2022a.
Li et al. [2024] Bo Li, Yuanhan Zhang, Dong Guo, Renrui Zhang, Feng Li, Hao Zhang, Kaichen Zhang, Peiyuan Zhang, Yanwei Li, Ziwei Liu, et al. Llava-onevision: Easy visual task transfer. arXiv preprint arXiv:2408.03326, 2024.
Li et al. [2022b] Liunian Harold Li, Pengchuan Zhang, Haotian Zhang, Jianwei Yang, Chunyuan Li, Yiwu Zhong, Lijuan Wang, Lu Yuan, Lei Zhang, Jenq-Neng Hwang, et al. Grounded language-image pre-training. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 10965–10975, 2022b.
Luo et al. [2022] Huaishao Luo, Lei Ji, Ming Zhong, Yang Chen, Wen Lei, Nan Duan, and Tianrui Li. Clip4clip: An empirical study of clip for end to end video clip retrieval and captioning. Neurocomputing, 508:293–304, 2022.
Nan et al. [2024] Kepan Nan, Rui Xie, Penghao Zhou, Tiehan Fan, Zhenheng Yang, Zhijie Chen, Xiang Li, Jian Yang, and Ying Tai. Openvid-1m: A large-scale high-quality dataset for text-to-video generation. arXiv preprint arXiv:2407.02371, 2024.
Radford et al. [2021] Alec Radford, Jong Wook Kim, Chris Hallacy, A. Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, Gretchen Krueger, and Ilya Sutskever. Learning transferable visual models from natural language supervision. In ICML, 2021.
Ramesh et al. [2022] Aditya Ramesh, Prafulla Dhariwal, Alex Nichol, Casey Chu, and Mark Chen. Hierarchical text-conditional image generation with clip latents. arXiv preprint arXiv:2204.06125, 2022.
Rombach et al. [2022] Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. High-resolution image synthesis with latent diffusion models. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 10684–10695, 2022.
Ryu [2023] S. Ryu. Merging loras. https://github.com/cloneofsimo/lora, 2023. Accessed: 2024-11-06.
Saharia et al. [2022] Chitwan Saharia, William Chan, Saurabh Saxena, Lala Li, Jay Whang, Emily L Denton, Kamyar Ghasemipour, Raphael Gontijo Lopes, Burcu Karagol Ayan, Tim Salimans, et al. Photorealistic text-to-image diffusion models with deep language understanding. Advances in neural information processing systems, 35:36479–36494, 2022.
Saito et al. [2017] Masaki Saito, Eiichi Matsumoto, and Shunta Saito. Temporal generative adversarial nets with singular value clipping. In Proceedings of the IEEE international conference on computer vision, pages 2830–2839, 2017.
Savardi et al. [2021] Mattia Savardi, András Bálint Kovács, Alberto Signoroni, and Sergio Benini. Cinescale: A dataset of cinematic shot scale in movies. Data in Brief, 36:107002, 2021.
Savardi et al. [2023] Mattia Savardi, András Bálint Kovács, Alberto Signoroni, and Sergio Benini. Cinescale2: a dataset of cinematic camera features in movies. Data in Brief, 51:109627, 2023.
Schuhmann et al. [2022] Christoph Schuhmann, Romain Beaumont, Richard Vencu, Cade Gordon, Ross Wightman, Mehdi Cherti, Theo Coombes, Aarush Katta, Clayton Mullis, Mitchell Wortsman, et al. Laion-5b: An open large-scale dataset for training next generation image-text models. Advances in neural information processing systems, 35:25278–25294, 2022.
Skorokhodov et al. [2022] Ivan Skorokhodov, Sergey Tulyakov, and Mohamed Elhoseiny. Stylegan-v: A continuous video generator with the price, image quality and perks of stylegan2. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 3626–3636, 2022.
Tulyakov et al. [2018] Sergey Tulyakov, Ming-Yu Liu, Xiaodong Yang, and Jan Kautz. Mocogan: Decomposing motion and content for video generation. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 1526–1535, 2018.
Unterthiner et al. [2018] Thomas Unterthiner, Sjoerd Van Steenkiste, Karol Kurach, Raphael Marinier, Marcin Michalski, and Sylvain Gelly. Towards accurate generative models of video: A new metric & challenges. arXiv preprint arXiv:1812.01717, 2018.
Vinker et al. [2022] Yael Vinker, Ehsan Pajouheshgar, Jessica Y Bo, Roman Christian Bachmann, Amit Haim Bermano, Daniel Cohen-Or, Amir Zamir, and Ariel Shamir. Clipasso: Semantically-aware object sketching. ACM Transactions on Graphics (TOG), 41(4):1–11, 2022.
Vondrick et al. [2016] Carl Vondrick, Hamed Pirsiavash, and Antonio Torralba. Generating videos with scene dynamics. Advances in neural information processing systems, 29, 2016.
Wang et al. [2023a] Jiapeng Wang, Chengyu Wang, Xiaodan Wang, Jun Huang, and Lianwen Jin. Cocaclip: Exploring distillation of fully-connected knowledge interaction graph for lightweight text-image retrieval. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 5: Industry Track), pages 71–80, 2023a.
Wang et al. [2024a] Jiapeng Wang, Chengyu Wang, Kunzhe Huang, Jun Huang, and Lianwen Jin. Videoclip-xl: Advancing long description understanding for video clip models. In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, pages 16061–16075, 2024a.
Wang et al. [2024b] Xiang Wang, Hangjie Yuan, Shiwei Zhang, Dayou Chen, Jiuniu Wang, Yingya Zhang, Yujun Shen, Deli Zhao, and Jingren Zhou. Videocomposer: Compositional video synthesis with motion controllability. Advances in Neural Information Processing Systems, 36, 2024b.
Wang et al. [2023b] Yi Wang, Yinan He, Yizhuo Li, Kunchang Li, Jiashuo Yu, Xin Ma, Xinhao Li, Guo Chen, Xinyuan Chen, Yaohui Wang, et al. Internvid: A large-scale video-text dataset for multimodal understanding and generation. In The Twelfth International Conference on Learning Representations, 2023b.
Wang et al. [2024c] Zhouxia Wang, Ziyang Yuan, Xintao Wang, Yaowei Li, Tianshui Chen, Menghan Xia, Ping Luo, and Ying Shan. Motionctrl: A unified and flexible motion controller for video generation. In ACM SIGGRAPH 2024 Conference Papers, pages 1–11, 2024c.
Xu et al. [2021] Hu Xu, Gargi Ghosh, Po-Yao Huang, Dmytro Okhonko, Armen Aghajanyan, Florian Metze, Luke Zettlemoyer, and Christoph Feichtenhofer. Videoclip: Contrastive pre-training for zero-shot video-text understanding. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pages 6787–6800, 2021.
Xu et al. [2022] Jiarui Xu, Shalini De Mello, Sifei Liu, Wonmin Byeon, Thomas Breuel, Jan Kautz, and Xiaolong Wang. Groupvit: Semantic segmentation emerges from text supervision. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 18134–18144, 2022.
Yang et al. [2024] Shiyuan Yang, Liang Hou, Haibin Huang, Chongyang Ma, Pengfei Wan, Di Zhang, Xiaodong Chen, and Jing Liao. Direct-a-video: Customized video generation with user-directed camera movement and object motion. In Special Interest Group on Computer Graphics and Interactive Techniques Conference Conference Papers ’24, page 1–12. ACM, 2024.
Zhang et al. [2025] Beichen Zhang, Pan Zhang, Xiaoyi Dong, Yuhang Zang, and Jiaqi Wang. Long-clip: Unlocking the long-text capability of clip. In European Conference on Computer Vision, pages 310–325. Springer, 2025.
Zhong et al. [2024] Ming Zhong, Yelong Shen, Shuohang Wang, Yadong Lu, Yizhu Jiao, Siru Ouyang, Donghan Yu, Jiawei Han, and Weizhu Chen. Multi-lora composition for image generation. arXiv preprint arXiv:2402.16843, 2024.
Zhou et al. [2018] Tinghui Zhou, Richard Tucker, John Flynn, Graham Fyffe, and Noah Snavely. Stereo magnification: learning view synthesis using multiplane images. ACM Transactions on Graphics (TOG), 37(4):1–12, 2018.