ReWind: Understanding Long Videos with Instructed Learnable Memory


Anxhelo Diko1  Tinghuai Wang2,*  Wassim Swaileh2  Shiyan Sun2  Ioannis Patras2
1 La Sapienza University of Roma  2 Huawei Helsinki Research Center
diko@di.uniroma1.it {tinghuaiwang,shiyansun,wassim.swaileh,ioannis.patras}@huawei.com
* Equal Contribution. Work done while at Huawei.
Abstract

Vision-Language Models (VLMs) are crucial for applications requiring integrated understanding of textual and visual information. However, existing VLMs struggle with long videos due to computational inefficiency, memory limitations, and difficulties in maintaining coherent understanding across extended sequences. To address these challenges, we introduce ReWind, a novel memory-based VLM designed for efficient long video understanding while preserving temporal fidelity. ReWind operates in a two-stage framework. In the first stage, ReWind maintains a dynamic learnable memory module with a novel read-perceive-write cycle that stores and updates instruction-relevant visual information as the video unfolds. This module utilizes learnable queries and cross-attentions between memory contents and the input stream, ensuring low memory requirements by scaling linearly with the number of tokens. In the second stage, we propose an adaptive frame selection mechanism guided by the memory content to identify instruction-relevant key moments. It enriches the memory representations with detailed spatial information by selecting a few high-resolution frames, which are then combined with the memory contents and fed into a Large Language Model (LLM) to generate the final answer. We empirically demonstrate ReWind’s superior performance in visual question answering (VQA) and temporal grounding tasks, surpassing previous methods on long video benchmarks. Notably, ReWind achieves a +13% score gain and a +12% accuracy improvement on the MovieChat-1K VQA dataset and a +8% mIoU increase on Charades-STA for temporal grounding.

1 Introduction

Large Language Models (LLMs) [23, 22] have demonstrated remarkable capabilities at human language processing [24, 3]. However, these models are limited to text-based inputs and, therefore, oblivious to real-world, multi-sensory information. To address this limitation, researchers are actively developing Multimodal LLMs (MLLMs) capable of processing signals from multiple and diverse modalities [13, 15, 18], including images, video, and audio. This emerging field holds immense potential for applications such as visual question answering (VQA), real-time interfaces for autonomous agents, and generating detailed scene descriptions for the visually impaired.

Figure 1: ReWind is a memory-based VLM framework designed for long video understanding (10+ minutes), specialized in VQA and temporal grounding. The highlighted frames are selected from ReWind’s dynamic frame selection mechanism.

Recent research in MLLMs has predominantly concentrated on Vision-Language Models (VLMs) [27, 28, 11], which typically combine pre-trained LLMs with visual encoders that encode visual information and feed it to the LLM. However, existing VLMs face two major challenges in processing long videos. First, their self-attention mechanisms require substantial memory that scales quadratically with the number of tokens, making long video processing computationally intensive. Second, these models struggle to effectively model temporal dependencies over extended sequences. To address these challenges, recent efforts have proposed using memory modules to enhance the capability of VLMs [21, 10]. However, current memory modules often serve as storage units and lack the ability to discern and retain information pertinent to the task or user instructions. Moreover, these models tend to compress temporal information heavily [21], sacrificing the fidelity of the temporal dynamics and overlooking critical details in the video’s narrative. Additionally, current models rely on fixed dense spatial representations per frame [10, 21]: by treating all frames equally, they store unnecessary details for non-essential moments, increasing memory demands and limiting the model’s ability to focus on critical events for accurate video comprehension.

To address these long video challenges, we introduce ReWind, a novel memory-based framework that operates in two stages, advancing the state of the art with key innovations. In the first stage (Stage-1 in Fig. 2), through a learnable memory module, ReWind enables instruction-guided feature encoding and storage into a memory bank of coherent temporal information. At its core, ReWind features a novel read-perceive-write cycle. First (Read Cross Attention), a read operation looks at historical context from memory and produces fixed-size read queries. These are then used as queries in a perceiver unit (Perceiver Block) that processes tokens from the encoder of incoming frames. Unlike previous Q-Former approaches (e.g., [27]) that compress information at the clip level, our design allows memory-informed processing of the incoming frames and preserves temporal fidelity. Finally, the perceiver’s representations of the input tokens flow into the write operation (Write Cross Attention), where learnable write queries distill and filter the information. The resulting compact representations are then stored in memory, enabling ReWind to progressively build coherent temporal representations while avoiding the compression issues present in previous works [21, 10, 27]. Crucially, in this stage, we avoid cross-attention between the memory and the video stream, as well as self-attention within the stream tokens, both of which carry high computational demand. In the second stage (Stage-2 in Fig. 2), ReWind ‘rewinds’ the video stream and dynamically selects frames through a selection mechanism guided by the memory contents and the user instructions. The selection mechanism operates on high spatial resolution tokens from the input stream so that, after selection, tokens from both the memory bank and the dense selection outputs are fed into an LLM that generates the response. In contrast to previous works that maintain fixed-size dense representations for each frame [21, 27], this selection strategy incorporates detailed spatial information only for relevant key events, resulting in reduced memory requirements.

In our extensive evaluations, ReWind demonstrates superior performance compared to previous state-of-the-art methods across both long and short-term video question answering [21, 5, 25, 14] and temporal grounding video benchmarks [9, 4], validating the effectiveness of our approach. Additionally, detailed ablations motivate our design choices. In summary, the main contributions of this work are threefold:

  • ReWind, a novel memory-based vision-language model that enables efficient understanding of long videos while maintaining temporal fidelity.

  • A learnable memory module with an innovative read-perceive-write cycle that enables instruction-guided feature encoding and robust temporal representation construction.

  • An adaptive frame selection mechanism that identifies instruction-relevant key moments and enriches memory representations with detailed spatial information for comprehensive video understanding.

2 Related Works

2.1 Short Video Understanding

Recent VLMs have explored various architectural approaches for video understanding. Dual-stream architectures, exemplified by Video-LLaMA [27] and VideoChat [14], process different modalities separately. The former processes audio and visual information separately using Q-Formers [29]. The latter processes video using specialized embedding models and a perception toolkit for mixed modalities. In contrast, single-stream approaches like Video-ChatGPT [18] employ spatiotemporal pooling to capture the overall video context. Video-LLaVA [17] utilizes a LanguageBind [30] module to map multimodal inputs into a shared space. Mirasol3B [19] proposes a decoder-only model adapted to handle multimodal inputs, representing them in disentangled spaces. Chat-UniVi [12] takes a unique approach by introducing a unified visual representation through dynamic visual tokens for both images and videos.

Figure 2: ReWind’s VLM architecture for long video processing is illustrated in (a). It employs a two-stage processing scheme. In Stage 1 (black arrows), ReWind sequentially processes each video sub-clip using a visual encoder and a text-conditioned perceiver layer supported by a learnable memory module. This module performs read-and-write operations to ensure efficient information storage and maintain temporal coherence in a novel read-perceive-write cycle. In Stage 2 (green arrows), ReWind utilizes a dynamic frame selection (DFS) mechanism to incorporate detailed spatial information for key moments. Finally (red arrow), the memory content, selected frames, and user instruction are combined to form the input for the language model. Panel (b) shows the perceiver layer, which uses learnable queries and text-conditioned visual features for instruction-guided encoding.

2.2 Long Video Understanding

Recent works have proposed diverse solutions to address the challenges pertinent to long video understanding. Memory-based approaches include MovieChat [21], which employs a dual memory module with a FIFO queue for short-term memory and a consolidation module for long-term memory, and MA-LMM [10], which introduces a hierarchical memory module. TimeChat [20] incorporates timestamps and transcribed speech for time-aware encoding. However, these approaches [21, 10, 20] significantly compress temporal information, compromising the understanding of event dynamics. Alternative approaches focus on efficient frame representation. LLaMA-VID [15] efficiently represents each frame with only two tokens. VTimeLLM [11] introduces temporal-focused training and uses only the class tokens as frame representations. Yet, both VTimeLLM and LLaMA-VID process frames in isolation, failing to capture coherent temporal representations.

Unlike previous works that either significantly compress temporal information [21, 10, 20] or process frames in isolation [15], ReWind distinguishes itself by proposing a novel memory-based architecture with a read-perceive-write cycle that selectively stores instruction-relevant visual information while enabling efficient processing of long videos and maintaining temporal fidelity. As opposed to approaches that maintain fixed dense representations [21, 10], ReWind employs an adaptive frame selection mechanism that enriches memory representations with detailed spatial information only for instruction-relevant key moments.

3 Method

ReWind enables efficient long-video understanding through a novel memory-based architecture that maintains temporal fidelity while selectively storing instruction-relevant information. As shown in Fig. 2 (a), the architecture implements this through two-stage processing. Stage-1, the read-perceive-write cycle, comprises: (1) a vision encoder, (2) a text encoder for instruction processing, (3) an instruction-aware perceiver that bridges visual features and LLM understanding, and (4) a memory module with learnable read and write operations for efficient information storage. Stage-2, the selection stage, comprises a dynamic frame selection (DFS) mechanism that enriches memory representations with detailed spatial information for key moments. These two stages work in concert to enable the LLM to generate responses based on both the instruction and video content. We explain Stage-1 components in Sections 3.1 and 3.2, and the DFS in Section 3.3. Finally, we detail the LLM input formation and the training strategy in Sections 3.4 and 3.5.

3.1 Visual Feature Extraction

To process long videos under GPU memory constraints, ReWind divides the input video $V$ containing $T$ frames into $N$ sub-clips $S = \{s_1, s_2, \dots, s_N\}$, each with $F$ frames ($N = T/F$). For each frame $f_{ij}$ in sub-clip $s_i$, a pre-trained ViT-G/14 encoder from EVA-CLIP [8] extracts visual features as a sequence of tokens $P_{ij}$.
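The partitioning step can be illustrated with a minimal sketch. The function name `split_into_subclips` and the choice $F = 16$ are hypothetical, and the handling of a final shorter sub-clip when $F$ does not divide $T$ is an assumption, since the text only states $N = T/F$.

```python
from typing import List


def split_into_subclips(num_frames: int, frames_per_clip: int) -> List[range]:
    """Partition frame indices 0..T-1 into consecutive sub-clips of F frames."""
    if num_frames < 0 or frames_per_clip <= 0:
        raise ValueError("num_frames must be >= 0 and frames_per_clip > 0")
    # Assumption: when F does not divide T, the last sub-clip is simply shorter.
    return [
        range(start, min(start + frames_per_clip, num_frames))
        for start in range(0, num_frames, frames_per_clip)
    ]


# T = 64 frames and F = 16 frames per sub-clip (hypothetical values) give N = 4.
clips = split_into_subclips(num_frames=64, frames_per_clip=16)
print(len(clips), [len(c) for c in clips])
```

Each `range` holds the frame indices $j$ of one sub-clip $s_i$, so the encoder can be run on one sub-clip at a time.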

3.2 Instructed Memory Architecture

At the core of ReWind lies its novel read-perceive-write cycle that enables progressive video understanding while maintaining temporal fidelity. This cycle orchestrates the interaction between a long-term memory bank for storing distilled video representations, an instruction-aware temporal perceiver for temporal representation construction, and learnable read-write functions for memory interaction, as illustrated in Fig. 3.

To effectively process long videos, ReWind’s memory module selectively stores instruction-relevant information from incoming frames while enabling progressive information accumulation. The module centers on a long-term memory bank $M$ and learnable read-write functions that bridge memory content with perceiver features. The read operation, using learnable queries $Q_R$, first retrieves historical context from $M$. These read queries then initialize the perceiver’s queries for instruction-guided visual feature extraction from ViT outputs. Finally, learnable write queries $Q_W$ distill the perceiver’s output through cross-attention for efficient storage in $M$. Additionally, original visual features are preserved in a feature buffer for potential detailed spatial analysis. This tight integration between memory operations and the perceiver ensures temporally coherent representations while maintaining computational efficiency.

Figure 3: Simplified workflow of ReWind’s read-perceive-write cycle.

3.2.1 Read Operation

The read operation aims to facilitate dynamic, context-aware feature extraction. This interface enables continuous interaction between the feature extraction process and the evolving memory content in $M$. Specifically, as the memory gets populated with information from previously processed video segments, the read interface uses a fixed number $N_R$ (i.e., 32) of read queries $Q_R$ to actively retrieve relevant context through a cross-attention mechanism between them and the contents of $M$, as depicted in Fig. 2 (a). This retrieval process enables the feature extraction pipeline to remain informed by the most recent knowledge stored in the memory. These context-enriched read queries then guide the perceiver’s processing of incoming frames, ensuring that feature extraction maintains awareness of previously stored temporal information.
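A minimal NumPy sketch of such a read step is given below. It is an illustration under simplifying assumptions rather than the actual implementation: attention is single-head without learned projections, the residual update is assumed, and the feature width of 64 is hypothetical. Only $N_R = 32$ comes from the text.

```python
import numpy as np


def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def cross_attention(queries, context):
    """Single-head scaled dot-product attention: queries attend to context."""
    scale = np.sqrt(queries.shape[-1])
    return softmax(queries @ context.T / scale) @ context


def read(read_queries, memory):
    """Return context-enriched read queries Q_R given memory bank M."""
    if memory.shape[0] == 0:
        # Nothing has been written yet, so the queries are used as they are.
        return read_queries
    # Residual update (an assumption of this sketch).
    return read_queries + cross_attention(read_queries, memory)


rng = np.random.default_rng(0)
read_queries = rng.normal(size=(32, 64))   # N_R = 32 queries, width 64 (hypothetical)
memory = rng.normal(size=(200, 64))        # e.g. 100 frames with 2 stored tokens each
context_queries = read(read_queries, memory)
```

The output keeps the fixed shape of the read queries regardless of how many tokens the memory holds, so the cost of the read grows linearly with the memory size.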

3.2.2 Perceive Operation

The perceive operation, performed by a perceiver block, bridges visual features and the LLM’s understanding through instruction-aware temporal modeling. As illustrated in Fig. 2 (b), the perceiver’s design allows for effective integration of instruction-guided features with historical context. It utilizes a set of $N_Q$ learnable queries, $Q$, to project $P_{ij}$ into a latent space that the LLM can understand. These learnable queries guide the extraction of relevant information from the visual features.

A crucial aspect of ReWind’s design is the synergistic relationship between the perceiver block and the memory module. The learnable queries $Q$ in the perceiver block share weights with the read queries $Q_R$ and are initialized with their current content, which is obtained by the cross-attention between $Q_R$ and the contents of $M$ (note that this implies $N_Q = N_R$). This creates a continuous pipeline, allowing the feature extraction process to dynamically interact with the memory and access relevant context. To further enhance this process, the perceiver block incorporates the textual embedding of the user instruction, denoted as $I$. This embedding is obtained by encoding the input text query using a pre-trained BERT encoder. $I$ is then appended to the visual features $P_{ij}$ to form extended representations, denoted as $\hat{P}_{ij}$. As depicted in Fig. 2 (b), the perceiver block employs a cross-attention mechanism between $Q$ and the combined visual-textual features $\hat{P}_{ij}$. This cross-attention allows the model to selectively attend to the most relevant aspects of the visual information conditioned on the user instruction. The output of this mechanism is a set of refined frame-level representations, denoted as $\hat{Q}_{ij}$.

Since $Q$ shares weights and content with the updated read queries $Q_R$, the feature extraction process is also conditioned by the memory content. Specifically, as denoted in Fig. 2 (a), $Q_R$ are always updated with the latest memory content before being used by the perceiver, assuming $M$ has content in it. This ensures a progressive construction of robust and temporally informed representations of the video content at the frame level. Finally, to capture temporal relationships within the clip $s_i$, the perceiver performs self-attention on the temporal dimension of the refined representations $\hat{Q}_{ij}$ of consecutive frames. This temporal attention allows the model to understand how events unfold inside the clip. Note that unlike previous video Q-Formers [27, 21] that produce clip-level representations, our perceiver processes each frame individually and then performs temporal attention. This approach preserves temporal fidelity and enables a more nuanced understanding of event dynamics while keeping a robust representation of each frame.
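The perceive step can be sketched as follows. This is a simplified illustration, not the 8-layer perceiver itself: attention is single-head without learned projections, visual and text tokens are assumed to share one feature width, and all sizes except $N_Q = 32$ are hypothetical.

```python
import numpy as np


def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def attention(queries, context):
    """Single-head scaled dot-product attention without learned projections."""
    scale = np.sqrt(queries.shape[-1])
    return softmax(queries @ context.T / scale) @ context


def perceive(queries, clip_tokens, instruction):
    """
    queries     : (N_Q, D)   queries initialised from the read step
    clip_tokens : (F, P, D)  ViT tokens P_ij for the F frames of one sub-clip
    instruction : (L_I, D)   instruction embedding I
    returns     : (F, N_Q, D) frame-level representations
    """
    per_frame = []
    for frame_tokens in clip_tokens:
        # Append I to the visual tokens to form the extended features.
        extended = np.concatenate([frame_tokens, instruction], axis=0)
        per_frame.append(attention(queries, extended))
    q_hat = np.stack(per_frame)                          # (F, N_Q, D)
    # Temporal self-attention: each query slot attends over the F frames.
    slots = q_hat.transpose(1, 0, 2)                     # (N_Q, F, D)
    mixed = np.stack([attention(s, s) for s in slots])   # (N_Q, F, D)
    return mixed.transpose(1, 0, 2)                      # (F, N_Q, D)


rng = np.random.default_rng(0)
queries = rng.normal(size=(32, 16))
clip_tokens = rng.normal(size=(4, 10, 16))
instruction = rng.normal(size=(5, 16))
q_hat = perceive(queries, clip_tokens, instruction)
```

Every frame keeps its own set of $N_Q$ outputs, which is the frame-level (rather than clip-level) behaviour described above; the temporal attention only mixes information across frames of the same sub-clip.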

3.2.3 Write Operation

The write operation efficiently distills and stores the perceiver’s frame-level output $\hat{Q}_{ij}$ in memory. While these outputs capture rich spatial and contextual information, the sheer number of query tokens per frame impedes processing long videos. To address this, ReWind’s learnable writing mechanism compresses the visual information into a more efficient representation. This mechanism utilizes a set of learnable write queries, $Q_W$, to distill the scene information into a much smaller number of tokens (e.g., 2 tokens per frame). Specifically, ReWind employs cross-attention between $Q_W$ and $\hat{Q}_{ij}$ to generate compact per-frame representations $\hat{Q}^{W}_{ij}$. These representations are then stored in the memory bank $M$ in temporal order, enabling the progressive construction of temporally coherent video representations. Additionally, the original visual features $P_{ij}$ for each frame are stored in a separate feature buffer, preserving the detailed spatial information for later use (see Section 3.3). This feature buffer is a simple storage container and adds no computational overhead.
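The following sketch shows how a write step of this kind compresses each frame and how the memory bank grows over sub-clips. It is illustrative only: attention is single-head without learned projections, the perceiver outputs and ViT tokens are random stand-ins, and all sizes other than 32 perceiver queries and 2 write queries are hypothetical.

```python
import numpy as np


def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def attention(queries, context):
    scale = np.sqrt(queries.shape[-1])
    return softmax(queries @ context.T / scale) @ context


def write(write_queries, q_hat):
    """Compress (F, N_Q, D) perceiver outputs to (F, N_W, D) memory tokens."""
    return np.stack([attention(write_queries, frame) for frame in q_hat])


rng = np.random.default_rng(0)
D, F, N_Q, N_W, P = 16, 4, 32, 2, 10           # D, F, P are hypothetical sizes
write_queries = rng.normal(size=(N_W, D))
memory = np.zeros((0, D))                       # memory bank M
feature_buffer = []                             # raw ViT tokens, one entry per frame

for _ in range(3):                              # three sub-clips
    q_hat = rng.normal(size=(F, N_Q, D))        # stand-in for perceiver output
    vit_tokens = rng.normal(size=(F, P, D))     # stand-in for P_ij
    compact = write(write_queries, q_hat)       # (F, N_W, D)
    memory = np.concatenate([memory, compact.reshape(-1, D)])  # temporal order
    feature_buffer.extend(vit_tokens)

print(memory.shape, len(feature_buffer))
```

After 12 frames the memory holds 24 tokens, i.e., it grows by $N_W$ tokens per frame, while the buffer keeps the uncompressed tokens for Stage-2.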

3.3 Dynamic Frame Selection

While $M$ efficiently stores a compressed video representation, certain instructions demand high spatial resolution at specific moments. ReWind addresses this through a Dynamic Frame Selection (DFS) mechanism that identifies instruction-relevant key frames using memory contents $M$ and instruction encoding $I$. This two-stage selection process, comprising instruction-based selection and clustering, occurs during Stage-2 of ReWind after full video processing and information storage in memory.

Instruction-based selection. The first stage prioritizes frames based on their relevance to the user’s instruction by leveraging $I$ and the contents of $M$. Given frame representations $\{m_t\}_{t=1}^{T} \in M$ and the averaged instruction encoding $\overline{I}$, we compute the attention matrix between $\overline{I}$ and the contents of $M$. The top $L$ frames with the highest response scores to the instruction, denoted as $Z = \{z_l\}_{l=1}^{L}$, are then selected for further processing in the second DFS stage.
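A possible reading of this step is sketched below. The function `select_top_frames` is hypothetical, and two details are assumptions not stated in the text: a frame's score is taken as the maximum over its stored tokens, and the selected indices are returned in temporal order. The softmax normalization is omitted because it does not change the ranking.

```python
import numpy as np


def select_top_frames(memory, instruction, tokens_per_frame, top_l):
    """
    memory      : (T * tokens_per_frame, D) memory bank M in temporal order
    instruction : (L_I, D) instruction encoding I
    returns     : indices of the top_l highest-scoring frames, in temporal order
    """
    d = memory.shape[-1]
    i_bar = instruction.mean(axis=0)                  # averaged instruction encoding
    token_scores = memory @ i_bar / np.sqrt(d)        # one score per memory token
    # Assumption: a frame's response is the best score among its tokens.
    frame_scores = token_scores.reshape(-1, tokens_per_frame).max(axis=1)
    top = np.argsort(frame_scores)[::-1][:top_l]
    return np.sort(top)


# Toy memory with 5 frames and 2 tokens per frame; frames 1 and 3 match the instruction.
memory = np.zeros((10, 4))
memory[2, 0] = 5.0      # frame 1, first token
memory[7, 0] = 3.0      # frame 3, second token
instruction = np.array([[1.0, 0.0, 0.0, 0.0], [1.0, 0.0, 0.0, 0.0]])
selected = select_top_frames(memory, instruction, tokens_per_frame=2, top_l=2)
```

In the reported configuration, $L = 64$ frames would be kept at this stage before clustering.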

Clustering. The second stage employs a K-nearest neighbors density peaks clustering approach inspired by DPC-KNN [7, 12] to identify $K_c$ representative frames from $Z$. For each token $z_l$, we first compute its local density $\sigma_l$ based on its $K$-nearest neighbors:

$$\sigma_l = \exp\left(-\frac{1}{K}\sum_{z_k \in \mathrm{KNN}(z_l, Z)} \|z_k - z_l\|^2\right), \tag{1}$$

where $\mathrm{KNN}(z_l, Z)$ returns the $K$-nearest neighbors of $z_l$ from $Z \setminus \{z_l\}$, i.e., the set $Z$ with $z_l$ removed. Then, we compute the distance index $\rho_l$ of each token $z_l$:

$$\rho_l = \begin{cases} \min\limits_{j:\,\sigma_j > \sigma_l} \|z_j - z_l\|^2 & \text{if } \exists\, j \text{ s.t. } \sigma_j > \sigma_l, \\ \max\limits_{j} \|z_j - z_l\|^2 & \text{otherwise.} \end{cases} \tag{2}$$

In essence, $\rho_l$ represents the distance from the token $z_l$ to the nearest token of higher density (or, for the densest token, to the farthest token). We then use $\sigma_l \times \rho_l$ as the weighted density index for each $z_l$ and sort the tokens by this index in descending order. The top $K_c$ frames with the highest indices are selected as the most representative video moments related to the user instruction. We use the indices of these $K_c$ centers to extract higher-spatial-resolution representations of the corresponding frames from the feature buffer containing ViT encodings. Finally, these representations are pooled to a desired number of tokens per frame, yielding $\hat{Z}$. This mechanism effectively “rewinds” through the video’s latent space to identify key moments, inspiring ReWind’s name.
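Equations (1) and (2) translate directly into a few lines of NumPy. The sketch below is a minimal illustration; the function name `dpc_knn_select` and the toy two-cluster data are hypothetical, and the value of $K$ is not specified in the text.

```python
import numpy as np


def dpc_knn_select(z, k, num_centers):
    """z: (L, D) candidate tokens. Returns indices of num_centers representative tokens."""
    n = len(z)
    if not 0 < k < n:
        raise ValueError("k must satisfy 0 < k < number of tokens")
    sq_dist = ((z[:, None, :] - z[None, :, :]) ** 2).sum(axis=-1)     # (L, L)
    # Eq. (1): local density from the k nearest neighbours, excluding the token itself.
    without_self = sq_dist + np.diag(np.full(n, np.inf))
    knn = np.sort(without_self, axis=1)[:, :k]
    sigma = np.exp(-knn.mean(axis=1))
    # Eq. (2): distance to the nearest higher-density token, or the
    # maximum distance when no token has higher density.
    higher = sigma[None, :] > sigma[:, None]       # higher[l, j] is sigma_j > sigma_l
    nearest_higher = np.where(higher, sq_dist, np.inf).min(axis=1)
    rho = np.where(higher.any(axis=1), nearest_higher, sq_dist.max(axis=1))
    # Rank tokens by the weighted density index sigma * rho.
    return np.argsort(sigma * rho)[::-1][:num_centers]


# Two well-separated toy clusters; indices 4 and 9 are their central points.
base = np.array([[0.0, 0.0], [0.1, 0.0], [0.0, 0.1], [0.1, 0.1], [0.05, 0.05]])
points = np.concatenate([base, base + 10.0])
centers = dpc_knn_select(points, k=2, num_centers=2)
```

The product $\sigma_l \times \rho_l$ is large only for tokens that are both in a dense region and far from any denser token, so the selection tends to return one center per distinct group of moments instead of several near-duplicates.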

3.4 Large Language Model

The input to the LLM is constructed by concatenating $M$ with the dense representations $\hat{Z}$, separated by a special token $\tau$: $\langle m_0, m_1, \dots, \tau, \hat{Z} \rangle$. The role of $\tau$ is purely to separate the memory content, which carries progressive temporal information, from the DFS frames, where spatial information is prioritized. The video content is then combined with the text instruction and given as input to the LLM.
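As a small illustration of this layout, the sketch below assembles the visual part of the LLM input. Treating $\tau$ as a single embedding vector and the feature width of 16 are assumptions; the counts of 2 memory tokens per frame, 8 selected frames, and 32 tokens per selected frame follow the model settings reported in Section 4.1, and the 100-frame video is a hypothetical example.

```python
import numpy as np


def build_visual_input(memory, dense_frames, sep_token):
    """Concatenate <m_0, m_1, ..., tau, Z_hat> along the token axis."""
    dense = dense_frames.reshape(-1, dense_frames.shape[-1])
    return np.concatenate([memory, sep_token[None, :], dense], axis=0)


D = 16                                   # hypothetical feature width
memory = np.zeros((100 * 2, D))          # 100 frames, 2 memory tokens each
dense_frames = np.ones((8, 32, D))       # 8 DFS frames pooled to 32 tokens each
sep_token = np.full(D, 0.5)              # stand-in for the separator tau
visual_input = build_visual_input(memory, dense_frames, sep_token)
print(visual_input.shape)
```

Under these assumptions the sequence length is $2T + 1 + 8 \times 32$ tokens for a $T$-frame video, so only the memory part grows with video length.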

3.5 Training

Instruction tuning has been a crucial training strategy for VLMs, especially for QA tasks, as demonstrated by previous works [12, 15]. Inspired by this, our training strategy is divided into two stages.

Multimodal Pretraining Stage. During the initial stage, we conduct standard multimodal alignment, keeping all network components except the perceiver frozen. This phase aims to enable ReWind to effectively capture semantic visual information without compromising the performance of the overall pipeline. Specifically, it involves contrastive learning utilizing the SigLIP [26] loss between perceiver projections and caption encodings from BERT.
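For reference, the pairwise sigmoid loss of SigLIP can be sketched as below. This is a generic NumPy illustration, not the training code used here: in SigLIP the temperature and bias are learnable, whereas this sketch fixes them to that method's initial values (10 and -10), and the way perceiver outputs are pooled into one embedding per video is left out as unspecified.

```python
import numpy as np


def siglip_loss(video_emb, text_emb, temperature=10.0, bias=-10.0):
    """Pairwise sigmoid loss over a batch of (video, caption) embedding pairs."""
    v = video_emb / np.linalg.norm(video_emb, axis=1, keepdims=True)
    t = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
    logits = temperature * (v @ t.T) + bias
    labels = 2.0 * np.eye(len(v)) - 1.0       # +1 for matching pairs, -1 otherwise
    # -log(sigmoid(x)) = log(1 + exp(-x)), computed stably with logaddexp.
    return np.logaddexp(0.0, -labels * logits).sum() / len(v)


aligned = siglip_loss(np.eye(4), np.eye(4))
shuffled = siglip_loss(np.eye(4), np.roll(np.eye(4), 1, axis=0))
print(aligned < shuffled)
```

Unlike a softmax-based contrastive loss, every video-caption pair is scored independently as a binary decision, so no normalization over the batch is needed.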

Instruction Tuning Stage. The second training stage engages the memory module, the DFS, and the LLM (fine-tuned using LoRA). This phase employs the instruction-tuning strategy on multimodal instruction-tuning datasets, aiming to integrate all network components seamlessly for the VQA and temporal grounding tasks.

4 Experiments

| Model | Num Frames | Num Tokens | Global VQA: Accuracy | Global VQA: Score | Breakpoint VQA: Accuracy | Breakpoint VQA: Score | Global Gen.: CI | Global Gen.: DO | Global Gen.: CU | Global Gen.: TU | Global Gen.: CO | Breakpoint Gen.: CI | Breakpoint Gen.: DO | Breakpoint Gen.: CU | Breakpoint Gen.: TU | Breakpoint Gen.: CO |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Video LLaMA [27] | 32 | 32 | 51.4 | 3.10 | 38.2 | 2.31 | 3.30 | 2.53 | 3.28 | 2.77 | 3.42 | 2.42 | 2.85 | 2.87 | 2.00 | 2.87 |
| Video-ChatGPT [18] | 100 | 356 | 44.2 | 2.71 | 49.8 | 2.71 | 2.48 | 2.78 | 3.03 | 2.48 | 2.99 | 3.11 | 3.32 | 3.29 | 2.62 | 3.29 |
| Video Chat [14] | 32 | 3072 | 61.0 | 3.34 | 48.3 | 2.43 | 3.26 | 3.20 | 3.38 | 2.97 | 3.47 | 2.96 | 3.09 | 3.24 | 2.46 | 3.22 |
| MovieChat [21] | 2048 | 8192 | 67.8 | 3.81 | 50.4 | 2.96 | 3.32 | 3.28 | 3.44 | 3.06 | 3.48 | 3.07 | 3.24 | 3.31 | 2.70 | 3.45 |
| ReWind (Ours) | 548* | 1184* | 80.6 | 4.46 | 57.2 | 3.4 | 4.18 | 4.00 | 4.24 | 4.02 | 3.54 | 3.41 | 3.37 | 3.64 | 2.97 | 3.61 |

Table 1: Evaluation for long VQA on MovieChat-1K test set with GPT-3.5. The best result is in bold, and the second best is underlined. ReWind uses a fixed frame rate of 1 fps, so the number of input frames varies with the video length. ‘*’ means the quantity is variable.

| Model | Charades-STA R@0.3 | Charades-STA R@0.5 | Charades-STA R@0.7 | Charades-STA mIoU |
|---|---|---|---|---|
| Video Chat [14] | 9.0 | 3.3 | 1.3 | 6.5 |
| Video LLaMA [27] | 10.4 | 3.8 | 0.9 | 7.1 |
| Video-ChatGPT [18] | 20.0 | 7.7 | 1.7 | 13.7 |
| GroundingGPT [16] | - | 29.6 | 11.9 | - |
| TimeChat [20] | - | 32.2 | 13.4 | - |
| VTimeLLM [11] | 51.0 | 27.5 | 11.4 | 31.2 |
| ReWind (Ours) | 59.0 | 41.6 | 20.53 | 39.3 |

Table 2: Temporal video grounding on Charades-STA. Best results are emphasized in bold, and second-bests are underlined. We compare against works that use only video inputs. De-emphasized results use transcribed speech in input.

4.1 Experimental Setup and Datasets

Model Settings. ReWind’s architecture is built upon the EVA-02 vision encoder (ViT-G/14) [8] and a 7B-parameter LLaMA-2 LLM [23]. The perceiver block, illustrated in Fig. 2 (b), consists of 8 sequential layers. Additionally, we utilize 32 queries for reading and perceiving information and two write queries to ensure efficient memory storage. The DFS mechanism selects 64 frames in the first selection phase and then refines this to 8 representative frames. These selected frames are then pooled into 32 tokens per frame before being integrated with the memory content.

Training Setup and Data. We pretrain ReWind on 100K video-caption pairs randomly selected from the WebVid2.5M [2] and Panda70M [6] datasets. This stage involves 10K steps with a batch size of 64, using the AdamW optimizer and cosine scheduling. The learning rate is set to 1e-4 with 500 warmup steps. For instruction tuning, we combine multimodal instruction data from VideoChatGPT [18] with the same 100K video-caption pairs used in the pretraining stage. All frames are resized to 224×224 pixels. During this stage, ReWind is trained for 100K steps with a batch size of 64, a learning rate of 5e-5, and 2K warmup steps, using the same optimizer and scheduler as in pretraining. We utilize LoRA for the LLM with a rank of 64 and alpha of 32. For temporal grounding tasks, ReWind undergoes additional fine-tuning on the DiDemo [1] and ActivityNet [4] datasets, using manually annotated QA pairs with temporal boundaries, for an extra 15K steps with the same optimizer and learning rate. Notably, the model achieves these results while being trained on only 8 V100 GPUs. Further details regarding the data and training setup can be found in the supplementary material.
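The schedule described above can be made concrete with a short sketch. The shape of the warmup (linear) and the final learning rate (zero) are assumptions, since only "cosine scheduling" and the number of warmup steps are stated; the LoRA scale uses the common convention of multiplying the low-rank update by alpha / rank.

```python
import math


def lr_at(step, peak_lr, warmup_steps, total_steps):
    """Linear warmup to peak_lr followed by cosine decay to zero (assumed shape)."""
    if step < warmup_steps:
        return peak_lr * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return 0.5 * peak_lr * (1.0 + math.cos(math.pi * min(1.0, progress)))


# Pretraining: 10K steps, peak 1e-4, 500 warmup steps.
pretrain_lrs = [lr_at(s, 1e-4, 500, 10_000) for s in (0, 499, 5_250, 10_000)]
# Instruction tuning: 100K steps, peak 5e-5, 2K warmup steps.
tuning_peak = lr_at(1_999, 5e-5, 2_000, 100_000)
# LoRA update scale under the alpha / rank convention: 32 / 64.
lora_scale = 32 / 64
```

With a rank of 64 and alpha of 32, the low-rank update is therefore scaled by 0.5 under this convention.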

4.2 Datasets and Evaluation

Long Video. We evaluate ReWind’s performance on two tasks: VQA and temporal grounding. For VQA, we use the MovieChat-1K test set [14], with an average video length of 9.13 minutes. We assess VQA performance using three metrics: accuracy, score, and generation quality, determined by comparing the generated answer to the ground truth (GT) using GPT-3.5. Accuracy measures the exact matches between answers and GT, while the score measures their proximity in meaning on a scale from 0 to 5. Generation quality is evaluated using the protocol proposed in [14] based on five metrics: correctness of information (CI), detailed orientation (DO), contextual understanding (CU), temporal understanding (TU), and consistency (CO). Each metric is assigned a score from 0 to 5 by GPT-3.5, which compares the generated answer with the GT. For temporal grounding, we use Charades-STA [9]. We measure recall at various IoU thresholds (30-70%) and mean IoU (mIoU) to compare the predicted time intervals with the GT.
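The temporal grounding metrics can be sketched as follows. This is an illustrative implementation, not the evaluation code used in the paper: the function names are hypothetical, the thresholds 0.3, 0.5, and 0.7 are one reading of the "30-70%" range, and counting a prediction as a hit when its IoU is greater than or equal to the threshold is an assumed convention.

```python
def temporal_iou(pred, gt):
    """IoU between two (start, end) intervals given in seconds."""
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = (pred[1] - pred[0]) + (gt[1] - gt[0]) - inter
    return inter / union if union > 0 else 0.0


def grounding_metrics(preds, gts, thresholds=(0.3, 0.5, 0.7)):
    """Recall at each IoU threshold and mean IoU, both in percent."""
    ious = [temporal_iou(p, g) for p, g in zip(preds, gts)]
    n = len(ious)
    metrics = {f"R@{t}": 100.0 * sum(iou >= t for iou in ious) / n for t in thresholds}
    metrics["mIoU"] = 100.0 * sum(ious) / n
    return metrics


if __name__ == "__main__":
    # Toy intervals for illustration only (not data from the paper).
    preds = [(2.0, 8.0), (10.0, 20.0), (0.0, 5.0)]
    gts = [(4.0, 10.0), (10.0, 20.0), (6.0, 9.0)]
    print(grounding_metrics(preds, gts))
```

In the toy example the three IoUs are 0.5, 1.0, and 0.0, so mIoU is 50% and only the exact match counts toward recall at the 0.7 threshold.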

Short Video. We evaluate ReWind’s performance on short-video benchmarks using the VideoChatGPT dataset and generation quality evaluation protocol.

Model               LLM           Backbone  CI    DO    CU    TU    CO    AVG
Video Chat [14]     Vicuna-7B     ViT-G     2.23  2.50  2.53  1.94  2.24  2.29
Video LLaMA [27]    Vicuna-7B     ViT-G     1.96  2.18  2.16  1.82  1.79  1.98
Video-ChatGPT [18]  Vicuna-7B     ViT-L     2.40  2.52  2.62  1.98  2.37  2.38
LLaMA Adapter [28]  LLaMA-7B      ViT-L     2.03  2.32  2.30  1.98  2.15  2.16
Chat-UniVi [12]     Vicuna1.5-7B  ViT-L     2.89  2.91  3.46  2.39  2.81  2.89
VTimeLLM [11]       Vicuna1.5-7B  ViT-L     2.78  3.10  3.40  2.49  2.47  2.85
MovieChat [21]      LLaMA2-7B     ViT-G     2.76  2.93  3.01  2.24  2.42  2.67
LLaMA-VID [15]      Vicuna-7B     ViT-G     2.96  3.00  3.53  2.46  2.51  2.89
ReWind (Ours)       LLaMA2-7B     ViT-G     2.91  2.85  3.42  2.71  2.68  2.91
Table 3: Evaluation for short VQA on VideoChatGPT test set with GPT-3.5. The best result in bold, and the second best underlined.

4.3 Results on Long Videos

VQA. The MovieChat-1K dataset is a challenging long-video benchmark with an average video length of 9.13 minutes. It contains 1,000 videos, each with multiple open-ended questions in two settings: global and breakpoint. The global setting requires processing the entire video and answering questions about its content, while the breakpoint setting involves processing the video up to a specific timestamp and answering questions about the event at that point. Table 1 presents the results for both settings on the test set, reporting ReWind’s generation quality, accuracy, and score. ReWind significantly outperforms previous approaches across all metrics, including MovieChat [21], which is specifically designed for long videos. Notably, ReWind achieves these results while utilizing approximately 1/8 of the tokens and 1/4 of the frames required by the prior best model. This demonstrates ReWind’s ability to effectively model temporal relationships over extended sequences and its efficiency in encoding information with a minimal number of tokens.

Temporal Grounding. The Charades-STA [9] test set contains manually annotated QA pairs with temporal boundaries, providing a challenging testbed for assessing a model’s understanding of event dynamics. We benchmark ReWind against existing VLM approaches and report results in Table 2. ReWind significantly outperforms all previous models that rely solely on video input across all metrics, which highlights its ability to accurately track and interpret the temporal progression of events.

4.4 Results on Short Videos

To further assess ReWind’s capabilities, we evaluate its performance on the VideoChatGPT QA test set, which features open-ended questions with more detailed answers. Results obtained with the generation evaluation protocol are presented in Table 3. ReWind achieves a higher overall average score (AVG) than all previous short- and long-video methods, demonstrating strong performance even on short videos. Notably, ReWind excels in the temporal understanding (TU) metric, confirming its ability to capture and comprehend temporal information. More experiments can be found in the supplementary material.
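As a quick check on how the AVG column relates to the five generation-quality metrics, the sketch below recomputes it for three rows of Table 3. Treating AVG as the unweighted mean of CI, DO, CU, TU, and CO rounded to two decimals is an assumption, although it reproduces the reported values for the rows shown here; the names `TABLE3` and `avg_score` are introduced only for this example.

```python
from statistics import mean

# Per-metric scores copied from Table 3, in the order (CI, DO, CU, TU, CO).
TABLE3 = {
    "Chat-UniVi": (2.89, 2.91, 3.46, 2.39, 2.81),
    "LLaMA-VID": (2.96, 3.00, 3.53, 2.46, 2.51),
    "ReWind": (2.91, 2.85, 3.42, 2.71, 2.68),
}


def avg_score(scores):
    """Unweighted mean of the five metrics, rounded to two decimals."""
    return round(mean(scores), 2)


if __name__ == "__main__":
    for model, scores in TABLE3.items():
        print(f"{model:12s} AVG = {avg_score(scores):.2f}")
```

The unrounded means are 2.892 for both Chat-UniVi and LLaMA-VID and 2.914 for ReWind, so the gap in AVG is small while the gap in TU (2.71 against 2.46 and 2.39) is larger.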

5 Ablation

Core Mechanisms. Table 4 presents an ablation of ReWind’s core components (the memory module and the DFS mechanism) on long videos. We establish a baseline model that uses 64 uniformly sampled frames and incorporates the perceiver block as an adapter layer, with each frame encoded using 32 tokens. We then progressively incorporate the memory and DFS to complete ReWind’s architecture. Note that when we add the components, the video is processed at 1 fps, and each frame is encoded with 2 tokens to align with our design. The results demonstrate that memory and DFS significantly contribute to ReWind’s performance on long videos. To assess the effectiveness of these components on shorter videos, we conduct a similar ablation using the VideoChatGPT dataset and report the findings in Table 5. Notably, combining memory and DFS leads to substantial improvements over the baseline, even when applied to short videos.
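The visual token budgets implied by these settings can be compared with a short calculation. This is a back-of-the-envelope sketch: it assumes the visual input to the LLM consists of the memory tokens plus the pooled DFS frame tokens (8 frames at 32 tokens each, as stated in the model settings), ignores instruction tokens, and uses hypothetical helper names.

```python
def baseline_tokens(num_frames=64, tokens_per_frame=32):
    """Visual tokens for the baseline: uniformly sampled frames through the adapter."""
    return num_frames * tokens_per_frame


def rewind_tokens(duration_s, fps=1.0, mem_tokens_per_frame=2,
                  dfs_frames=8, dfs_tokens_per_frame=32, use_dfs=True):
    """Visual tokens for the memory variant, optionally with DFS frames added."""
    frames = int(duration_s * fps)            # frames seen at the sampling rate
    memory = frames * mem_tokens_per_frame    # tokens written to memory
    dfs = dfs_frames * dfs_tokens_per_frame if use_dfs else 0
    return memory + dfs


if __name__ == "__main__":
    duration_s = 600  # a 10-minute video, chosen for illustration
    print("Baseline:", baseline_tokens())
    print("Mem     :", rewind_tokens(duration_s, use_dfs=False))
    print("Mem+DFS :", rewind_tokens(duration_s))
```

Under these assumptions, a 10-minute video yields 1,200 memory tokens (1,456 with DFS) from 600 frames, compared with 2,048 tokens from only 64 frames in the baseline.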

Model     Global Accuracy  Global Score  Breakpoint Accuracy  Breakpoint Score
Baseline  61.5             3.21          49.1                 2.62
Mem       76.8             4.21          52.1                 3.11
Mem+DFS   80.6             4.46          57.2                 3.40
Table 4: Ablation study on how the memory mechanism (Mem) and the DFS affect the performance of ReWind on MovieChat-1K.
Model     CI    DO    CU    TU    CO    AVG
Baseline  2.54  2.72  3.27  2.46  2.60  2.72
Mem       2.76  2.56  3.13  2.58  2.62  2.73
Mem+DFS   2.91  2.85  3.42  2.71  2.68  2.91
Table 5: Ablation study on how the memory mechanism (Mem) and the DFS affect the performance of ReWind on VideoChatGPT.

Perceiver. In our architectural design, the perceiver layer is conditioned on the text and on past information through the read queries. We validate the effect of these two elements on the perceiver in Table 6. Specifically, we start with ReWind without DFS and experiment with different conditioning settings.

Text Read Global Breakpoint
Accuracy Score Accuracy Score
74.7 4.06 51.2 3.09
69.1 3.76 48.7 2.81
76.8 4.21 52.1 3.11
Table 6: Ablation study on conditioning the perceiver with text and read information (past content).

Number of Frames vs. Performance and Memory. Figure 4 illustrates the impact of varying the number of input frames (ranging from 64 to 1024) on ReWind’s performance and GPU memory requirements. ReWind’s performance improves as the number of frames increases, reaching an optimal point at 512 frames (approximately 1 fps sampling). Beyond this point, performance declines when using 1024 frames (around 2 fps sampling). This decline is likely due to the deviation from ReWind’s training regime and the introduction of high redundancy in token representations.

Memory consumption, measured using 16-bit precision, peaks at 29GB for 1024 frames. Notably, ReWind can process a 10-minute video with less than 25GB of memory, making it compatible with standard end-user GPUs. Additionally, peak memory consumption is influenced by the choice of LLM and the number of input tokens. Utilizing different LLM quantizations (e.g., 8-bit) can substantially reduce memory requirements.
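The sampling rates quoted for 512 and 1024 frames follow from the average video length. The sketch below reproduces that arithmetic, assuming the 9.13-minute average is expressed in decimal minutes (about 548 seconds) and that frames are spread over the whole video; `effective_fps` is a hypothetical helper.

```python
AVG_VIDEO_MINUTES = 9.13  # MovieChat-1K average video length reported in the text


def effective_fps(num_frames, minutes=AVG_VIDEO_MINUTES):
    """Average sampling rate when num_frames are drawn from a video of this length."""
    return num_frames / (minutes * 60.0)


if __name__ == "__main__":
    # Frame counts explicitly mentioned in the text.
    for n in (64, 512, 1024):
        print(f"{n:>4} frames -> {effective_fps(n):.2f} fps")
```

This gives roughly 0.93 fps for 512 frames and 1.87 fps for 1024 frames, consistent with the approximate 1 fps and 2 fps figures above.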

Figure 4: Ablation study on ReWind’s performance and memory requirements on the MovieChat-1K test set for different numbers of input frames, ranging from 64 to 1024, at 16-bit precision.
Figure 5: Qualitative result on VQA. We feed ReWind the illustrated video, which is over 4 minutes long, and ask two types of questions about its content. In the first answer, we showcase ReWind’s ability to understand the extended context and highlight in red the hallucination it produces. In the second scenario, we highlight ReWind’s ability to focus on different aspects of the video by matching some of the frames selected by DFS for the given question with the corresponding details in the generated answer.

Hyperparameters. We ablate ReWind’s hyperparameters to assess their impact. First, we vary the number of tokens per frame stored in memory, without DFS, to isolate its effect. The results are presented in Table 7. Furthermore, in Table 8, we investigate DFS-specific hyperparameters: the number of selected tokens during the instruction-selection stage ($L$) and the number of final selected frames ($K_c$).

DFS vs. Uniform Sampling. Finally, we compare DFS against uniform sampling for long videos. In both scenarios, the number of selected frames is 8. The outcomes of this comparison are detailed in Table 9.
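For reference, the uniform-sampling alternative can be sketched in a few lines. The text does not specify the exact uniform scheme, so taking the center frame of each of k equal-length segments is an assumption, and `uniform_frame_indices` is a hypothetical helper; the DFS mechanism itself is not reproduced here.

```python
def uniform_frame_indices(num_frames, k=8):
    """Pick k frame indices at the centers of k equal-length segments."""
    if k <= 0 or num_frames <= 0:
        return []
    k = min(k, num_frames)  # cannot select more frames than exist
    return [int((i + 0.5) * num_frames / k) for i in range(k)]


if __name__ == "__main__":
    # 512 frames corresponds to roughly 1 fps on an average MovieChat-1K video.
    print(uniform_frame_indices(512, k=8))
```

Unlike DFS, this selection depends only on the video length, so the same frames are chosen regardless of the instruction.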

TPF  Global Accuracy  Global Score  Breakpoint Accuracy  Breakpoint Score
1    71.1             3.91          49.0                 2.87
2    76.8             4.21          52.1                 3.11
3    79.2             4.34          54.7                 3.34
4    79.7             4.41          54.9                 3.41
Table 7: Ablation study on the number of tokens per frame (TPF) stored in memory (write queries) on the MovieChat-1K test set.
$L$  $K_c$  Global Accuracy  Global Score  Breakpoint Accuracy  Breakpoint Score
16   8      77.1             4.25          39.8                 2.7
32   8      78.1             4.32          48.6                 2.9
64   8      80.6             4.46          57.2                 3.4
128  8      80.1             4.45          56.1                 3.4
64   4      77.9             4.30          52.1                 3.11
64   8      80.6             4.46          57.2                 3.40
64   16     81.5             4.52          55.2                 3.18
Table 8: Ablation study on the hyperparameters of the DFS mechanism. We explore the effects of varying the number of frames selected in the instruction-based selection stage.
FS Strategy  Global Accuracy  Global Score  Breakpoint Accuracy  Breakpoint Score
Uniform      77.5             4.29          54.1                 3.31
DFS          80.6             4.46          57.2                 3.40
Table 9: Ablation on frame selection (FS) strategy on MovieChat-1K test set. Comparison between DFS and uniform sampling.

5.1 Qualitative Results

Fig. 5 provides qualitative examples showcasing ReWind’s ability to comprehend long videos while preserving fine-grained details. We pose two types of questions: (1) a comprehensive description of the entire video content, where ReWind captures the overall narrative and key events, and (2) a question about the changing weather throughout the video, testing ReWind’s ability to track and recall information across different scenes. We highlight the model’s hallucinations in red. In (2), we highlight selected frames from DFS and their corresponding text using matching colors.

6 Conclusions

This work introduces ReWind, a novel memory-based vision-language model that enables an efficient understanding of long videos while maintaining temporal fidelity. ReWind features a dynamic learnable memory module with an innovative read-perceive-write cycle for instruction-guided feature encoding and robust temporal representation construction. Additionally, we propose an adaptive frame selection mechanism guided by memory contents to identify instruction-relevant key moments, enriching memory representations with detailed spatial information. Our evaluation demonstrates significant performance gains on various long-video benchmarks, including visual question answering and temporal grounding tasks. These results highlight ReWind’s effectiveness in comprehensive video understanding and its potential for real-world applications requiring deep temporal reasoning over extended video content.

Acknowledgments

We gratefully acknowledge Sami Remes for his invaluable contributions to this work. His expertise and insights, particularly during his time with the video understanding team at Huawei Helsinki Research Center, were instrumental in designing and implementing the perceiver block. We extend our sincere thanks for his dedication and support.

References

  • Anne Hendricks et al. [2017] Lisa Anne Hendricks, Oliver Wang, Eli Shechtman, Josef Sivic, Trevor Darrell, and Bryan Russell. Localizing moments in video with natural language. In Int. Conf. Comput. Vis., pages 5803–5812, 2017.
  • Bain et al. [2021] Max Bain, Arsha Nagrani, Gül Varol, and Andrew Zisserman. Frozen in time: A joint video and image encoder for end-to-end retrieval. In Int. Conf. Comput. Vis., pages 1–8, 2021.
  • Bommasani et al. [2021] Rishi Bommasani, Drew A Hudson, Ehsan Adeli, Russ Altman, Simran Arora, Sydney von Arx, Michael S Bernstein, Jeannette Bohg, Antoine Bosselut, Emma Brunskill, et al. On the opportunities and risks of foundation models. arXiv preprint arXiv:2108.07258, 2021.
  • Caba Heilbron et al. [2015] Fabian Caba Heilbron, Victor Escorcia, Bernard Ghanem, and Juan Carlos Niebles. Activitynet: A large-scale video benchmark for human activity understanding. In Int. Conf. Comput. Vis., pages 961–970, 2015.
  • Chen and Dolan [2011] David L. Chen and William B. Dolan. Collecting highly parallel data for paraphrase evaluation. In Association for Computational Linguistics, pages 1–10, 2011.
  • Chen et al. [2024] Tsai-Shien Chen, Aliaksandr Siarohin, Willi Menapace, Ekaterina Deyneka, Hsiang-wei Chao, Byung Eun Jeon, Yuwei Fang, Hsin-Ying Lee, Jian Ren, Ming-Hsuan Yang, et al. Panda-70m: Captioning 70m videos with multiple cross-modality teachers. In IEEE Conf. Comput. Vis. Pattern Recog., pages 13320–13331, 2024.
  • Du et al. [2016] Mingjing Du, Shifei Ding, and Hongjie Jia. Study on density peaks clustering based on k-nearest neighbors and principal component analysis. Knowledge-Based Systems, 99:135–145, 2016.
  • Fang et al. [2023] Yuxin Fang, Wen Wang, Binhui Xie, Quan Sun, Ledell Wu, Xinggang Wang, Tiejun Huang, Xinlong Wang, and Yue Cao. Eva: Exploring the limits of masked visual representation learning at scale. In IEEE Conf. Comput. Vis. Pattern Recog., pages 19358–19369, 2023.
  • Gao et al. [2017] Jiyang Gao, Chen Sun, Zhenheng Yang, and Ram Nevatia. Tall: Temporal activity localization via language query. In Int. Conf. Comput. Vis., pages 5267–5275, 2017.
  • He et al. [2024] Bo He, Hengduo Li, Young Kyun Jang, Menglin Jia, Xuefei Cao, Ashish Shah, Abhinav Shrivastava, and Ser-Nam Lim. Ma-lmm: Memory-augmented large multimodal model for long-term video understanding. In IEEE Conf. Comput. Vis. Pattern Recog., pages 13504–13514, 2024.
  • Huang et al. [2024] Bin Huang, Xin Wang, Hong Chen, Zihan Song, and Wenwu Zhu. Vtimellm: Empower llm to grasp video moments. In IEEE Conf. Comput. Vis. Pattern Recog., pages 14271–14280, 2024.
  • Jin et al. [2024] Peng Jin, Ryuichi Takanobu, Wancai Zhang, Xiaochun Cao, and Li Yuan. Chat-univi: Unified visual representation empowers large language models with image and video understanding. In IEEE Conf. Comput. Vis. Pattern Recog., pages 13700–13710, 2024.
  • Li et al. [2023a] Junnan Li, Dongxu Li, Silvio Savarese, and Steven Hoi. Blip-2: Bootstrapping language-image pre-training with frozen image encoders and large language models. In Int. Conf. Mach. Learn., pages 19730–19742. PMLR, 2023a.
  • Li et al. [2023b] Kunchang Li, Yinan He, Yi Wang, Yizhuo Li, Wenhai Wang, Ping Luo, Yali Wang, Limin Wang, and Yu Qiao. Videochat: Chat-centric video understanding. In IEEE Conf. Comput. Vis. Pattern Recog., pages 1–8, 2023b.
  • Li et al. [2024a] Yanwei Li, Chengyao Wang, and Jiaya Jia. Llama-vid: An image is worth 2 tokens in large language models. In Eur. Conf. Comput. Vis., pages 1–14, 2024a.
  • Li et al. [2024b] Zhaowei Li, Qi Xu, Dong Zhang, Hang Song, Yiqing Cai, Qi Qi, Ran Zhou, Junting Pan, Zefeng Li, Vu Tu, et al. Groundinggpt: Language enhanced multi-modal grounding model. In Association for Computational Linguistics, pages 6657–6678, 2024b.
  • Lin et al. [2023] Bin Lin, Bin Zhu, Yang Ye, Munan Ning, Peng Jin, and Li Yuan. Video-llava: Learning united visual representation by alignment before projection. In Conf. on Empirical Methods in Nat. Lang. Process., pages 1–10, 2023.
  • Maaz et al. [2024] Muhammad Maaz, Hanoona Rasheed, Salman Khan, and Fahad Khan. Video-chatgpt: Towards detailed video understanding via large vision and language models. In Association for Computational Linguistics, pages 12585–12602, 2024.
  • Piergiovanni et al. [2024] AJ Piergiovanni, Isaac Noble, Dahun Kim, Michael S Ryoo, Victor Gomes, and Anelia Angelova. Mirasol3b: A multimodal autoregressive model for time-aligned and contextual modalities. In IEEE Conf. Comput. Vis. Pattern Recog., pages 26804–26814, 2024.
  • Ren et al. [2024] Shuhuai Ren, Linli Yao, Shicheng Li, Xu Sun, and Lu Hou. Timechat: A time-sensitive multimodal large language model for long video understanding. In IEEE Conf. Comput. Vis. Pattern Recog., pages 14313–14323, 2024.
  • Song et al. [2024] Enxin Song, Wenhao Chai, Guanhong Wang, Yucheng Zhang, Haoyang Zhou, Feiyang Wu, Haozhe Chi, Xun Guo, Tian Ye, Yanting Zhang, et al. Moviechat: From dense token to sparse memory for long video understanding. In IEEE Conf. Comput. Vis. Pattern Recog., pages 18221–18232, 2024.
  • Team et al. [2024] Gemma Team, Thomas Mesnard, Cassidy Hardin, Robert Dadashi, Surya Bhupatiraju, Shreya Pathak, Laurent Sifre, Morgane Rivière, Mihir Sanjay Kale, Juliette Love, et al. Gemma: Open models based on gemini research and technology. arXiv preprint arXiv:2403.08295, pages 1–20, 2024.
  • Touvron et al. [2023] Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, et al. Llama 2: Open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288, pages 1–20, 2023.
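  • Vaswani et al. [2017] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. In Adv. Neural Inform. Process. Syst., pages 1–10, 2017.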
  • Xu et al. [2016] Jun Xu, Tao Mei, Ting Yao, and Yong Rui. Msr-vtt: A large video description dataset for bridging video and language. In IEEE Conf. Comput. Vis. Pattern Recog., pages 5288–5296, 2016.
  • Zhai et al. [2023] Xiaohua Zhai, Basil Mustafa, Alexander Kolesnikov, and Lucas Beyer. Sigmoid loss for language image pre-training. In Int. Conf. Comput. Vis., pages 11975–11986, 2023.
  • Zhang et al. [2023] Hang Zhang, Xin Li, and Lidong Bing. Video-llama: An instruction-tuned audio-visual language model for video understanding. In Conf. on Empirical Methods in Nat. Lang. Process., pages 543–553, 2023.
  • Zhang et al. [2024] Renrui Zhang, Jiaming Han, Chris Liu, Aojun Zhou, Pan Lu, Yu Qiao, Hongsheng Li, and Peng Gao. Llama-adapter: Efficient fine-tuning of large language models with zero-initialized attention. In Int. Conf. Learn. Represent., pages 1–10, 2024.
  • Zhou et al. [2022] Kaiyang Zhou, Jingkang Yang, Chen Change Loy, and Ziwei Liu. Learning to prompt for vision-language models. Int. J. Comput. Vis., 130(9):2337–2348, 2022.
  • Zhu et al. [2024] Bin Zhu, Bin Lin, Munan Ning, Yang Yan, Jiaxi Cui, HongFa Wang, Yatian Pang, Wenhao Jiang, Junwu Zhang, Zongwei Li, et al. Languagebind: Extending video-language pretraining to n-modality by language-based semantic alignment. In Int. Conf. Learn. Represent., pages 1–10, 2024.