ReWind: Understanding Long Videos with Instructed Learnable Memory

Anxhelo Diko¹ Tinghuai Wang²¹¹footnotemark: 1 Wassim Swaileh² Shiyan Sun² Ioannis Patras²
¹La Sapienza University of Roma ²Huawei Helsinki Research Center
diko@di.uniroma1.it {tinghuaiwang,shiyansun,wassim.swaileh,ioannis.patras}@huawei.com Equal ContributionWork done while at Huawei

Abstract

Vision-Language Models (VLMs) are crucial for applications requiring integrated understanding textual and visual information. However, existing VLMs struggle with long videos due to computational inefficiency, memory limitations, and difficulties in maintaining coherent understanding across extended sequences. To address these challenges, we introduce ReWind, a novel memory-based VLM designed for efficient long video understanding while preserving temporal fidelity. ReWind operates in a two-stage framework. In the first stage, ReWind maintains a dynamic learnable memory module with a novel read-perceive-write cycle that stores and updates instruction-relevant visual information as the video unfolds. This module utilizes learnable queries and cross-attentions between memory contents and the input stream, ensuring low memory requirements by scaling linearly with the number of tokens. In the second stage, we propose an adaptive frame selection mechanism guided by the memory content to identify instruction-relevant key moments. It enriches the memory representations with detailed spatial information by selecting a few high-resolution frames, which are then combined with the memory contents and fed into a Large Language Model (LLM) to generate the final answer. We empirically demonstrate ReWind’s superior performance in visual question answering (VQA) and temporal grounding tasks, surpassing previous methods on long video benchmarks. Notably, ReWind achieves a +13% score gain and a +12% accuracy improvement on the MovieChat-1K VQA dataset and an +8% mIoU increase on Charades-STA for temporal grounding.

1 Introduction

Large Language Models (LLMs) [23, 22] have demonstrated remarkable capabilities at human language processing [24, 3]. However, these models are limited to text-based inputs and, therefore, oblivious to real-world, multi-sensory information. To address this limitation, researchers are actively developing Multimodal LLMs (MLLMs) capable of processing signals from multiple and diverse modalities [13, 15, 18], including images, video, and audio. This emerging field holds immense potential for applications such as visual question answering (VQA), real-time interfaces for autonomous agents, and generating detailed scene descriptions for the visually impaired.

Refer to caption — Figure 1: ReWind is a memory-based VLM framework designed for long video understanding (10+ minutes), specialized in VQA and temporal grounding. The highlighted frames are selected from ReWind’s dynamic frame selection mechanism.

Recent research in MLLMs has predominantly concentrated on Vison-Language Models (VLMs) [27, 28, 11], which typically combine pre-trained LLMs with visual encoders that encode and feed to them visual information. However, existing VLMs face two major challenges in processing long videos. First, their self-attention mechanisms require substantial memory that scales quadratically with the number of tokens, making long video processing computationally intensive. Second, these models struggle to effectively model temporal dependencies over extended sequences. To address these challenges, recent efforts have proposed using memory modules to enhance the capability of VLMs [21, 10]. However, current memory modules often serve as storage units and lack the ability to discern and retain information pertinent to the task or user instructions. Moreover, these models tend to compress temporal information heavily [21], sacrificing the fidelity of the temporal dynamics and overlooking critical details in the video’s narrative. Additionally, current models rely on fixed dense spatial representations per frame [10, 21], increasing memory requirements: by treating all frames equally, they store unnecessary details for non-essential moments, increasing memory demands and limiting the model’s ability to focus on critical events for accurate video comprehension.

To address these long video challenges, we introduce ReWind, a novel memory-based framework that operates in two stages, advancing the state of the art with key innovations. In the first stage (Stage-1 in Fig. 2), through a learnable memory module, Rewind enables instruction-guided feature encoding and storage into a memory bank of coherent temporal information. At its core, ReWind features a novel read-perceive-write cycle: First (Read Cross Attention), a read operation looks at historical context from memory and produces fixed-size read queries. Then those are used as queries in a perceiver unit (Perceiver Block) that processes tokens from the encoder of incoming frames. Unlike previous Q-Former approaches (e.g.,[27]) that compress information at clip-level, our novel design allows memory-informed processing of the incoming frames and preserves temporal fidelity. Finally, the perceiver’s representations of the input tokens flow into the write operation (Write Cross Attention), where learnable write queries distill and filter information through. The resulting compact representations are then stored in memory, enabling ReWind to progressively build coherent temporal representations while avoiding the compression issues present in previous works [21, 10, 27]. Crucially, in this stage, we avoid cross-attention between the memory and the video stream, as well as self-attention within the stream tokens with high computational demand. In the second stage (Stage-2 in Fig. 2), ReWind ’rewinds’ the video stream and dynamically selects frames by a selection mechanism guided by the memory contents and the user instructions. The selection mechanism operates on high spatial resolution tokens from the input stream so that after selection, tokens from both the memory bank and the dense selection outputs are fed into an LLM that generates the response. By contrast to previous works that maintain fixed-size dense representations for each frame [21, 27], this selection strategy incorporates detailed spatial information only for relevant key events, resulting in reduced memory requirements.

In our extensive evaluations, ReWind demonstrates superior performance compared to previous state-of-the-art methods across both long and short-term video question answering [21, 5, 25, 14] and temporal grounding video benchmarks [9, 4], validating the effectiveness of our approach. Additionally, detailed ablation motivates our design choices. In summary, the main contributions of this work are threefold:

•

ReWind, a novel memory-based vision-language model that enables efficient understanding of long videos while maintaining temporal fidelity.
•

A learnable memory module with an innovative read-perceive-write cycle that enables instruction-guided feature encoding and robust temporal representation construction.
•

An adaptive frame selection mechanism that identifies instruction-relevant key moments and enriches memory representations with detailed spatial information for comprehensive video understanding.

2 Related Works

2.1 Short Video Understanding

Recent VLMs have explored various architectural approaches for video understanding. Dual-stream architectures, exemplified by Video-LLaMA [27] and VideoChat [14], process different modalities separately. The former processes both audio and visual information separately using Q-Formers [29]. The latter processes video using specialized embedding models and a perception toolkit for mixed modalities. In contrast, single-stream approaches like Video-ChatGPT [18] employ spatiotemporal pooling to capture the overall video context. Video-LLaVA [17] utilizes a LanguageBind [30] module to map multimodal inputs into a shared space. Mirasol3B [19] proposes a decoder-only model adapted to handle multimodal input, representing them in disentangled spaces. ChatUniVi [12] takes a unique approach by introducing a unified visual representation through dynamic visual tokens for both images and videos.

2.2 Long Video Understanding

Recent works have proposed diverse solutions to address the challenges pertinent to long video understanding. Memory-based approaches include MovieChat [21], which employs a dual memory module with a FIFO queue for short-term memory and a consolidation module for long-term memory, and MA-LMM [10], which introduces a hierarchical memory module. TimeChat [20] incorporates timestamps and transcribed speech for time-aware encoding. However, these approaches [21, 10, 20] significantly compress temporal information, compromising the understanding of event dynamics. Alternative approaches focus on efficient frame representation. LLaMA-VID [15] efficiently represents each frame with only two tokens. VTimeLLM [11] introduces temporal-focused training and uses only the class tokens as frame representations. Yet, both VTimeLLM and LLaMA-VID process frames in isolation, failing to capture coherent temporal representations.

Unlike previous works that either significantly compress temporal information [21, 10, 20] or process frames in isolation [15], ReWind distinguishes itself by proposing a novel memory-based architecture with a read-perceive-write cycle that selectively stores instruction-relevant visual information while enabling efficient processing of long videos and maintaining temporal fidelity. As opposed to approaches that maintain fixed dense representations [21, 10], ReWind employs an adaptive frame selection mechanism that enriches memory representations with detailed spatial information only for instruction-relevant key moments.

3 Method

ReWind enables efficient long-video understanding through a novel memory-based architecture that maintains temporal fidelity while selectively storing instruction-relevant information. As shown in Fig. 2 (a), the architecture implements this through two-stage processing. Stage-1, namely read-perceive-write cycle, comprises: (1) a vision encoder, (2) a text encoder for instruction processing, (3) a instruction-aware perceiver that bridges visual features and LLM understanding, and (4) a memory module with learnable read and write operations for efficient information storage. Stage-2, the Selection, comprises a dynamic frame selection (DFS) mechanism that enriches memory representations with detailed spatial information for key moments. These two stages work in concert to enable the LLM to generate responses based on both the instruction and video content. We explain Stage-1 components in Sections 3.1 and 3.2, and the DFS in Section 3.3. Finally, we detail the LLM input formation and the training strategy in Sections 3.4 and3.5.

3.1 Visual Feature Extraction

To process long videos under GPU memory constraints, ReWind divides input video $V$ containing $T$ frames into $N$ sub-clips $S={s_{1},s_{2},\dots,s_{N}}$ , each with $F$ frames ( $N=T/F$ ). For each frame $f_{ij}$ in sub-clip $s_{i}$ , a pre-trained ViT-G/14 encoder from EVA-CLIP [8] extracts visual features as a sequence of tokens $P_{ij}$ .

3.2 Instructed Memory Architecture

At the core of ReWind lies its novel read-perceive-write cycle that enables progressive video understanding while maintaining temporal fidelity. This cycle orchestrates the interaction between a long-term memory bank for storing distilled video representations, an instruction-aware temporal perceiver for temporal representation construction, and learnable read-write functions for memory interaction, as illustrated in Fig. 3.

To effectively process long videos, ReWind’s memory module selectively stores instruction-relevant information from incoming frames while enabling progressive information accumulation. The module centers on a long-term memory bank $M$ and learnable read-write functions that bridge memory content with perceiver features. The read operation, using learnable queries $Q_{R}$ , first retrieves historical context from $M$ . These read queries then initialize the perceiver’s queries for instruction-guided visual feature extraction from ViT outputs. Finally, learnable write queries $Q_{W}$ distill the perceiver’s output through cross-attention for efficient storage in $M$ . Additionally, original visual features are preserved in a feature buffer for potential detailed spatial analysis. This tight integration between memory operations and the perceiver ensures temporally coherent representations while maintaining computational efficiency.

3.2.1 Read Operation

The read operation aims to facilitate dynamic, context-aware feature extraction. This interface enables continuous interaction between the feature extraction process and the evolving memory content in $M$ . Specifically, as the memory gets populated with information from previously processed video segments, the read interface uses a fixed number $N_{R}$ (i.e., 32) of read queries $Q_{R}$ to actively retrieve relevant context through a cross-attention mechanism between them and the contents of $M$ as depicted in Fig. 2 (a). This retrieval process enables the feature extraction pipeline to remain informed by the most recent knowledge stored in the memory. These context-enriched read queries then guide the perceiver’s processing of incoming frames, ensuring that feature extraction maintains awareness of previously stored temporal information.

3.2.2 Perceive Operation

The perceive operation, performed by a perceiver block, bridges visual features and the LLM’s understanding through instruction-aware temporal modeling. As illustrated in Figure 2 (b), the design of perceiver allows for effective integration of instruction-guided features with historical context. As such, it utilizes a set of $N_{Q}$ learnable queries, $Q$ , to project $P_{ij}$ into a latent space that LLM can understand. These learnable queries guide the extraction of relevant information from the visual features.

A crucial aspect of ReWind’s design is the synergistic relationship between the perceiver block and the memory module. The learnable $Q$ in the perceiver block share the same weights and are initialized with the current content of the read queries $Q_{R}$ obtained by the cross attention between $Q_{R}$ and the contents of $M$ (note that this implies $N_{Q}=N_{R}$ ). This creates a continuous pipeline, allowing the feature extraction process to dynamically interact with the memory and access relevant context. To further enhance this process, the perceiver block incorporates the textual embedding of the user instruction, denoted as $I$ . This embedding is obtained by encoding the input text query using a pre-trained BERT encoder. $I$ is then appended to the visual features $P_{ij}$ to form extended representations, denoted as $\hat{P_{ij}}$ . As depicted in Fig. 2 (b), the perceiver block employs a cross-attention mechanism between $Q$ and the combined visual-textual features $\hat{P_{ij}}$ . This cross-attention allows the model to selectively attend to the most relevant aspects of the visual information conditioned on the user instruction. The output of this mechanism is a set of refined frame-level representations, denoted as $\hat{Q}_{ij}$ .

Since $Q$ shares weights and content with updated read queries $Q_{R}$ , the feature extraction process is also conditioned by the memory content. Specifically, as denoted in Fig. 2 (a), $Q_{R}$ are always updated with the latest memory content before being used by the perceiver, assuming $M$ has content in it. This ensures a progressive construction of robust and temporally informed representations of the video content at the frame level. Finally, to capture temporal relationships within the clip $s_{i}$ , the perceiver performs self-attention on the temporal dimension of the refined representations $\hat{Q}_{ij}$ of consecutive frames. This temporal attention allows the model to understand how events unfold inside the clip. Note that unlike previous video Q-formers [27, 21] that produce clip-level representations, our perceiver processes each frame individually and then performs temporal attention. This approach preserves temporal fidelity and enables a more nuanced understanding of event dynamics while keeping a robust representation of each frame.

3.2.3 Write Operation

The write operation efficiently distills and stores the perceiver’s frame-level output $\hat{Q}_{ij}$ in memory. While these outputs capture rich spatial and contextual information, their sheer number of queries impedes processing long videos. To address this, ReWind’s learnable writing mechanism compresses the visual information into a more efficient representation. This mechanism utilizes a set of learnable write queries, $Q_{W}$ , to distill the scene information into a much smaller number of tokens (e.g., 2 tokens per frame). Specifically, ReWind employs cross-attention between $Q_{W}$ and $\hat{Q}_{ij}$ to generate compact per-frame representations $\hat{Q}^{W}_{ij}$ . These representations are then stored in the memory bank $M$ in temporal order, enabling the progressive construction of temporally coherent video representations. Additionally, the original visual features $P_{ij}$ for each frame are stored in a separate feature buffer, preserving the detailed spatial information for later use (see Section 3.3). This feature buffer is a simple storage container and does not impact computational resources.

3.3 Dynamic Frame Selection

While $M$ efficiently stores a compressed video representation, certain instructions demand high spatial resolution at specific moments. ReWind addresses this through a Dynamic Frame Selection (DFS) mechanism that identifies instruction-relevant key frames using memory contents $M$ and instruction encoding $I$ . This two-stage selection process, comprising instruction-based selection and clustering, occurs during Stage-2 of ReWind after full video processing and information storage in memory.

Instruction based selection. The first stage prioritizes frames based on their relevance to the user’s instruction by leveraging $I$ , and contents of $M$ . Given frame representations $\{m_{t}\}_{t=1}^{T}\in M$ and averaged instruction encoding $\overline{I}$ , we compute the attention matrix between $\overline{I}$ and the contents of $M$ . The top $L$ frames with the highest response scores to the instruction, denoted as $Z=\{z_{l}\}_{l=1}^{L}$ , are then selected for further processing in the second DFS stage.

Clustering. The second stage employs a K-nearest neighbors density peaks clustering approach inspired by DPC-KNN [7, 12] to identify $K_{c}$ representative frames from $Z$ . For each token $z_{l}$ , we first compute its local density $\sigma_{l}$ based on its $K$ -nearest neighbors:

\sigma_{l}=exp(-\frac{1}{K}\sum_{z_{k}\in KNN(z_{l},Z)}||z_{k}-z_{l}||^{2}),

(1)

where KNN( $z_{l}$ , $Z$ ) returns the $K$ -nearest neighbors of $z_{l}$ from $Z-\{z_{l}\}$ ¹¹1 $Z-\{z_{l}\}$ denotes removing $z_{l}$ from the set $Z$ .. Then, we compute each token’s distance index $\rho_{l}$ of $z_{l}$ :

\rho_{l}=\begin{cases}\underset{j:\sigma_{j}>\sigma_{l}}{\min}||z_{j}-z_{l}||^% {2}&\textit{if }\exists\textit{ j }\textit{ s.t. }\sigma_{j}>\sigma_{l},\\ \phantom{j}\underset{j}{\max}\phantom{j}||z_{j}-z_{l}||^{2}&otherwise.\\ \end{cases}

(2)

In essence, $\rho_{l}$ represents the distance between the given token $z_{l}$ from other high-density tokens. We then use $\sigma_{l}\times\rho_{l}$ as the weighted density index for each $z_{l}$ and sort them in descending order. The top $K_{c}$ frames with the highest indices are selected as the most representative video moments related to the user instruction. We use the indices of these $K_{c}$ centers to extract the representations with a higher spatial resolution for each frame from the feature buffer containing ViT encodings. Finally, these representations are pooled to a desired number of tokens per frame denoted as $\hat{Z}$ . This mechanism effectively “rewinds” through the video’s latent space to identify key moments, inspiring ReWind’s name.

3.4 Large Language Model

The input to the LLM is constructed by concatenating $M$ with the dense representations $\hat{Z}$ , separated by a special token $\tau$ : $<m_{0},m_{1},\dots,\tau,\hat{Z}>$ . The role of $\tau$ is purely to separate the memory content with progressive temporal information from the DFS frames where the spatial information is prioritized. The video content is then combined with the text instruction and given in input to the LLM.

3.5 Training

Instruction tuning has been a crucial training strategy for VLMs, especially for the QA tasks, as demonstrated from previous works [12, 15]. Inspired by this, our training strategy is divided into two stages.

Multimodal Pretraining Stage. During the initial stage, we conduct standard multimodal alignment, keeping the network components except the perceiver frozen. This phase aims to empower our ReWind to effectively capture semantic visual information without any compromise in the performance of the overall pipeline. Specifically, it involves contrastive learning utilizing the SigLIP [26] loss between perceiver projections and caption encodings from BERT.

Instruction Tuning Stage. The second training stage engages the memory module, the DFS, and the LLM (fine-tuned using LoRA). This phase employs the instruction-tuning strategy on multimodal instruction-tuning datasets, aiming to integrate all network components seamlessly for the VQA and temporal grounding tasks.

4 Experiments

Model	Num Frames	Num Tokens	Global VQA		Breakpoint VQA		Global Generation					Breakpoint Generation
Model	Num Frames	Num Tokens	Accuracy	Score	Accuracy	Score	CI	DO	CU	TU	CO	CI	DO	CU	TU	CO
Video LLaMA [27]	32	32	51.4	3.10	38.2	2.31	3.30	2.53	3.28	2.77	3.42	2.42	2.85	2.87	2.00	2.87
Video-ChatGPT [18]	100	356	44.2	2.71	49.8	2.71	2.48	2.78	3.03	2.48	2.99	3.11	3.32	3.29	2.62	3.29
Video Chat [14]	32	3072	61.0	3.34	48.3	2.43	3.26	3.20	3.38	2.97	3.47	2.96	3.09	3.24	2.46	3.22
MovieChat [21]	2048	8192	67.8	3.81	50.4	2.96	3.32	3.28	3.44	3.06	3.48	3.07	3.24	3.31	2.70	3.45
ReWind (Ours)	548*	1184*	80.6	4.46	57.2	3.4	4.18	4.00	4.24	4.02	3.54	3.41	3.37	3.64	2.97	3.61

Table 1: Evaluation for long VQA on MovieChat-1K test set with GPT-3.5. The best result is in bold, and the second best is underlined. Rewind uses a fixed frame rate of 1fps, so the number of frames in input varies based on the video length. ’*’ means the quantity is variable.

Model	Charades-STA
Model	R@0.3	R@0.5	R@0.7	mIoU
Video Chat [14]	9.0	3.3	1.3	6.5
Video LLaMA [27]	10.4	3.8	0.9	7.1
Video-ChatGPT [18]	20.0	7.7	1.7	13.7
GroundingGPT [16]	-	29.6	11.9	-
TimeChat [20]	-	32.2	13.4	-
VTimeLLM [11]	51.0	27.5	11.4	31.2
ReWind (Ours)	59.0	41.6	20.53	39.3

Table 2: Temporal video grounding on Charades-STA. Best results are emphasized in bold, and second-bests are underlined. We compare against works that use only video inputs. De-emphasized results use transcribed speech in input.

4.1 Experimental Setup and Datasets

Model Settings. ReWind’s architecture is built upon the EVA-02 vision encoder (ViT-G/14) [8] and a 7B-parameter LLaMA-2 LLM [23]. The perceiver block, illustrated in Fig. 2 (b), consists of 8 sequential layers. Additionally, we utilize 32 queries for reading and perceiving information and two write queries to ensure efficient memory storage. The DFS mechanism selects 64 frames in the first selection phase and then refines this to 8 representative frames. These selected frames are then pooled into 32 tokens per frame before being integrated with the memory content.

Training Setup and Data. We pretrain ReWind on 100K video-caption pairs randomly selected from the WebVid2.5M [2] and Panda70M [6] datasets. This stage involves 10K steps with a batch size of 64, using the AdamW optimizer and cosine scheduling. The learning rate is set to 1e-4 with 500 warmup steps. For instruction tuning, we combine multimodal instruction data from VideoChatGPT [18] with the same 100,000 video-caption pairs used in the pretraining stage. All frames are resized to 224 $\times$ 224 pixels. During this stage, ReWind is trained for 100,000 steps with a batch size of 64, a learning rate of 5e-5, and 2,000 warmup steps, using the same optimizer and scheduler as in pretraining. We utilize LoRA for the LLM with a rank of 64 and alpha of 32. For temporal grounding tasks, ReWind undergoes additional fine-tuned on DiDemo [1] and ActivityNet [4] datasets with manually annotated QA pairs with temporal boundaries for an extra 15K steps using the same optimizer and learning rate. Remarkably, our model can obtain great results while being trained on only 8 $\times$ V100 GPUs. Further details regarding the data and training setup can be found in the supplementary material.

4.2 Datasets and Evaluation

Long Video. We evaluate ReWind’s performance on two tasks: VQA and temporal grounding. For VQA, we use the MovieChat-1K test set[14], with a video average length of 9.13 minutes. We assess VQA performance using three metrics: accuracy, score, and generation quality, determined by comparing the generated answer to the ground truth (GT) using GPT-3.5. Accuracy measures the exact matches between answers and GT, while the score measures their proximity in meaning with a score from 0 to 5. Generation quality is evaluated using the protocol proposed in [14] based on five metrics: correctness of information (CI), detailed orientation (DO), contextual understanding (CU), temporal understanding (TU), and consistency (CO). Each metric is assigned a score from 0 to 5 by GPT-3.5 by comparing the generated answer and the GT. For temporal grounding, we use Charades-STA [9]. We measure recall at various thresholds (30-70%) and mean IoU (mIoU) to compare the predicted time intervals with the GT.

Short Video. We evaluate ReWind’s performance on short-video benchmarks using the VideoChatGPT dataset and generation quality evaluation protocol.

Model	LLM	Backbone	CI	DO	CU	TU	CO	AVG
Video Chat [14]	Vicuna-7B	ViT-G	2.23	2.50	2.53	1.94	2.24	2.29
Video LLaMA [27]	Vicuna-7B	ViT-G	1.96	2.18	2.16	1.82	1.79	1.98
Video-ChatGPT [18]	Vicuna-7B	ViT-L	2.40	2.52	2.62	1.98	2.37	2.38
LLaMA Adapter [28]	LLaMA-7B	ViT-L	2.03	2.32	2.30	1.98	2.15	2.16
Chat-UniVi [12]	Vicuna1.5-7B	ViT-L	2.89	2.91	3.46	2.39	2.81	2.89
VTimeLLM [11]	Vicuna1.5-7B	ViT-L	2.78	3.10	3.40	2.49	2.47	2.85
MovieChat [21]	LLaMA2-7B	ViT-G	2.76	2.93	3.01	2.24	2.42	2.67
LLaMA-VID [15]	Vicuna-7B	ViT-G	2.96	3.00	3.53	2.46	2.51	2.89
ReWind (Ours)	LLaMA2-7B	ViT-G	2.91	2.85	3.42	2.71	2.68	2.91

Table 3: Evaluation for short VQA on VideoChatGPT test set with GPT-3.5. The best result in bold, and the second best underlined.

4.3 Results on Long Videos

VQA. The MovieChat-1K dataset is a challenging long-video benchmark with an average video length of 9.13 minutes. It contains 1,000 videos, each with multiple open-ended questions in two settings: global and breakpoint. The global setting requires processing the entire video and answering questions about its content, while the breakpoint mode involves processing the video up to a specific timestamp and answering questions about the event at that point. Table 1 presents the results for both settings on the test set, showcasing ReWind’s performance on generation quality and accuracy-score metrics. The analysis reveals that ReWind significantly outperforms previous approaches across all metrics, particularly surpassing MovieChat [21], specifically designed for long videos. Notably, ReWind achieves these superior results while utilizing approximately 1/8 of the tokens and 1/4 of the frames required by the prior best model. This demonstrates ReWind’s ability to effectively model temporal relationships over extended sequences and its efficiency in encoding information with a minimal number of tokens.

Temporal Grounding. Charades-STA [9] test set contains manually annotated QA pairs with temporal boundaries, providing a challenging testbed for assessing a model’s understanding of event dynamics. We benchmark ReWind against existing VLM approaches and report results in Table 2. Notably, ReWind significantly outperforms all previous models that rely solely on video input across all metrics. This highlights ReWind’s exceptional ability to accurately track and interpret the temporal progression of events.

4.4 Results on Short Videos

To further assess ReWind’s capabilities, we evaluate its performance on the VideoChatGPT QA test set, which features open-ended questions with more detailed answers. Utilizing the generation evaluation protocol, the results are presented in Table 3. ReWind achieves a higher overall average score (AVG) than all previous short and long-term methods, demonstrating strong performance even in short videos. Notably, ReWind excels in the temporal understanding (TU) metric, confirming its superior ability to capture and comprehend temporal information. More experiments can be found on the supplementary material.

5 Ablation

Core Mechanisms. Table 4 presents an ablation of ReWind’s core components — the memory module, and the DFS mechanism — on long videos. We establish a baseline model that uses 64 uniformly sampled frames and incorporates the perceiver block as an adapter layer, with each frame encoded using 32 tokens. We then progressively incorporate the memory and DFS to complete ReWind’s architecture. Note that when we add the components, the video is processed at 1 fps, and each frame is encoded with 2 tokens to align with our design. The results demonstrate that memory and DFS significantly contribute to ReWind’s performance on long videos. To assess the effectiveness of these components on shorter videos, we conduct a similar ablation using the VideoChatGPT dataset, which consists of short videos, and report the findings in Table 5. Notably, combining memory and DFS leads to substantial improvements over the baseline, even when applied to short videos.

Model	Global		Breakpoint
Model	Accuracy	Score	Accuracy	Score
Baseline	61.5	3.21	49.1	2.62
Mem	76.8	4.21	52.1	3.11
Mem+DFS	80.6	4.46	57.2	3.40

Table 4: Ablation study on how the memory mechanism (Mem) and the DFS affect the performance of Rewind on MovieChat-1K.

Model	CI	DO	CU	TU	CO	AVG
Baseline	2.54	2.72	3.27	2.46	2.60	2.72
Mem	2.76	2.56	3.13	2.58	2.62	2.73
Mem+DFS	2.91	2.85	3.42	2.71	2.68	2.91

Table 5: Ablation study on how the memory mechanism (Mem) and the DFS affect the performance of Rewind on VideoChatGPT.

Perceiver. In our architectural design, the perceiver layer is conditioned on the text and past information through reading queries. We validate the effects these elements have on the perceiver in Table 6. Particularly, we start with ReWind without DFS and experiment with different conditions.

Text	Read	Global		Breakpoint
Text	Read	Accuracy	Score	Accuracy	Score
✗	✓	74.7	4.06	51.2	3.09
✓	✗	69.1	3.76	48.7	2.81
✓	✓	76.8	4.21	52.1	3.11

Table 6: Ablation study on conditioning the perceiver with text and read information (past content).

Number of Frames vs. Performance and Memory. Figure 4 illustrates the impact of varying the number of input frames (ranging from 64 to 1024) on ReWind’s performance and GPU memory requirements. ReWind’s performance improves as the number of frames increases, reaching an optimal point at 512 frames (approximately 1 fps sampling). Beyond this point, performance declines when using 1024 frames (around 2 fps sampling). This decline is likely due to the deviation from ReWind’s training regime and the introduction of high redundancy in token representations.

Memory consumption, measured using 16-bit precision, peaks at 29GB for 1024 frames. Notably, ReWind can process a 10-minute video with less than 25GB of memory, making it compatible with standard end-user GPUs. Additionally, peak memory consumption is influenced by the choice of the LLM, and the number of input tokens. Utilizing different LLM quantizations (e.g., 8-bit) can substantially reduce memory requirements.

Hyperparameters. We ablate on the hyperparameters of ReWind to assess their impact. Initially, we vary the number of tokens per frame stored in memory without DFS to clearly understand its individual effect. The results are presented in Table 7. Furthermore, in Table 8, we investigate DFS-specific hyperparameters: the number of selected tokens during the instruction-selection stage ( $L$ ) and the number of final selected frames ( $K_{c}$ ).

DFS vs. Uniform Sampling. Finally, we ablate the benefits of having DFS to uniform sampling for long videos. In both scenarios, the number of selected frames is 8. The outcomes of this comparison are detailed in Table 9.

TPF	Global		Breakpoint
TPF	Accuracy	Score	Accuracy	Score
1	71.1	3.91	49.0	2.87
2	76.8	4.21	52.1	3.11
3	79.2	4.34	54.7	3.34
4	79.7	4.41	54.9	3.41

Table 7: Ablation study on the numbers of tokens per frame (TPF) stored in memory (write queries) on MovieChat-1K test set.

$L$	$K_{c}$	Global		Breakpoint
$L$	$K_{c}$	Accuracy	Score	Accuracy	Score
16	8	77.1	4.25	39.8	2.7
32	8	78.1	4.32	48.6	2.9
64	8	80.6	4.46	57.2	3.4
128	8	80.1	4.45	56.1	3.4
64	4	77.9	4.30	52.1	3.11
64	8	80.6	4.46	57.2	3.40
64	16	81.5	4.52	55.2	3.18

Table 8: Ablation study on the hyperparameters of the DFS mechanism. We explore the effects of varying the number of frames selected in the instruction-based selection stage.

FS Strategy	Global		Breakpoint
FS Strategy	Accuracy	Score	Accuracy	Score
Uniform	77.5	4.29	54.1	3.31
DFS	80.6	4.46	57.2	3.40

Table 9: Ablation on frame selection (FS) strategy on MovieChat-1K test set. Comparison between DFS and uniform sampling.

5.1 Qualitative Results

Fig. 5 provides qualitative examples showcasing ReWind’s ability to comprehend long videos while preserving fine-grained details. We pose two types of questions: (1) a comprehensive description of the entire video content, where ReWind captures the overall narrative and key events, and (2) a question about the changing weather throughout the video, testing ReWind’s ability to track and recall information across different scenes. We highlight the model’s hallucinations in red. In (2), we highlight selected frames from DFS and their corresponding text using matching colors.

6 Conclusions

This work introduces ReWind, a novel memory-based vision-language model that enables an efficient understanding of long videos while maintaining temporal fidelity. ReWind features a dynamic learnable memory module with an innovative read-perceive-write cycle for instruction-guided feature encoding and robust temporal representation construction. Additionally, we propose an adaptive frame selection mechanism guided by memory contents to identify instruction-relevant key moments, enriching memory representations with detailed spatial information. Our evaluation demonstrates significant performance gains on various long-video benchmarks, including visual question answering and temporal grounding tasks. These results highlight ReWind’s effectiveness in comprehensive video understanding and its potential for real-world applications requiring deep temporal reasoning over extended video content.

Acknowledgments

We gratefully acknowledge Sami Remes for his invaluable contributions to this work. His expertise and insights, particularly during his time with the video understanding team at Huawei Helsinki Research Center, were instrumental in designing and implementing the perceiver block. We extend our sincere thanks for his dedication and support.

References

Anne Hendricks et al. [2017] Lisa Anne Hendricks, Oliver Wang, Eli Shechtman, Josef Sivic, Trevor Darrell, and Bryan Russell. Localizing moments in video with natural language. In Int. Conf. Comput. Vis., pages 5803–5812, 2017.
Bain et al. [2021”] Max Bain, Arsha Nagrani, Gül Varol, and Andrew Zisserman. Frozen in time: A joint video and image encoder for end-to-end retrieval. In Int. Conf. Comput. Vis., pages 1–8, 2021”.
Bommasani et al. [2021] Rishi Bommasani, Drew A Hudson, Ehsan Adeli, Russ Altman, Simran Arora, Sydney von Arx, Michael S Bernstein, Jeannette Bohg, Antoine Bosselut, Emma Brunskill, et al. On the opportunities and risks of foundation models. arXiv preprint arXiv:2108.07258, 2021.
Caba Heilbron et al. [2015] Fabian Caba Heilbron, Victor Escorcia, Bernard Ghanem, and Juan Carlos Niebles. Activitynet: A large-scale video benchmark for human activity understanding. In Int. Conf. Comput. Vis., pages 961–970, 2015.
Chen and Dolan [2011] David L. Chen and William B. Dolan. Collecting highly parallel data for paraphrase evaluation. In Association for Computational Linguistics, pages 1–10, 2011.
Chen et al. [2024] Tsai-Shien Chen, Aliaksandr Siarohin, Willi Menapace, Ekaterina Deyneka, Hsiang-wei Chao, Byung Eun Jeon, Yuwei Fang, Hsin-Ying Lee, Jian Ren, Ming-Hsuan Yang, et al. Panda-70m: Captioning 70m videos with multiple cross-modality teachers. In IEEE Conf. Comput. Vis. Pattern Recog., pages 13320–13331, 2024.
Du et al. [2016] Mingjing Du, Shifei Ding, and Hongjie Jia. Study on density peaks clustering based on k-nearest neighbors and principal component analysis. Knowledge-Based Systems, 99:135–145, 2016.
Fang et al. [2023] Yuxin Fang, Wen Wang, Binhui Xie, Quan Sun, Ledell Wu, Xinggang Wang, Tiejun Huang, Xinlong Wang, and Yue Cao. Eva: Exploring the limits of masked visual representation learning at scale. In IEEE Conf. Comput. Vis. Pattern Recog., pages 19358–19369, 2023.
Gao et al. [2017] Jiyang Gao, Chen Sun, Zhenheng Yang, and Ram Nevatia. Tall: Temporal activity localization via language query. In Int. Conf. Comput. Vis., pages 5267–5275, 2017.
He et al. [2024] Bo He, Hengduo Li, Young Kyun Jang, Menglin Jia, Xuefei Cao, Ashish Shah, Abhinav Shrivastava, and Ser-Nam Lim. Ma-lmm: Memory-augmented large multimodal model for long-term video understanding. In IEEE Conf. Comput. Vis. Pattern Recog., pages 13504–13514, 2024.
Huang et al. [2024] Bin Huang, Xin Wang, Hong Chen, Zihan Song, and Wenwu Zhu. Vtimellm: Empower llm to grasp video moments. In IEEE Conf. Comput. Vis. Pattern Recog., pages 14271–14280, 2024.
Jin et al. [2024] Peng Jin, Ryuichi Takanobu, Wancai Zhang, Xiaochun Cao, and Li Yuan. Chat-univi: Unified visual representation empowers large language models with image and video understanding. In IEEE Conf. Comput. Vis. Pattern Recog., pages 13700–13710, 2024.
Li et al. [2023a] Junnan Li, Dongxu Li, Silvio Savarese, and Steven Hoi. Blip-2: Bootstrapping language-image pre-training with frozen image encoders and large language models. In Int. Conf. Mach. Learn., pages 19730–19742. PMLR, 2023a.
Li et al. [2023b] Kunchang Li, Yinan He, Yi Wang, Yizhuo Li, Wenhai Wang, Ping Luo, Yali Wang, Limin Wang, and Yu Qiao. Videochat: Chat-centric video understanding. In IEEE Conf. Comput. Vis. Pattern Recog., pages 1–8, 2023b.
Li et al. [2024a] Yanwei Li, Chengyao Wang, and Jiaya Jia. Llama-vid: An image is worth 2 tokens in large language models. In Eur. Conf. Comput. Vis., pages 1–14, 2024a.
Li et al. [2024b] Zhaowei Li, Qi Xu, Dong Zhang, Hang Song, Yiqing Cai, Qi Qi, Ran Zhou, Junting Pan, Zefeng Li, Vu Tu, et al. Groundinggpt: Language enhanced multi-modal grounding model. In Association for Computational Linguistics, pages 6657–6678, 2024b.
Lin et al. [2023] Bin Lin, Bin Zhu, Yang Ye, Munan Ning, Peng Jin, and Li Yuan. Video-llava: Learning united visual representation by alignment before projection. In Conf. on Empirical Methods in Nat. Lang. Process., pages 1–10, 2023.
Maaz et al. [2024] Muhammad Maaz, Hanoona Rasheed, Salman Khan, and Fahad Khan. Video-chatgpt: Towards detailed video understanding via large vision and language models. In Association for Computational Linguistics, pages 12585–12602, 2024.
Piergiovanni et al. [2024] AJ Piergiovanni, Isaac Noble, Dahun Kim, Michael S Ryoo, Victor Gomes, and Anelia Angelova. Mirasol3b: A multimodal autoregressive model for time-aligned and contextual modalities. In IEEE Conf. Comput. Vis. Pattern Recog., pages 26804–26814, 2024.
Ren et al. [2024] Shuhuai Ren, Linli Yao, Shicheng Li, Xu Sun, and Lu Hou. Timechat: A time-sensitive multimodal large language model for long video understanding. In IEEE Conf. Comput. Vis. Pattern Recog., pages 14313–14323, 2024.
Song et al. [2024] Enxin Song, Wenhao Chai, Guanhong Wang, Yucheng Zhang, Haoyang Zhou, Feiyang Wu, Haozhe Chi, Xun Guo, Tian Ye, Yanting Zhang, et al. Moviechat: From dense token to sparse memory for long video understanding. In IEEE Conf. Comput. Vis. Pattern Recog., pages 18221–18232, 2024.
Team et al. [2024] Gemma Team, Thomas Mesnard, Cassidy Hardin, Robert Dadashi, Surya Bhupatiraju, Shreya Pathak, Laurent Sifre, Morgane Rivière, Mihir Sanjay Kale, Juliette Love, et al. Gemma: Open models based on gemini research and technology. arXiv preprint arXiv:2403.08295, pages 1–20, 2024.
Touvron et al. [2023] Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, et al. Llama 2: Open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288, pages 1–20, 2023.
Vaswani [2017] A Vaswani. Attention is all you need. In Adv. Neural Inform. Process. Syst., pages 1–10, 2017.
Xu et al. [2016] Jun Xu, Tao Mei, Ting Yao, and Yong Rui. Msr-vtt: A large video description dataset for bridging video and language. In IEEE Conf. Comput. Vis. Pattern Recog., pages 5288–5296, 2016.
Zhai et al. [2023] Xiaohua Zhai, Basil Mustafa, Alexander Kolesnikov, and Lucas Beyer. Sigmoid loss for language image pre-training. In Int. Conf. Comput. Vis., pages 11975–11986, 2023.
Zhang et al. [2023] Hang Zhang, Xin Li, and Lidong Bing. Video-llama: An instruction-tuned audio-visual language model for video understanding. In Conf. on Empirical Methods in Nat. Lang. Process., pages 543–553, 2023.
Zhang et al. [2024] Renrui Zhang, Jiaming Han, Chris Liu, Aojun Zhou, Pan Lu, Yu Qiao, Hongsheng Li, and Peng Gao. Llama-adapter: Efficient fine-tuning of large language models with zero-initialized attention. In Int. Conf. Learn. Represent., pages 1–10, 2024.
Zhou et al. [2022] Kaiyang Zhou, Jingkang Yang, Chen Change Loy, and Ziwei Liu. Learning to prompt for vision-language models. Int. J. Comput. Vis., 130(9):2337–2348, 2022.
Zhu et al. [2024] Bin Zhu, Bin Lin, Munan Ning, Yang Yan, Jiaxi Cui, HongFa Wang, Yatian Pang, Wenhao Jiang, Junwu Zhang, Zongwei Li, et al. Languagebind: Extending video-language pretraining to n-modality by language-based semantic alignment. In Int. Conf. Learn. Represent., pages 1–10, 2024.