Question-Answering Dense Video Events

Hangyu Qin, Junbin Xiao (corresponding author), Angela Yao
Abstract

Multimodal Large Language Models (MLLMs) have shown excellent performance in question-answering of single-event videos. In this paper, we present question-answering dense video events, a novel task that requires answering and grounding dense-event questions in long videos, thus challenging MLLMs to faithfully comprehend and reason about multiple events occurring over extended time periods. To facilitate the study, we construct DeVE-QA – a dataset featuring 78K questions about 26K events on 10.6K long videos. We then benchmark and show that existing MLLMs excelling at single-event QA struggle to perform well in DeVE-QA. For improvement, we propose DeVi, a novel training-free MLLM approach that highlights a hierarchical captioning module, a temporal event memory module, and a self-consistency checking module to respectively detect, contextualize and memorize, and ground dense events in long videos for question answering. Extensive experiments show that DeVi is superior at answering dense-event questions and grounding relevant video moments. Compared with existing MLLMs, it achieves a remarkable increase of 4.1% and 3.7% for G(round)QA accuracy on DeVE-QA and NExT-GQA respectively. Our data and code will be released.

Introduction

Multimodal Large Language Models (MLLMs) (Alayrac et al. 2022; Li et al. 2023b; Maaz et al. 2023; Zhang, Li, and Bing 2023; Lin et al. 2023; Reid et al. 2024) have shown significant capability in question-answering of single-event videos (Xu et al. 2017; Jang et al. 2017), where the videos are short (3 to 20 seconds) and the QAs concern single, global events, e.g. “who did what”. Yet, real-world videos often come in long format and feature a complex overlay of dense events. Consider the 2-minute video taken from a motorcycle activity shown in Figure 1. A variety of questions can be asked about this video, each pertaining to an individual event but involving different participants and durations interspersed throughout the video. While separate, the events are still related to each other, e.g. as parts of a motorcycle stunt performance.

The inherent challenge of understanding such dense video events is thus to either isolate or agglomerate, as needed, relevant video content and generate relevant event responses. While part of the challenge is tackled in dense-event captioning (Krishna et al. 2017) (e.g., isolation and generation), holistic caption generation offers very limited insight into how well models reason about and understand dense video events, as MLLMs are prone to hallucination (Ma et al. 2023). Furthermore, evaluating captions is challenging, as the annotations are often subjective (Wang, Deng, and Jia 2024) and the generated captions come in diverse language formats (Vedantam, Lawrence Zitnick, and Parikh 2015). Alternatively, video question answering inherits all the challenges of dense event understanding. It also enables deterministic evaluation by multi-choice classification (Xiao et al. 2021; Mangalam, Akshulakov, and Malik 2024; Patraucean et al. 2024). As such, we propose question-answering of dense video events, a novel task that challenges MLLMs in comprehending and reasoning about the dense events occurring over long-lasting videos.

Figure 1: Example of DeVE-QA.

Specifically, given a video that carries multiple events and a question about a specific event in the video, question-answering dense video events requires MLLMs to relate the question to the relevant event and reason over the event to derive the correct answer. For comprehension, we require the models to localize the relevant video moments, disambiguate different video events to avoid conflicting answers, and thus substantiate the predictions with visual evidence. The task poses three particular challenges. First, each question pertains to a specific event over a specific time span (see Figure 1). The duration varies among events, so to precisely comprehend the questions, it is imperative to capture events spanning different time scales. Second, long-form videos pose a challenge in articulating the possibly distant contextual events needed to understand a particular questioned event. Finally, to promote faithful reasoning, a correct prediction requires both correct grounding and correct question answering. This asks for a strong capability of dense visual event understanding and conditioning, as opposed to exploiting common-sense knowledge in LLMs.

As there is no suitable benchmark for question-answering of dense video events, we construct DeVE-QA, a Dense Video Event QA dataset featuring 78K questions about 26K events on 10.6K videos. DeVE-QA is constructed by curating multi-choice questions from the dense-event caption annotations of ActivityNet-Caption (Krishna et al. 2017), specifically via prompting GPT-4 accompanied with rigorous manual checking and correction.

With DeVE-QA, we first benchmark the prominent MLLMs (Wang et al. 2023; Yu et al. 2024; Momeni et al. 2023; Surís, Menon, and Vondrick 2023; Zhang et al. 2023a; Kim et al. 2024) that perform well in popular VideoQA about single global events, but find that their performance drops significantly, especially on the DeVE-QA subsets that feature denser events and longer videos. This reflects the models’ severe deficiency in understanding long videos with dense events and in faithful reasoning for question answering. For improvement, we propose a training-free MLLM approach, DeVi. DeVi performs dense video-event QA by first detecting multiple events from the video and then reasoning over the events to achieve QA. To solve the aforementioned challenges, we incorporate three specific strategies: 1) hierarchical dense event captioning to detect the dense events at multiple temporal scales, 2) temporal event contextualizing and memorizing to capture long-term event dependency and to facilitate event-grounded QA, and 3) self-consistency checking to anchor or rectify the answers with regard to the grounded event moments.

We evaluate DeVi on DeVE-QA, and for better comparison, we also extend our experiments to the recent NExT-GQA (Xiao et al. 2024b). We achieve accuracy increases of 4.1% and 6.6% over the state-of-the-arts (SoTAs) on DeVE-QA for QA with and without grounding respectively. Also, we improve GQA accuracy on NExT-GQA by 3.7%. Further ablation experiments validate DeVi’s strength and its particular designs for dense-event and long-form video QA. Additionally, we share our investigation of other alternative implementations for DeVi, e.g. different MLLMs for captioners and QA models, and highlight the crucial importance of large models for success.

To summarize, our contributions are as follows:

  • We propose question answering dense video events to challenge MLLMs in comprehending and reasoning the dense events in long videos. Accordingly, we construct DeVE-QA dataset to facilitate the study.

  • We propose DeVi, a training-free MLLM approach that performs grounded question-answering on dense video events by highlighting three dedicated components of hierarchical dense-event captioning, event contextualizing and memorizing, and self-consistency checking.

  • We achieve new SoTA zero-shot results on both DeVE-QA and NExT-GQA, surpassing the previous SoTAs in grounded QA accuracy by 4.1% and 3.7%, respectively.

Related Works

Dense Event Video Understanding

Dense video event understanding has primarily focused on captioning (Krishna et al. 2017; Wang et al. 2018; Lin et al. 2022; Yang et al. 2023). However, optimizing for holistic sentence generation often results in over-fitting (Chen, Li, and Hu 2020) and object hallucination (Rohrbach et al. 2018). MLLMs, on the other hand, have shown strong capabilities for visual description (Li et al. 2023a; Liu et al. 2024; Maaz et al. 2023; Li et al. 2023b; Lin et al. 2023; Ren et al. 2023; Xu et al. 2024). Yet, the subjective caption annotations and the sub-effective sentence-matching metrics (e.g., BLEU (Papineni et al. 2002) and CIDEr (Vedantam, Lawrence Zitnick, and Parikh 2015)) make it challenging to evaluate these models, especially from a zero-shot perspective. Our work proposes to use question-answering as an alternative to evaluate the understanding and reasoning of dense video events.

Video Question Answering

VideoQA works center on single-event videos; this is reflected in the popular benchmarks, such as TGIF-QA (Jang et al. 2017), MSRVTT-QA and MSVD-QA (Xu et al. 2017), ActivityNet-QA (Yu et al. 2019) and iVQA (Yang et al. 2021), and related techniques (Dai et al. 2023; Maaz et al. 2023; Zhang et al. 2023b; Wang et al. 2023; Li, Wang, and Jia 2023). The video clips in these benchmarks tend to be short, or the questions relate to global events spanning the entire clips. NExT-QA (Xiao et al. 2021) advances somewhat by addressing multiple action relations in relatively longer clips. The videos, however, focus on daily life actions and lack complexity in multi-event understanding. We also note that some techniques claim to address event VideoQA (Yin et al. 2023; Liu, Li, and Lin 2023; Bai, Wang, and Chen 2024), but the events essentially refer to actions alone or to a single global event of a short video. Compared with these works, our work distinguishes itself by studying multi-event comprehension and reasoning across long videos, where an event refers to a complete combination of subjects, actions, objects, time, etc. (Krishna et al. 2017).

MLLMs for VideoQA

Most existing Video-LLMs are designed for short-video understanding (Xiao et al. 2024a). This includes the instruction-tuned models such as Video-ChatGPT (Maaz et al. 2023), Video-LLaMA (Zhang, Li, and Bing 2023), Video-LLaVA (Lin et al. 2023), VideoChat (Li et al. 2023b, c) and PLLaVA (Xu et al. 2024), and target-finetuned models like SeViLA (Yu et al. 2024) and LLaMA-VQA (Ko et al. 2023). The short input (4 to 32 frames) restricts these models from handling long videos. Training-free approaches, such as ViperGPT (Surís, Menon, and Vondrick 2023) and LLoVi (Zhang et al. 2023a), handle long videos by traversing or dense-captioning the video. Traversal approaches cannot agglomerate multiple events interspersed at different times for joint reasoning. Therefore, we follow the Caption-then-QA pipeline of LLoVi (Zhang et al. 2023a). Yet, we incorporate dedicated modules for enhanced dense-event capturing and event-grounded QA.

DeVE-QA Dataset

We follow dense-event captioning (Krishna et al. 2017) to define an event as a complete description of a person’s (or a group’s) specific behavior within a specific time span, e.g., “A man is playing the piano at [10.2s, 34.5s]”. Therefore, we curate our dataset DeVE-QA from ActivityNet-Captions.

Dataset Construction

Given dense event captions, we derive question-answer sets by prompting GPT-4 (OpenAI 2024a) followed by human checking and correction. Specifically, the construction process has three major stages (see Figure 2). In the first stage, we prompt GPT-4 to generate different types of question-answer pairs (QAs) for each individual event, using videos with clear and long event captions (i.e., no pronouns and longer than 10 words). This encourages understanding the event from multiple aspects, e.g. through an implicit pattern of “who did what at where and when, why and how” implied by the generation prompts. In the second stage, we retrieve distractor answers to form multiple choices for each question, which facilitates deterministic evaluation. The distractor answers are taken from the answers of the most similar questions. Additionally, we incorporate measures to limit potential bias from the candidate answers as far as possible, such as adding distractor answers related to different events in the same video. We also filter the QAs to remove meaningless questions and analyze the key activities inside the videos. The third stage is manual checking and correction to ensure QA quality. We specifically correct 1) wrong QA pairs, 2) redundant questions, and 3) distractor answers that are potentially correct. Finally, we obtain around 78K questions. We present an example in Figure 1. Other details along with the QAs are attached in Supplementary.
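The distractor retrieval of the second stage can be pictured with a minimal sketch. The similarity measure is not specified here, so the token-overlap (Jaccard) score, the function names, and the toy QA pool below are illustrative assumptions rather than the actual construction code; the sketch also leaves out the same-video distractors and the manual checks.

```python
def jaccard(a: str, b: str) -> float:
    """Token-overlap similarity between two questions (an assumed stand-in measure)."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta & tb) / len(ta | tb) if ta | tb else 0.0


def retrieve_distractors(question, answer, qa_pool, k=4):
    """Pick k distractors from the answers of the most similar other questions."""
    ranked = sorted(qa_pool, key=lambda qa: jaccard(question, qa[0]), reverse=True)
    distractors = []
    for other_q, other_a in ranked:
        # Skip the question itself, the correct answer, and repeated answers.
        if other_q == question or other_a == answer or other_a in distractors:
            continue
        distractors.append(other_a)
        if len(distractors) == k:
            break
    return distractors


# Hypothetical toy pool of (question, answer) pairs.
pool = [
    ("what does the man do after entering the field", "rides a motorcycle"),
    ("what does the man do before the jump", "revs the engine"),
    ("what does the woman do after the show", "waves to the crowd"),
    ("why does the crowd cheer", "the stunt succeeded"),
    ("how does the biker land", "on the rear wheel"),
]
choices = retrieve_distractors(pool[0][0], pool[0][1], pool, k=4)
```

With k = 4, each question ends up with five candidate answers once the correct answer is added back.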

Figure 2: DeVE-QA construction pipeline.

Statistics and Analysis

| Split | # Vid. | # Que. | Avg. QLen | Seg. Dur. (s) | Vid. Dur. (s) | Ratio (S./V.) |
|---|---|---|---|---|---|---|
| Train | 7,179 | 53,361 | 10.70 | 38.68 | 127.32 | 0.32 |
| Test | 3,464 | 24,963 | 10.71 | 40.98 | 125.03 | 0.34 |

Table 1: Statistics of DeVE-QA. Ratio (S./V.): Average length of segments w.r.t. the entire video.
Figure 3: DeVE-QA analysis. (a) Questions based on first two words. (b) Certificate length of VideoQA datasets.
Figure 4: DeVi pipeline: (1) Hierarchical dense event video segmenting and captioning, (2) contextualizing and memorizing events in temporal event memory, and (3) event-grounded video question answering with self-consistency checking.

| Dataset | D.E. | Vid. Dur. (s) | # QAs | Seg. Len (s) |
|---|---|---|---|---|
| MSRVTT-QA (Xu et al. 2017) | | 15 | 243K | |
| MSVD-QA (Xu et al. 2017) | | 10 | 50K | |
| TGIF-QA (Jang et al. 2017) | | 3 | 139K | |
| ActivityNet-QA (Yu et al. 2019) | | 118 | 58K | |
| NExT-QA (Xiao et al. 2021) | | 44 | 52K | |
| TVQA (Lei et al. 2018) | | 76 | 152K | 11.2 |
| NExT-GQA (Xiao et al. 2024b) | | 42 | 43K | 7.0 |
| DeVE-QA (ours) | | 127 | 78K | 39.4 |

Table 2: Dataset comparison. D.E.: dense event.

DeVE-QA is the first benchmark dataset that supports question-answering of dense events in long videos. Table 1 shows detailed statistics of our DeVE-QA dataset. It comprises 10.6k (7.2k training / 3.5k testing) videos and 78.3k (53.3k training / 25k testing) questions. The average video length is 127s, and many videos (more than 580) range from 4 to 10 minutes. The average number of questions per video is 7.5, and the average number of events per video is 2.6 (vs. 1 for most other benchmarks). A comparison between these two suggests that an average of 2.5 questions are posed about an individual event. Figure 3 shows the distribution of question types; questions are not only about “what is done” but also extend to “how” and “why” to target a more comprehensive understanding of events. Note that the “when” questions are implicit in the requirement of temporal grounding. Also, we keep “who” and “where” questions to a low percentage of the dataset, as they can be answered well without video-level understanding (Xu et al. 2017; Lei et al. 2018). Other analyses are presented in Supplementary.

Comparison with Existing Benchmarks

Table 2 compares DeVE-QA with existing VideoQA datasets. First and foremost, DeVE-QA targets dense-event and long-form VideoQA and enables temporal grounding evaluation. These requirements set it apart from all existing datasets, which focus on global video events (e.g., all datasets in the 1st block except for NExT-QA) and short videos (e.g., the top-3 datasets listed in Table 2). Compared with other temporal grounding datasets such as TVQA (Lei et al. 2018) and NExT-GQA (Xiao et al. 2024b), DeVE-QA has longer videos and segments, which shapes its challenge for event-level QA. For example, Figure 3 shows that the temporal certificate length (average length of video segments needed to answer a question (Mangalam, Akshulakov, and Malik 2024)) of DeVE-QA is 5.5× that of NExT-GQA (Xiao et al. 2024b). In addition, TVQA pays attention to simple visual recognition of “what is” in TV shows. Its temporal grounding is biased towards localizing the subtitles invoked in the QAs.

DeVi Solution

Overview

Formally, given a $T$-second video $v$ containing a collection of events $E=\{e_1,e_2,\cdots,e_n\}$, a question $q$ along with candidate answer set $C=\{c_1,\cdots,c_5\}$, dense video-event QA is to predict a correct answer $\hat{c}\in C$ and the relevant event moment $\hat{t}=\{t_s,t_e\}$ where $t_s\leq t_e\leq T$. Our solution is conceptually as follows:

$$\hat{c},\hat{t}=\psi(c,t|E,q,C)\,\phi(E|v), \tag{1}$$

where $\phi$ and $\psi$ denote the models for dense event detection and event-conditioned QA respectively. Note that the time stamps $t$ come along with the detected events $E$.

We realize the objective defined in Eqn. (1) as follows. First, to achieve dense video event detection $\phi(E|v)$, we incorporate a hierarchical dense captioning mechanism into MLLMs to detect the video events at multiple time scales. Then, we design a temporal event memory module that captures the long-term event dependency to contextualize and also memorize the individually detected video events $E$. Finally, to achieve event-grounded QA $\psi(c,t|E,q,C)$, we read the contextualized events $E$ from the memory, and feed them to LLMs along with the QAs (question $q$ and candidate answers $C$) to determine the correct answers and the corresponding event moments. In this process, we highlight a self-consistency checking mechanism to ensure the right answer for the right event. An overview of our solution is illustrated in Figure 4.

Hierarchical Dense Event Captioning

Dense events within videos are often intertwined and vary in duration. To successfully detect these events, we apply powerful MLLMs (e.g., Video-LLaVA (Lin et al. 2023)) at multiple scales and levels of a temporal hierarchy. Specifically, we build an $H$-level hierarchy and detect events by captioning video segments of different lengths at different levels. Our captioning starts from the bottom level with short video segments $V_s=\{v^{L_s}_k\}_{k=1}^{N_s}$, which is achieved by sending $V_s$ to MLLMs and prompting the MLLMs to describe the video segments. The corresponding events are denoted as $E_s=\{e^{L_s}_k\}_{k=1}^{N_s}$, where $N_s$ and $L_s$ are the number and length of short video segments respectively. A specific event $e_k$ is given by its text description along with the corresponding start $t_s$ and end $t_e$ time stamps. Similarly, we caption the medium and long video segments at the middle and top levels, and obtain the respective events $E_m=\{e^{L_m}_k\}_{k=1}^{N_m}$ and $E_l=\{e^{L_l}_k\}_{k=1}^{N_l}$. Note that $L_s<L_m<L_l\leq T$. Eventually, we obtain a collection of events $E=\{E_s,E_m,E_l\}$ for each video. Specific prompts are presented in Supplementary.
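As a rough illustration of the hierarchy, the sketch below splits a video into windows at three scales and captions each window. It assumes consecutive, non-overlapping windows (the windowing scheme is not stated here), uses the DeVE-QA segment lengths of 10s, 35s and 65s from the experimental configuration, and replaces the MLLM captioner with a placeholder; all function names are hypothetical.

```python
def split_segments(duration: float, seg_len: float):
    """Split [0, duration] into consecutive windows of at most seg_len seconds."""
    segments, start = [], 0.0
    while start < duration:
        end = min(start + seg_len, duration)
        segments.append((start, end))
        start = end
    return segments


def build_hierarchy(duration: float, seg_lens=(10.0, 35.0, 65.0)):
    """Return one list of (t_s, t_e) windows per hierarchy level, shortest first."""
    return [split_segments(duration, min(length, duration)) for length in seg_lens]


def caption_hierarchy(duration, captioner, seg_lens=(10.0, 35.0, 65.0)):
    """Caption every window; each event is stored as (caption, t_s, t_e)."""
    events = []
    for level in build_hierarchy(duration, seg_lens):
        events.append([(captioner(t_s, t_e), t_s, t_e) for t_s, t_e in level])
    return events


def dummy_captioner(t_s, t_e):
    """Placeholder standing in for an MLLM such as Video-LLaVA."""
    return f"event in [{t_s:.0f}s, {t_e:.0f}s]"


# A 120-second video yields 12 short, 4 medium and 2 long segments.
E_s, E_m, E_l = caption_hierarchy(120.0, dummy_captioner)
```

The last window at each level is simply truncated at the video end, so it may be shorter than the nominal segment length.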

Temporal Event Memory

The above events are independently detected by focusing on individual local video segments. The lack of contextual information often results in inaccurate or incomplete event captions. While the hierarchical captioning strategy helps alleviate the issue, it cannot model the long-term temporal event dependency. For example, in the video shown in Figure 1, we may have captured the event of “a man enters the field” at the beginning and “a biker is performing” in the middle of the video. However, we cannot answer questions such as why the man enters the field and who (man or woman) the biker is based on the individual event captions. By capturing temporal dependency, we aim to modify the events to be “a man enters the field for biking performance” and “a male biker is performing” to facilitate QA.

Thus, to capture the long-term event dependency, we design an event memory module to contextualize the event captions while also caching the original visual and event representations. To be specific, we achieve this by prompting LLMs (e.g., GPT-4o (OpenAI 2024b)) to refine each caption with a prompt like “… given a set of event captions {E} and a question {q} of a video, you are required to refine each caption by incorporating contextual information from all the other captions and question via analyzing the overall narratives, identifying relevant context and incorporate context with coherence…”. We also curate examples for in-context learning before the actual generation. Additionally, we prompt GPT-4o to summarize all events into a synopsis $e_y$ which serves as a global event for the entire video. Consequently, we obtain $E'=\{E_s',E_m',E_l',e_y\}$, in which the events at each level are enhanced with long-range temporal dependency. More details are in Supplementary.
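A minimal sketch of the contextualization step is given below. The prompt wording is a shortened paraphrase of the quoted fragment, not the full prompt from the Supplementary, and `build_context_prompt`, `contextualize`, and the stubbed `llm` callable are hypothetical names; in-context examples and the synopsis are left out.

```python
def build_context_prompt(events, question):
    """Format timestamped captions and the question into a refinement prompt."""
    lines = [f"[{t_s:.1f}s, {t_e:.1f}s] {caption}" for caption, t_s, t_e in events]
    return (
        "Given a set of event captions and a question of a video, refine each "
        "caption by incorporating contextual information from all the other "
        "captions and the question.\n"
        "Event captions:\n" + "\n".join(lines) + "\n"
        f"Question: {question}\n"
        "Return one refined caption per line, in the same order."
    )


def contextualize(events, question, llm):
    """Ask the LLM for refined captions and re-attach the original time stamps."""
    reply = llm(build_context_prompt(events, question))
    refined = [line.strip() for line in reply.splitlines() if line.strip()]
    if len(refined) != len(events):
        # Fall back to the original captions if the reply cannot be aligned.
        return list(events)
    return [(text, t_s, t_e) for text, (_, t_s, t_e) in zip(refined, events)]


# Stubbed LLM reply, reusing the motorcycle example from the text.
events = [("a man enters the field", 0.0, 10.0), ("a biker is performing", 50.0, 60.0)]
stub_llm = lambda prompt: (
    "a man enters the field for biking performance\na male biker is performing"
)
memory = contextualize(events, "why does the man enter the field?", stub_llm)
```

Keeping the time stamps outside the LLM reply means the refinement can only change the text of an event, never its temporal span.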

Generally, by transferring the video into different representations (visual features, hierarchical captions and synopsis), this module links the contextual events from long-ranged time periods to aid in answering questions and grounding results about specific events.

Event-Grounded QA

Intuitively, we can read the events $E'$ from the event memory and feed them to LLMs (e.g., GPT-4o) along with the QAs to accomplish answer prediction and moment localization. This can be achieved by a prompt like “… select a correct answer from {C} to the question {q} based on the events {E} and also output the time span $[t_s,t_e]$ of the event that carries the correct answer …”. This method is straightforward, but we find that the performance is not as good as expected. There is a large discrepancy: the LLM often gives the correct answer with a wrong time span, or vice versa. For improvement, we establish a mechanism to check for consistency between a predicted answer and the corresponding time span.

We evaluate consistency based on the cosine similarity $R$ between the answer $a$ and the video content within time span $[t_s,t_e]$:

$$R_{va}=\cos(f_v,f_a)=\frac{f_v\cdot f_a}{\|f_v\|\,\|f_a\|}, \tag{2}$$

where $f_a$ and $f_v$ are encodings of the answer text and video segment using CLIP (Radford et al. 2021). Predictions with low consistency (i.e., small $R_{va}$) are fed back to the LLM so that it can adjust its predictions. This process iterates until the predictions reach a consistency higher than a threshold $\sigma$ or the predefined maximal iteration number $\delta$ is reached. More details are presented in the Supplementary.
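The checking loop can be sketched as follows. This is an illustrative outline only: `predict`, `encode_text` and `encode_video` are hypothetical callables standing in for the LLM and the CLIP encoders, the feedback message is invented, and treating δ as the total number of prediction rounds is an assumption. The defaults σ = 0.6 and δ = 2 follow the experimental configuration.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors, as in Eqn. (2)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def grounded_qa(predict, encode_text, encode_video, sigma=0.6, delta=2):
    """Re-query the LLM until answer and span agree, or delta rounds are used up."""
    feedback = None
    answer, span, score = None, None, 0.0
    for _ in range(delta):
        answer, span = predict(feedback)
        score = cosine(encode_video(span), encode_text(answer))
        if score >= sigma:
            break
        # Low consistency: tell the LLM what was inconsistent and ask again.
        feedback = f"Answer '{answer}' is inconsistent with span {span} (R={score:.2f}); revise."
    return answer, span, score


# Toy stand-ins: the first span does not match the answer, the second does.
replies = iter([("rides a bike", (0.0, 10.0)), ("rides a bike", (50.0, 60.0))])
video_feats = {(0.0, 10.0): [1.0, 0.0], (50.0, 60.0): [0.0, 2.0]}
answer, span, score = grounded_qa(
    predict=lambda feedback: next(replies),
    encode_text=lambda a: [0.0, 1.0],
    encode_video=lambda s: video_feats[s],
)
```

In this toy run the first prediction scores 0 and is rejected, and the second scores 1 and is accepted, so the span is corrected while the answer stays the same.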

Experiments

Configuration and Evaluation

Our experiments are conducted on the test set of DeVE-QA. Additionally, we extend our experiments to NExT-GQA (Xiao et al. 2024b). NExT-GQA supports research on grounded QA about multiple actions, though not on event grounding. It contains 990 videos and 5,553 questions for testing. For hierarchical event captioning, the number of hierarchy levels $H$ is set to 3, and the segment lengths $L_s$, $L_m$ and $L_l$ are set to {10s, 35s, 65s} for DeVE-QA and {5s, 15s, 45s} for NExT-GQA, respectively. For self-consistency checking, the similarity threshold $\sigma$ is set to 0.6 based on the implementation analysis in Table 5(d), and the maximal iteration number $\delta$ is set to 2 for efficiency. The thresholds are empirically determined according to the QA accuracy. For evaluation, we follow NExT-GQA (Xiao et al. 2024b) to report QA accuracy Acc@QA, grounding quality Intersection over Prediction (IoP) and Intersection over Union (IoU), as well as grounded QA accuracy Acc@GQA, all in percentages (%).
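For reference, the metrics can be computed along the following lines. This is a sketch, assuming one ground-truth span per question and assuming, as we understand the NExT-GQA protocol, that a question counts toward Acc@GQA when it is answered correctly and grounded with IoP ≥ 0.5; the sample dictionary keys are hypothetical.

```python
def overlap(pred, gt):
    """Length of the temporal intersection between two (start, end) spans."""
    return max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))


def iop(pred, gt):
    """Intersection over Prediction."""
    length = pred[1] - pred[0]
    return overlap(pred, gt) / length if length > 0 else 0.0


def iou(pred, gt):
    """Intersection over Union."""
    inter = overlap(pred, gt)
    union = (pred[1] - pred[0]) + (gt[1] - gt[0]) - inter
    return inter / union if union > 0 else 0.0


def evaluate(samples, thr=0.5):
    """samples: dicts with keys pred_ans, gt_ans, pred_span, gt_span."""
    n = len(samples)
    correct = [s["pred_ans"] == s["gt_ans"] for s in samples]
    iops = [iop(s["pred_span"], s["gt_span"]) for s in samples]
    ious = [iou(s["pred_span"], s["gt_span"]) for s in samples]
    return {
        "Acc@QA": 100 * sum(correct) / n,
        "mIoP": 100 * sum(iops) / n,
        "IoP@0.5": 100 * sum(v >= thr for v in iops) / n,
        "mIoU": 100 * sum(ious) / n,
        "IoU@0.5": 100 * sum(v >= thr for v in ious) / n,
        # Correct answer AND sufficiently grounded prediction.
        "Acc@GQA": 100 * sum(c and v >= thr for c, v in zip(correct, iops)) / n,
    }


samples = [
    {"pred_ans": "A", "gt_ans": "A", "pred_span": (10.0, 20.0), "gt_span": (10.0, 30.0)},
    {"pred_ans": "B", "gt_ans": "A", "pred_span": (0.0, 10.0), "gt_span": (5.0, 10.0)},
    {"pred_ans": "A", "gt_ans": "A", "pred_span": (0.0, 40.0), "gt_span": (30.0, 40.0)},
    {"pred_ans": "C", "gt_ans": "C", "pred_span": (0.0, 10.0), "gt_span": (20.0, 30.0)},
]
metrics = evaluate(samples)
```

In this toy set three of four answers are correct, but only the first sample is both correct and well grounded, which is why Acc@GQA is much lower than Acc@QA.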

Performance Analysis

| Model | Acc@QA | Model | Acc@QA |
|---|---|---|---|
| Video-LLaMA | 41.2 | VideoChat2 | 58.7 |
| InternVideo | 48.3 | SeViLA | 61.2 |
| VFC | 49.5 | GPT-4o | 62.6 |
| ViperGPT | 55.1 | PLLaVA 13B | 63.7 |
| Video-LLaVA | 56.2 | LLoVi | 63.8 |
| LLaMA-Adapter (f/t) | 58.3 | IG-VLM | 64.2 |
| - | - | DeVi (ours) | 70.8 |

Table 3: Zero-shot VideoQA results on DeVE-QA. Only LLaMA-Adapter (f/t) is fine-tuned.

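| Setting | Model | mIoP | IoP@0.5 | mIoU | IoU@0.5 | Acc@QA | Acc@GQA |
|---|---|---|---|---|---|---|---|
| Weakly-supervised | FrozenBiLM (NG+) | 21.2 | 18.2 | 8.5 | 6.2 | 61.6 | 14.5 |
| Weakly-supervised | Temp[CLIP] (NG+) | 24.6 | 24.8 | 12.5 | 9.1 | 58.9 | 14.9 |
| Weakly-supervised | SeViLA* | 25.8 | 19.9 | 21.2 | 11.5 | 62.7 | 16.1 |
| Zero-shot | LLoVi | 27.5 | 27.0 | 17.9 | 12.9 | 63.9 | 22.8 |
| Zero-shot | DeVi (ours) | 33.8 | 32.2 | 20.7 | 17.4 | 70.9 | 26.9 |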

Table 4: Grounded VideoQA results on DeVE-QA. *: pre-trained on video-language grounding datasets.

We first adapt the prominent MLLMs (e.g., Video-LLaMA (Zhang, Li, and Bing 2023), InternVideo (Wang et al. 2023), VFC (Momeni et al. 2023), etc) that perform well on “single event” QA to DeVE-QA and compare them with DeVi. The models (except for LLaMA-Adapter (Zhang et al. 2023b)) are directly prompted for zero-shot VideoQA. We specify the adaptation in Supplementary. Most of these methods do not perform grounding, so we compare Acc@QA.

Table 3 shows that DeVi, with an accuracy of 70.8%, outperforms the second-best model IG-VLM (Kim et al. 2024) significantly by 6.6%. Moreover, DeVi surpasses a naive use of GPT-4o (feeding multiple video frames and prompting GPT-4o for question answering) by 8.2% and a naive dense-caption based QA method, LLoVi (Zhang et al. 2023a), by 7.0%. We also find that all other end-to-end MLLMs such as Video-LLaMA, Video-LLaVA and VideoChat2 perform worse than DeVi by 10% to 30%. The results demonstrate that DeVi addresses the challenges of question-answering on dense video events considerably better than general MLLMs.

Table 4 presents grounded QA accuracy, comparing with methods from (Xiao et al. 2024b). DeVi surpasses state-of-the-art zero-shot method LLoVi by 4.1% on Acc@GQA. Furthermore, improvements come from both better QA (+7.0% Acc@QA) and better grounding (+5.2% IoP@0.5). This differs from the previous methods, where improvements are primarily from either better grounding or better QA alone (also see Table 5 on NExT-GQA).

Table 5 shows that DeVi consistently achieves superior performance on NExT-GQA, outperforming the second-best method LLoVi by 3.7% on Acc@GQA. The results demonstrate DeVi’s superiority in multi-action video understanding aside from dense-event video understanding.

Ablation Study

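| Setting | Model | mIoP | IoP@0.5 | mIoU | IoU@0.5 | Acc@QA | Acc@GQA |
|---|---|---|---|---|---|---|---|
| Weakly-supervised | Temp[CLIP] (NG+) | 25.7 | 25.5 | 12.6 | 8.9 | 60.2 | 15.9 |
| Weakly-supervised | FrozenBiLM (NG+) | 24.2 | 23.7 | 9.5 | 6.1 | 70.8 | 17.5 |
| Weakly-supervised | SeViLA* | 29.5 | 22.9 | 21.7 | 13.8 | 68.1 | 16.6 |
| Zero-shot | LLoVi | 37.3 | 36.9 | 20.0 | 15.3 | 66.8 | 24.3 |
| Zero-shot | DeVi | 39.3 | 37.9 | 22.3 | 17.4 | 71.6 | 28.0 |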

Table 5: Grounded VideoQA results on NExT-GQA. *: pre-trained on video-language grounding datasets.

Model | Acc@QA | Acc@GQA
DeVi | 70.8 | 26.9
w/o Hierarchical Dense Captioning | 66.9 | 23.3
w/o Temporal Contextualizing | 68.8 | 25.3
w/o Consistency Checking | 66.3 | 21.7

Table 6: Major model ablation on DeVE-QA. We ablate the components by removing one at a time.

We first ablate the three major designs of DeVi on DeVE-QA. Table 6 shows that all three components contribute significantly to DeVi's success. Specifically, substituting the hierarchical event captioning with the normal dense video captioning used in LLoVi (Zhang et al. 2023a) reduces QA and GQA accuracy by 3.9% and 3.6%, respectively. Moreover, the comparison in Table 7 shows that without the hierarchical event captioning strategy, DeVi's performance drops more on dense events (e.g., -4.4% on QA) than on single and double events (e.g., -1.6% and -3.4% on QA). We speculate that this reflects its ability to capture specific information at different scales in multiple, complicated events. Next, we remove the temporal event contextualization module. The results again degrade, by 2.0% on QA and 1.6% on GQA. This is understandable, as contextualization rectifies the potential misunderstanding and incompleteness that can arise when segments are captioned in isolation. Moreover, the ablation results in Table 8 (e.g., -2.3% on long video GQA vs. -1.1% on short video GQA) also support its benefit on longer videos. Finally, we remove the self-consistency checking module and directly prompt the LLM for final predictions. We find that QA and especially GQA accuracy degrade significantly (by 4.5% and 5.2%, respectively), suggesting that a large number of predictions merely "guess" the answer and provide irrelevant video segments. Naturally, such predictions cannot be found and corrected without the self-consistency checking process.
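The per-component drops quoted above follow directly from Table 6. As a small worked check, the snippet below recomputes them from the table values; only the variable and function names are our own.

```python
# Scores copied from Table 6 (DeVE-QA), stored as (Acc@QA, Acc@GQA).
FULL = (70.8, 26.9)
ABLATIONS = {
    "w/o Hierarchical Dense Captioning": (66.9, 23.3),
    "w/o Temporal Contextualizing": (68.8, 25.3),
    "w/o Consistency Checking": (66.3, 21.7),
}


def drops(full, ablated):
    """Absolute accuracy drop per metric, rounded to one decimal like the table."""
    return tuple(round(f - a, 1) for f, a in zip(full, ablated))


for name, scores in ABLATIONS.items():
    qa_drop, gqa_drop = drops(FULL, scores)
    print(f"{name}: -{qa_drop} Acc@QA, -{gqa_drop} Acc@GQA")
```

Removing consistency checking yields the largest drop on both metrics, and the gap is wider on Acc@GQA than on Acc@QA.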

To better illustrate the advantage of DeVi, we present an example from DeVE-QA in Figure 6. The comparison of QA and grounding results between different models demonstrates the efficacy of DeVi, as well as of our designs of self-consistency checking (in temporal grounding) and hierarchical dense captioning (in dense-event QA). Specifically, self-consistency checking is effective in correcting wrongly grounded segments. Hierarchical dense captioning is helpful for event-grounded QA. Temporal contextualizing helps improve both QA and grounding.

Metrics | Model | Event Density: Single | Double | Dense | Total
Acc@QA | FrozenBiLM(NG+) | 62.1 | 61.8 | 59.2 | 61.6
Acc@QA | SeViLA | 63.3 | 62.9 | 61.7 | 62.7
Acc@QA | LLoVi | 65.2 | 65.8 | 61.2 | 63.9
Acc@QA | DeVi w/o HDC | 66.2 | 65.5 | 67.1 | 66.9
Acc@QA | DeVi | 67.8 | 68.9 | 71.5 | 70.8
Acc@GQA | FrozenBiLM(NG+) | 15.1 | 15.0 | 13.9 | 14.5
Acc@GQA | SeViLA | 15.9 | 16.1 | 16.2 | 16.1
Acc@GQA | LLoVi | 24.1 | 22.6 | 21.1 | 22.8
Acc@GQA | DeVi w/o HDC | 23.5 | 23.3 | 24.2 | 23.3
Acc@GQA | DeVi | 25.5 | 26.4 | 28.2 | 26.9

Table 7: Results w.r.t. different event densities. Single/Double/Dense-Event: 1/2/more than 2 main event(s) is/are present in the related videos. 200 videos are selected for each event-density level, respectively. HDC: Hierarchical dense captioning.
Figure 5: Analysis of DeVi. (a) Hierarchy layers analysis. (b) Video hierarchical segment length analysis. (c) MLLM reasoning backbone analysis on DeVE-QA. (d) QA and GQA accuracy w.r.t. cross-modal similarity threshold $\sigma$.
Figure 6: Prediction visualization on DeVE-QA. Baseline models like SeViLA and Temp[CLIP] tend to answer the question without truly grounding it to related video segments. Hierarchical Dense Captioning (HDC) is useful to improve QA. Temporal Contextualizing (TC) helps improve GQA. Self-consistency checking (SC) is effective in correcting wrongly grounded segments.

Metrics | Model | Video Length: Short | Medium | Long | Total
Acc@QA | SeViLA | 64.2 | 62.4 | 60.6 | 62.7
Acc@QA | LLoVi | 66.0 | 64.1 | 62.8 | 63.9
Acc@QA | DeVi w/o TC | 68.9 | 68.8 | 68.8 | 68.8
Acc@QA | DeVi | 70.1 | 70.8 | 71.7 | 70.8
Acc@GQA | SeViLA | 18.4 | 16.2 | 14.9 | 16.1
Acc@GQA | LLoVi | 24.7 | 22.4 | 21.1 | 22.8
Acc@GQA | DeVi w/o TC | 25.4 | 25.5 | 25.2 | 25.3
Acc@GQA | DeVi | 25.5 | 26.8 | 27.5 | 26.9

Table 8: Results w.r.t. different video lengths. Short/Medium/Long: videos that are 0-60/60-120/more than 120 seconds. 200 videos are selected for each video-length level, respectively. TC: Temporal Contextualizing.

To better dissect the models' behavior in answering questions about videos with different event densities and lengths, we conduct additional evaluations on video subsets with different event numbers and lengths in Tables 7 and 8, respectively. Table 7 delivers an interesting finding: the accuracy of existing MLLMs decreases as event density increases, whereas DeVi's accuracy increases. This demonstrates DeVi's strength in coping with dense-event videos. We also analyze performance on videos of different lengths to better justify DeVi's long-range temporal ability, as shown in Table 8. DeVi's performance improves as videos become longer, while that of the baseline models visibly decreases. This shows DeVi's proficiency in handling lengthy videos. Additionally, the results in Tables 7 and 8 highlight the importance of hierarchical dense event captioning and temporal event contextualizing for handling dense events and long videos, respectively.

Implementation Investigation

Caption Model | Acc@QA | Acc@GQA
VideoBLIP | 62.1 | 22.0
VideoBLIP w HDC | 64.2 | 23.9
Video-LLaVA | 68.9 | 25.6
Video-LLaVA w HDC | 70.8 | 26.9

Table 9: Captioner ablation.

Dense Video Event Captioner. Table 9 shows that substituting Video-LLaVA with VideoBLIP reduces accuracy by nearly 4% and 7% for QA with and without grounding, respectively. We speculate that, apart from the larger size of Video-LLaVA, its unified mapping mechanism for visual and textual features allows for better understanding of visual context. In addition, its comprehensive pretraining strategy brings robustness when analyzing videos from different domains, thus resulting in more accurate caption generation.

We then analyze the influence of the number of hierarchy levels and the segment length on DeVE-QA. As depicted in Figure 5(a), the results peak at 3 hierarchy layers; the segment lengths are set to 15s, 35s, and 65s based on experiments. Additionally, we observe from Figure 5(b) that increasing the segment length brings better GQA accuracy (G2 & G3), indicating that the suitable segment length is influenced by the nature of the dataset (overall duration, timestamps, etc.).

LLM Backbone. Figure 5(c) shows that GPT-4o achieves the best performance (70.8% for QA and 26.9% for GQA), followed by Gemini (69.3%) and Video-LLaVA (64.8%). These results again suggest that stronger LLMs (e.g., GPT-4o) are key to success, as indicated by the remarkable margins in both GQA and QA accuracy between GPT-4o and the alternatives. We also observe that the GQA accuracy improves when increasing the LLM size of the same model (e.g., from 10.9% of LLama2-7B to 4.6% of LLama2-13B). We speculate that a larger model is more adept at understanding nuanced relationships within the video content, which further supports our choice of large models.

Conclusion

In this paper, we proposed to study question answering on dense video events to challenge MLLMs in three aspects: dense-event captioning, long-form video understanding, and faithful multimodal reasoning by grounding. We constructed the DeVE-QA dataset with manual effort and proposed the DeVi model. DeVi is a training-free MLLM approach that addresses the aforementioned challenges with a set of tailored practices, including hierarchical dense event captioning, temporal event contextualizing and memorizing, and trustworthy QA with self-consistency checking. Our extensive experiments demonstrate the effectiveness and superiority of DeVi in performing QA in the context of dense video events. We also share some implementation alternatives and highlight the contribution of larger MLLMs to our success. With these efforts, we hope this work provides a solid foundation for QA research on dense video events.

References

  • Alayrac et al. (2022) Alayrac, J.-B.; Donahue, J.; Luc, P.; Miech, A.; Barr, I.; Hasson, Y.; Lenc, K.; Mensch, A.; Millican, K.; Reynolds, M.; et al. 2022. Flamingo: a visual language model for few-shot learning. Advances in neural information processing systems, 35: 23716–23736.
  • Bai, Wang, and Chen (2024) Bai, Z.; Wang, R.; and Chen, X. 2024. Glance and focus: Memory prompting for multi-event video question answering. Advances in Neural Information Processing Systems, 36.
  • Chen, Li, and Hu (2020) Chen, H.; Li, J.; and Hu, X. 2020. Delving deeper into the decoder for video captioning. In ECAI 2020, 1079–1086. IOS Press.
  • Dai et al. (2023) Dai, W.; Li, J.; Li, D.; Tiong, A. M. H.; Zhao, J.; Wang, W.; Li, B.; Fung, P.; and Hoi, S. 2023. InstructBLIP: towards general-purpose vision-language models with instruction tuning. In Proceedings of the 37th International Conference on Neural Information Processing Systems, 49250–49267.
  • Jang et al. (2017) Jang, Y.; Song, Y.; Yu, Y.; Kim, Y.; and Kim, G. 2017. Tgif-qa: Toward spatio-temporal reasoning in visual question answering. In Proceedings of the IEEE conference on computer vision and pattern recognition, 2758–2766.
  • Johnson, Karpathy, and Fei-Fei (2016) Johnson, J.; Karpathy, A.; and Fei-Fei, L. 2016. Densecap: Fully convolutional localization networks for dense captioning. In Proceedings of the IEEE conference on computer vision and pattern recognition, 4565–4574.
  • Kim et al. (2024) Kim, W.; Choi, C.; Lee, W.; and Rhee, W. 2024. An image grid can be worth a video: Zero-shot video question answering using a vlm. arXiv preprint arXiv:2403.18406.
  • Ko et al. (2023) Ko, D.; Lee, J. S.; Kang, W.; Roh, B.; and Kim, H. J. 2023. Large language models are temporal and causal reasoners for video question answering. arXiv preprint arXiv:2310.15747.
  • Krishna et al. (2017) Krishna, R.; Hata, K.; Ren, F.; Fei-Fei, L.; and Carlos Niebles, J. 2017. Dense-captioning events in videos. In Proceedings of the IEEE international conference on computer vision, 706–715.
  • Lei et al. (2018) Lei, J.; Yu, L.; Bansal, M.; and Berg, T. L. 2018. Tvqa: Localized, compositional video question answering. arXiv preprint arXiv:1809.01696.
  • Li et al. (2023a) Li, J.; Li, D.; Savarese, S.; and Hoi, S. 2023a. Blip-2: Bootstrapping language-image pre-training with frozen image encoders and large language models. In International conference on machine learning, 19730–19742. PMLR.
  • Li et al. (2023b) Li, K.; He, Y.; Wang, Y.; Li, Y.; Wang, W.; Luo, P.; Wang, Y.; Wang, L.; and Qiao, Y. 2023b. Videochat: Chat-centric video understanding. arXiv preprint arXiv:2305.06355.
  • Li et al. (2023c) Li, K.; Wang, Y.; He, Y.; Li, Y.; Wang, Y.; Liu, Y.; Wang, Z.; Xu, J.; Chen, G.; Luo, P.; et al. 2023c. Mvbench: A comprehensive multi-modal video understanding benchmark. arXiv preprint arXiv:2311.17005.
  • Li, Wang, and Jia (2023) Li, Y.; Wang, C.; and Jia, J. 2023. Llama-vid: An image is worth 2 tokens in large language models. arXiv preprint arXiv:2311.17043.
  • Lin et al. (2023) Lin, B.; Zhu, B.; Ye, Y.; Ning, M.; Jin, P.; and Yuan, L. 2023. Video-llava: Learning united visual representation by alignment before projection. arXiv preprint arXiv:2311.10122.
  • Lin et al. (2022) Lin, K.; Li, L.; Lin, C.-C.; Ahmed, F.; Gan, Z.; Liu, Z.; Lu, Y.; and Wang, L. 2022. Swinbert: End-to-end transformers with sparse attention for video captioning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 17949–17958.
  • Liu et al. (2024) Liu, H.; Li, C.; Wu, Q.; and Lee, Y. J. 2024. Visual instruction tuning. Advances in neural information processing systems, 36.
  • Liu, Li, and Lin (2023) Liu, Y.; Li, G.; and Lin, L. 2023. Cross-modal causal relational reasoning for event-level visual question answering. IEEE Transactions on Pattern Analysis and Machine Intelligence, 45(10): 11624–11641.
  • Ma et al. (2023) Ma, F.; Jin, X.; Wang, H.; Xian, Y.; Feng, J.; and Yang, Y. 2023. Vista-llama: Reliable video narrator via equal distance to visual tokens. arXiv preprint arXiv:2312.08870.
  • Maaz et al. (2023) Maaz, M.; Rasheed, H.; Khan, S.; and Khan, F. S. 2023. Video-chatgpt: Towards detailed video understanding via large vision and language models. arXiv preprint arXiv:2306.05424.
  • Mangalam, Akshulakov, and Malik (2024) Mangalam, K.; Akshulakov, R.; and Malik, J. 2024. Egoschema: A diagnostic benchmark for very long-form video language understanding. Advances in Neural Information Processing Systems, 36.
  • Momeni et al. (2023) Momeni, L.; Caron, M.; Nagrani, A.; Zisserman, A.; and Schmid, C. 2023. Verbs in action: Improving verb understanding in video-language models. In Proceedings of the IEEE/CVF International Conference on Computer Vision, 15579–15591.
  • OpenAI (2024a) OpenAI. 2024a. GPT-4.
  • OpenAI (2024b) OpenAI. 2024b. Hello GPT-4o.
  • Papineni et al. (2002) Papineni, K.; Roukos, S.; Ward, T.; and Zhu, W.-J. 2002. Bleu: a method for automatic evaluation of machine translation. In Proceedings of the 40th annual meeting of the Association for Computational Linguistics, 311–318.
  • Patraucean et al. (2024) Patraucean, V.; Smaira, L.; Gupta, A.; Recasens, A.; Markeeva, L.; Banarse, D.; Koppula, S.; Malinowski, M.; Yang, Y.; Doersch, C.; et al. 2024. Perception test: A diagnostic benchmark for multimodal video models. Advances in Neural Information Processing Systems, 36.
  • Radford et al. (2021) Radford, A.; Kim, J. W.; Hallacy, C.; Ramesh, A.; Goh, G.; Agarwal, S.; Sastry, G.; Askell, A.; Mishkin, P.; Clark, J.; et al. 2021. Learning transferable visual models from natural language supervision. In International conference on machine learning, 8748–8763. PMLR.
  • Reid et al. (2024) Reid, M.; Savinov, N.; Teplyashin, D.; Lepikhin, D.; Lillicrap, T.; Alayrac, J.-b.; Soricut, R.; Lazaridou, A.; Firat, O.; Schrittwieser, J.; et al. 2024. Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context. arXiv preprint arXiv:2403.05530.
  • Ren et al. (2023) Ren, S.; Yao, L.; Li, S.; Sun, X.; and Hou, L. 2023. TimeChat: A Time-sensitive Multimodal Large Language Model for Long Video Understanding. arXiv preprint arXiv:2312.02051.
  • Rohrbach et al. (2018) Rohrbach, A.; Hendricks, L. A.; Burns, K.; Darrell, T.; and Saenko, K. 2018. Object Hallucination in Image Captioning. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, 4035–4045.
  • Surís, Menon, and Vondrick (2023) Surís, D.; Menon, S.; and Vondrick, C. 2023. Vipergpt: Visual inference via python execution for reasoning. In Proceedings of the IEEE/CVF International Conference on Computer Vision, 11888–11898.
  • Vedantam, Lawrence Zitnick, and Parikh (2015) Vedantam, R.; Lawrence Zitnick, C.; and Parikh, D. 2015. Cider: Consensus-based image description evaluation. In Proceedings of the IEEE conference on computer vision and pattern recognition, 4566–4575.
  • Wang, Deng, and Jia (2024) Wang, N.; Deng, J.; and Jia, M. 2024. Cycle-Consistency Learning for Captioning and Grounding. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 38, 5535–5543.
  • Wang et al. (2018) Wang, X.; Chen, W.; Wu, J.; Wang, Y.-F.; and Wang, W. Y. 2018. Video captioning via hierarchical reinforcement learning. In Proceedings of the IEEE conference on computer vision and pattern recognition, 4213–4222.
  • Wang et al. (2023) Wang, Y.; He, Y.; Li, Y.; Li, K.; Yu, J.; Ma, X.; Li, X.; Chen, G.; Chen, X.; Wang, Y.; et al. 2023. Internvid: A large-scale video-text dataset for multimodal understanding and generation. arXiv preprint arXiv:2307.06942.
  • Xiao et al. (2024a) Xiao, J.; Huang, N.; Qin, H.; Li, D.; Li, Y.; Zhu, F.; Tao, Z.; Yu, J.; Lin, L.; Chua, T.-S.; and Yao, A. 2024a. VideoQA in the Era of LLMs: An Empirical Study. arXiv preprint arXiv:2408.04223.
  • Xiao et al. (2021) Xiao, J.; Shang, X.; Yao, A.; and Chua, T.-S. 2021. Next-qa: Next phase of question-answering to explaining temporal actions. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 9777–9786.
  • Xiao et al. (2024b) Xiao, J.; Yao, A.; Li, Y.; and Chua, T.-S. 2024b. Can i trust your answer? visually grounded video question answering. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 13204–13214.
  • Xu et al. (2017) Xu, D.; Zhao, Z.; Xiao, J.; Wu, F.; Zhang, H.; He, X.; and Zhuang, Y. 2017. Video question answering via gradually refined attention over appearance and motion. In Proceedings of the 25th ACM international conference on Multimedia, 1645–1653.
  • Xu et al. (2024) Xu, L.; Zhao, Y.; Zhou, D.; Lin, Z.; Ng, S. K.; and Feng, J. 2024. Pllava: Parameter-free llava extension from images to videos for video dense captioning. arXiv preprint arXiv:2404.16994.
  • Yang et al. (2021) Yang, A.; Miech, A.; Sivic, J.; Laptev, I.; and Schmid, C. 2021. Just ask: Learning to answer questions from millions of narrated videos. In Proceedings of the IEEE/CVF international conference on computer vision, 1686–1697.
  • Yang et al. (2023) Yang, A.; Nagrani, A.; Seo, P. H.; Miech, A.; Pont-Tuset, J.; Laptev, I.; Sivic, J.; and Schmid, C. 2023. Vid2seq: Large-scale pretraining of a visual language model for dense video captioning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 10714–10726.
  • Yin et al. (2023) Yin, C.; Che, Z.; Wu, K.; Xu, Z.; Qiu, Q.; and Tang, J. 2023. Cross-Modal Reasoning with Event Correlation for Video Question Answering. arXiv preprint arXiv:2312.12721.
  • Yu et al. (2024) Yu, S.; Cho, J.; Yadav, P.; and Bansal, M. 2024. Self-chained image-language model for video localization and question answering. Advances in Neural Information Processing Systems, 36.
  • Yu et al. (2019) Yu, Z.; Xu, D.; Yu, J.; Yu, T.; Zhao, Z.; Zhuang, Y.; and Tao, D. 2019. Activitynet-qa: A dataset for understanding complex web videos via question answering. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 33, 9127–9134.
  • Zhang et al. (2023a) Zhang, C.; Lu, T.; Islam, M. M.; Wang, Z.; Yu, S.; Bansal, M.; and Bertasius, G. 2023a. A simple llm framework for long-range video question-answering. arXiv preprint arXiv:2312.17235.
  • Zhang, Li, and Bing (2023) Zhang, H.; Li, X.; and Bing, L. 2023. Video-llama: An instruction-tuned audio-visual language model for video understanding. arXiv preprint arXiv:2306.02858.
  • Zhang et al. (2023b) Zhang, R.; Han, J.; Liu, C.; Gao, P.; Zhou, A.; Hu, X.; Yan, S.; Lu, P.; Li, H.; and Qiao, Y. 2023b. Llama-adapter: Efficient fine-tuning of language models with zero-init attention. arXiv preprint arXiv:2303.16199.

Appendix A: DeVE-QA Dataset Construction

The ActivityNet-Captions dataset (Krishna et al. 2017) is the data source of DeVE-QA. It contains 20k videos amounting to 849 hours with 100k descriptions, each with its unique start and end timestamps. On average, the captions for each video describe 94.6% of the entire video content (Johnson, Karpathy, and Fei-Fei 2016), indicating that the caption annotations cover the major events within the video. Furthermore, 10% of the temporal descriptions overlap with each other, showing that the captions also cover simultaneous events. Having selected ActivityNet-Captions as our data source, we first filter the raw data with the criteria that 1) the descriptions should be more than 10 words, and 2) the captions for each video cover at least 95% of the video. We then perform random sampling over ActivityNet-Captions to get the final subset of 10,643 videos and 26,111 captions.

Automatic QA Generation

During the generation process, we first perform automated QA generation from dense event captions by prompting GPT-4.0. Specifically, we feed the 26,111 event captions of ActivityNet-Captions into GPT-4.0 and prompt it to generate multiple (at most 3, to limit the cost) different question-answer pairs pertaining to different aspects of a particular event caption.

During the QA generation process, we also compare the one-shot and n-shot prompting strategies. Specifically, the one-shot strategy prompts once for all N captions. It is cost-efficient because it sends fewer tokens to GPT-4. However, the generated questions appear to be of low quality and are often similar to each other. Alternatively, the n-shot strategy prompts separately for each caption. It is less cost-efficient than the one-shot strategy because the prompt is attached to every request, but it significantly improves the quality of the generated QAs. We speculate that n-shot prompting is able to utilize more tailored and content-specific information from each caption, which reduces redundancy and enhances the diversity and relevance of the generated questions. In contrast, one-shot prompting likely generates questions by using the information from all N captions simultaneously, even though these captions are intended to be separate entities. Considering the quality, we eventually opt for the n-shot prompting strategy.
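The cost difference between the two strategies can be made concrete with a minimal sketch that only builds the prompt strings and does not call any model. `INSTRUCTION` is a hypothetical, shortened stand-in for the full instruction in Table 10, and both function names are illustrative.

```python
from typing import List

# Hypothetical, shortened stand-in for the instruction text in Table 10.
INSTRUCTION = (
    "Please generate up to 3 QA pairs for each description. "
    "Here are the descriptions: {descriptions}."
)


def one_shot_prompts(captions: List[str]) -> List[str]:
    """One request that covers all N captions of a video at once."""
    numbered = " ".join(f"({i}) {c}" for i, c in enumerate(captions, start=1))
    return [INSTRUCTION.format(descriptions=numbered)]


def n_shot_prompts(captions: List[str]) -> List[str]:
    """One request per caption, so the instruction text is sent N times."""
    return [INSTRUCTION.format(descriptions=c) for c in captions]
```

With N captions, the n-shot variant repeats the instruction N times, which accounts for its higher token cost; in exchange, each request contains only one caption, so the model cannot mix information across events.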

You are a good question generator. I need your help in generating question-answer pairs pertaining to the visual event descriptions. Below are the examples: Given description: An elderly man is playing the piano in front of a crowd. Good generated Question-Answer (QA) pairs can be: Q: What is the elderly man doing in front of a crowd? A: Playing the piano. Q: Why is a crowd in front of an elderly man? A: Watch him playing the piano. Q: How did the elderly man attract the crowd? A: Playing the piano. Given description: A woman walks to the piano and briefly talks to the elderly man. Good QAs can be: Q: Why did the woman walk to the piano? A: Talks to the elderly man. Q: What does the woman do before talking to the elderly man? A: Walk to the piano. Q: What does the woman do after walking to the piano? A: Talks to the elderly man. Please generate up to 3 QA pairs for each description, and limit the generated questions to a maximal 22 words while the answers to a maximal 6 words. I hope your questions feature different causal and temporal reasoning keywords such as ’why’ and ’how’, ’before’ and ’after’. Different questions should be diverse and be related to different aspects of the described events. Also, make sure the answer is correct according to the description. … Please label each question in sequence. Here are the descriptions: {descriptions}.
Table 10: Prompt for question generation.

Distractor Answers Retrieval

After the QA generation process, the questions and their corresponding correct answers are obtained. To curate the distractor answers and form multiple choices, we take the following steps. For each question, we first retrieve its Top-10 similar questions and use their correct answers as candidate wrong answers. In particular, the Top-10 similar questions are obtained by the similarity of the first 3 words, which indicate both the question type and the subject of the activity. To ensure hard negatives, we additionally filter out video-irrelevant candidate answers. Specifically, for each question and its corresponding temporal segment, we sample video frames that are outside this temporal segment (covering its left or right parts) and use them to further retrieve the candidate answers by calculating the cross-modal similarity between the frames and other related QA pairs. We then select two such candidate answers that are relevant to the video but not to the target segment, thus encouraging temporal grounding to answer the questions. To further encourage spatial reasoning, we include one candidate answer that is related to the segment but is wrong regarding the question. Finally, we randomly select one candidate answer from the Top-K answer list to form 5 options, including the correct answer, for each question. Note that for all questions, the correct answers are randomly but evenly inserted into the 5 options. We also perform QA filtration to remove meaningless questions and analyze the key activities inside the videos.
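The first retrieval step can be sketched as follows. The text does not specify how the similarity of the first 3 words is measured, so this sketch assumes the simplest option: counting how many leading words match position by position. The function names, the tie-breaking rule, and the de-duplication are likewise assumptions, and the later cross-modal filtering step is not covered.

```python
from typing import List, Tuple

QA = Tuple[str, str]  # (question, correct answer)


def prefix_similarity(q1: str, q2: str, n_words: int = 3) -> int:
    """Number of leading words (at most n_words) that match position by position."""
    score = 0
    for a, b in zip(q1.lower().split()[:n_words], q2.lower().split()[:n_words]):
        if a != b:
            break
        score += 1
    return score


def candidate_distractors(target: QA, pool: List[QA], top_k: int = 10) -> List[str]:
    """Answers of the top_k most similar questions, used as candidate wrong answers."""
    question, answer = target
    scored = [
        (prefix_similarity(question, q), a)
        for q, a in pool
        if q != question and a.lower() != answer.lower()
    ]
    # Python's sort is stable, so ties keep their original order in the pool.
    scored.sort(key=lambda item: item[0], reverse=True)
    candidates: List[str] = []
    for score, a in scored:
        if score > 0 and a not in candidates:
            candidates.append(a)
    return candidates[:top_k]
```

Candidates identical to the correct answer are dropped up front, since a distractor must not be a potential correct answer.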

Figure 7: Manual curation examples.
Figure 8: Word cloud of frequent words in the answers of the (a) training and (b) validation sets.
Figure 9: Detailed prediction visualization on DeVE-QA.

Manual QA Checking and Curation

As all QAs are automatically generated, a manual curation process is necessary to ensure the quality of the questions and the effectiveness of the candidate answers. We therefore perform manual checking and correction following two requirements: 1) none of the 4 distractor answers should be a potential correct answer; 2) the distractor answers should logically answer the given question, not overlap with each other, and be closely related to the video content. We particularly emphasize checking for potentially correct distractor answers that might lead to confusing and controversial results. The checking process involved 35 volunteers and 267 hours, and around 74% of the QA pairs were modified. Figure 7 shows some manual curation examples of overlapping answers, potentially correct distractor answers, and logically irrelevant answers.

DeVE-QA Examples

Figure 10 shows some examples in DeVE-QA.

Figure 10: DeVE-QA examples.

Appendix B: DeVi Design and Analysis

You are a helpful expert in dense event video analysis. Given multiple clips {video_clips} of different temporal length from a video, please provide a caption for each clip, focusing on capturing all the dense events and activities occurring within it. Your caption should succinctly describe the sequence of actions, highlighting key movements, interactions, and significant moments. Be detailed and descriptive, providing context for the viewer to understand the intensity and intricacy of the events unfolding.
Table 11: Prompt for hierarchical dense event captioning.
You are a highly intelligent language agent in improving the quality of video captions. Given a set of captions (each representing a different time segment of a video) and a question of a video, you are required to refine each caption by incorporating contextual information from all the other captions and question via analyzing the overall narrative, identifying relevant context and incorporate context with coherence. Here are the captions and questions: ${hierarchical captions} and {question}. Here are the examples: Original Caption: A person is holding a knife and waving it around. Contextualized Caption: A person is holding a knife and chopping down a tree. Original Caption: A person takes off their clothes by the river and jumps into the water to swim. Contextualized Caption: A person takes off their clothes by the river and jumps into the water to save someone who is drowning. Original Caption: A person is waving a spatula in the kitchen. Contextualized Caption: A person is using a spatula in the kitchen to chase away a squirrel that has entered. After that, please provide a comprehensive synopsis according to all the captions of the entire video, with all key temporal actions, characters and interactions included.
Table 12: Prompt of temporal contextualization.
You are a helpful expert in dense event video analysis. I will provide some video descriptions and one multiple-choice question about the video. The descriptions have three different levels of lengths, which are differentiated by labels. Specifically, labels with "S" mean the descriptions are the captions every $L_s$ seconds; labels with "M" mean the descriptions are the captions every $L_m$ seconds; and labels with "L" mean the descriptions are the captions every $L_l$ seconds. The descriptions are sequential and non-overlapping which cover the whole video exactly. Here are some examples: {caption_groups_examples}. The video is {dur} seconds long. Please select a correct answer from {C} to the question {q} based on the event descriptions {event_captions} and also provide the minimum time interval $[t_s, t_e]$ of the event that carries the correct answer… Your answer must follow this format: Answer (A, B, C, D, or E), [frame_start_index, frame_end_index]. Here are some examples: #Example1: A, [5, 19] #Example2: B, [30, 60] #Example3: C, [1, 10] and [50, 60]. You must not provide any other response or explanation.
Table 13: Prompt for Event-Grounded QA.
You are a helpful expert in dense event video analysis. You have been provided with video descriptions and one multiple-choice question about the video and gave out your answer and the minimum frame(s) interval to support. However, after our professional check, we consider your answer inconsistent because the self-similarity between your previous answer {Previous_Answer} and {Supportive_Frames} is only {Self_Consistency_Score}. On this premise, I want you to answer this question again: {Prompts_for_Event-Grounded_QA} and judge whether your answer is consistent with the previous one. If no, analyze the inconsistency in detail. If yes, explain how the answer relates to the video frames.
Table 14: Prompt for dynamic verification.
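The answer format requested in Table 13 (an option letter followed by one or more bracketed intervals) lends itself to mechanical parsing. The following is a minimal sketch of such a parser; the regular expressions and the function name `parse_grounded_answer` are our own and are not part of DeVi's released code.

```python
import re
from typing import List, Tuple

# An option letter A-E at the start of the response, followed by a comma.
ANSWER_RE = re.compile(r"^\s*([A-E])\s*,")
# Any number of "[start, end]" integer intervals.
SPAN_RE = re.compile(r"\[\s*(\d+)\s*,\s*(\d+)\s*\]")


def parse_grounded_answer(response: str) -> Tuple[str, List[Tuple[int, int]]]:
    """Split a response such as 'C, [1, 10] and [50, 60]' into letter and intervals."""
    match = ANSWER_RE.match(response)
    if match is None:
        raise ValueError(f"no option letter found in: {response!r}")
    spans = [(int(s), int(e)) for s, e in SPAN_RE.findall(response)]
    if not spans:
        raise ValueError(f"no interval found in: {response!r}")
    return match.group(1), spans
```

Raising an error on malformed output, rather than guessing, makes it easy to detect responses that ignore the format instruction and to re-prompt for them.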

Hierarchical Dense Event Captioning

The dense events within videos often intertwine and vary in duration, posing a challenge for machines to accurately segment them for captioning. We propose the hierarchical dense event captioning approach to gain a comprehensive understanding of events over different time scales. Specifically, DeVi first splits the video sequentially, with no overlaps, into segments at three hierarchical length levels (e.g., 15s, 35s, and 65s for DeVE-QA). Then, 5/7/13 frames are sampled uniformly from each video segment of the respective level and sent to Video-LLaVA (Lin et al. 2023), which produces segment captions with a prompt designed for the different length levels so as to capture event information at different levels. The full prompt is shown in Table 11.
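The segmentation and frame sampling described above can be sketched as follows, using the DeVE-QA settings (15s/35s/65s segments with 5/7/13 frames each). Two details are assumptions on our part: the last segment of a level is simply allowed to be shorter, and "uniform" sampling is read as taking the centers of equal-width bins. All names are illustrative.

```python
from typing import Dict, List, Tuple

Span = Tuple[float, float]

# Segment length in seconds -> frames sampled per segment (DeVE-QA settings).
LEVELS: Dict[int, int] = {15: 5, 35: 7, 65: 13}


def split_video(duration: float, seg_len: float) -> List[Span]:
    """Sequential, non-overlapping segments; the last one may be shorter."""
    segments = []
    start = 0.0
    while start < duration:
        end = min(start + seg_len, duration)
        segments.append((start, end))
        start = end
    return segments


def sample_times(segment: Span, n_frames: int) -> List[float]:
    """Timestamps at the centers of n_frames equal-width bins of the segment."""
    start, end = segment
    step = (end - start) / n_frames
    return [start + (i + 0.5) * step for i in range(n_frames)]


def hierarchical_plan(duration: float) -> Dict[int, List[Tuple[Span, List[float]]]]:
    """For each level, list every segment with the timestamps of its sampled frames."""
    return {
        seg_len: [(seg, sample_times(seg, n)) for seg in split_video(duration, seg_len)]
        for seg_len, n in LEVELS.items()
    }
```

For a 120-second video, `hierarchical_plan(120.0)` yields 8, 4, and 2 segments at the three levels, each of which would then be captioned separately.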

Temporal Event Memory

We design the temporal event memory to contextualize event captions while also storing the original visual and event representations, in order to capture long-term dependencies between events. Specifically, the hierarchical video event captions $\{E\}$ are first written to the temporal event memory. At the same time, we sample the original video at 1 fps and apply the video encoder CLIP ViT-L/14 (Radford et al. 2021) to extract visual features $f_v$ from the entire video, which are also stored in the temporal event memory. The visual features are later read by the self-consistency checking module to estimate the cross-modal similarity with predicted answers.

After that, we capture the long-term relationships between events by prompting LLMs to produce enhanced video event captions $\{E'\}$ with the entire video context. We instruct the LLM to refine each event caption using information from all other event captions and any given question, focusing on understanding the overall story and incorporating relevant details coherently. To aid the LLM, we also provide examples for in-context learning before generating captions (see Table 12). Furthermore, we ask the LLM to create a synopsis $e_y$ of the video (see Table 12) to enhance the contextualized event captions; it also serves as a global overview of the entire video. This expanded set of events, including the synopsis, improves the understanding of event relationships across different time scales.

Overall, the temporal event memory $M=\{E, E', f_v\}$ describes and links the relevant occurrences of dense events over long time periods to aid in answering questions and grounding results about specific events.
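As a rough illustration of the memory $M=\{E, E', f_v\}$, the sketch below models it as a plain container. All class, field, and method names are hypothetical. Keeping the synopsis in its own field and aligning feature index i with second i (following the 1 fps sampling) are assumptions made for this example.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class EventCaption:
    start: float  # segment start in seconds
    end: float    # segment end in seconds
    level: int    # index of the hierarchy level the segment belongs to
    text: str     # caption text


@dataclass
class TemporalEventMemory:
    captions: List[EventCaption] = field(default_factory=list)        # E
    contextualized: List[EventCaption] = field(default_factory=list)  # E'
    synopsis: Optional[str] = None                                     # e_y (assumed separate field)
    visual_features: List[List[float]] = field(default_factory=list)  # f_v, one vector per second

    def features_in_span(self, t_s: float, t_e: float) -> List[List[float]]:
        """Return the feature vectors whose timestamps fall in [t_s, t_e].

        Assumption: feature i was extracted from the frame at second i.
        """
        return [f for i, f in enumerate(self.visual_features) if t_s <= i <= t_e]

    def captions_overlapping(self, t_s: float, t_e: float) -> List[EventCaption]:
        """Return contextualized captions whose segment overlaps [t_s, t_e]."""
        return [c for c in self.contextualized if c.start < t_e and c.end > t_s]
```

Storing the raw captions alongside the contextualized ones keeps the original evidence available when a later stage needs to re-check a prediction.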

Self-inconsistency in Event-Grounded QA

As described in the main text, we evaluate self-consistency using the cosine similarity $R$ between the answer $a$ and the video features within the time span $[t_s, t_e]$, and compare it with a threshold $\sigma$. We then conduct an error analysis on over three hundred samples from the DeVE-QA dataset. Specifically, we ask volunteers to manually check the predicted GQA results together with the videos, captions, synopsis, etc., and to annotate the reason for each error. The results show that 82% of errors originate from the event-grounded QA process, while less than 10% are attributed to caption quality and 8% to other causes (including the synopsis, meaningless answers, etc.). These findings not only validate the effectiveness of the hierarchical dense event captioning strategy but also highlight the challenges of DE-VideoQA tasks in both question answering and temporal grounding.
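The similarity test at the start of this subsection can be sketched as below. This is only an illustration under stated assumptions: the answer is assumed to be already embedded in the same space as the visual features (for example by a matching text encoder), the features inside $[t_s, t_e]$ are mean-pooled before comparison, and feature index i is taken to correspond to second i. The function names are hypothetical, and how the features are actually aggregated is not specified here.

```python
import math
from typing import List, Sequence


def mean_pool(vectors: Sequence[Sequence[float]]) -> List[float]:
    """Average a non-empty list of equal-length vectors component-wise."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def cosine_similarity(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine of the angle between u and v; 0.0 if either has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def consistency_score(answer_emb: Sequence[float],
                      frame_feats: Sequence[Sequence[float]],
                      t_s: int, t_e: int) -> float:
    """Score R between the answer and the features in [t_s, t_e].

    Assumption: frame_feats[i] is the feature of the frame at second i.
    An empty span yields a score of 0.0.
    """
    span = [f for i, f in enumerate(frame_feats) if t_s <= i <= t_e]
    if not span:
        return 0.0
    return cosine_similarity(answer_emb, mean_pool(span))


def is_self_consistent(answer_emb, frame_feats, t_s, t_e, sigma: float) -> bool:
    """True when the score R reaches the threshold sigma."""
    return consistency_score(answer_emb, frame_feats, t_s, t_e) >= sigma
```

A prediction whose grounded span does not visually support the answer produces a low score and fails the check, which is the signal used to trigger re-verification.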

Therefore, we focus on designing a better prompt for the last stage of event-grounded QA that makes use of the feedback from self-consistency checking. Specifically, we craft the dynamic verification prompt shown in Table 14. When the similarity score $R_{vt}$ is smaller than $\sigma$, DeVi resubmits the captions and the QA pair together with this dynamic verification prompt to the LLM, which improves the reliability and transparency of the model's responses. In particular, the dynamic verification prompt is designed to feed back the self-consistency checking results between the LLM's answer prediction and the supporting video evidence from the previous round. The model is then required to re-answer the question with this extra information. If the results from the two rounds are consistent, the model must elaborate on the relationship between the answer and the video segments. Otherwise, it must explain the reasons for the inconsistency. We speculate that, by justifying its answers, the LLM considers the underlying logic and relationships, which can lead to more accurate and contextually relevant responses and thus improve GQA accuracy.

Overall, this process helps the model identify and correct potential errors, as the model cross-checks its reasoning against the given previous prediction, ultimately enhancing its performance in the GQA task that requires complex understanding and decision-making.
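The verification loop described in this subsection can be outlined as follows. This is a hedged sketch rather than the actual pipeline: `ask_llm`, `score`, and `build_verification_prompt` are hypothetical callables standing in for the LLM call, the self-consistency score, and the prompt of Table 14, and the `max_rounds` cap is an assumption added so the loop always terminates.

```python
from typing import Callable, Tuple

# A prediction is (answer, t_s, t_e): the answer and its grounded time span.
Prediction = Tuple[str, float, float]


def answer_with_verification(
    ask_llm: Callable[[str], Prediction],
    score: Callable[[Prediction], float],
    base_prompt: str,
    build_verification_prompt: Callable[[str, Prediction, float], str],
    sigma: float,
    max_rounds: int = 2,  # assumption: cap on the number of LLM calls
) -> Prediction:
    """Answer once, then re-ask while the consistency score is below sigma."""
    prediction = ask_llm(base_prompt)
    for _ in range(max_rounds - 1):
        r = score(prediction)
        if r >= sigma:
            break  # answer and grounded span agree; keep the prediction
        # Feed the previous prediction and its score back to the LLM.
        prompt = build_verification_prompt(base_prompt, prediction, r)
        prediction = ask_llm(prompt)
    return prediction
```

Passing the scoring function and prompt builder as arguments keeps the loop independent of any particular encoder or prompt wording.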

Further Analysis

Time usage (s)    QA      GQA
DeVi              1.83    2.12
LLoVi             1.43    1.68

Table 15: Efficiency analysis by time usage.

Event sample.

To further illustrate the mechanism behind DeVi, we visualize an example in Figure 9. The example shows that baseline models such as SeViLA and Temp[CLIP] tend to answer the question without truly grounding it in the related video segments. Hierarchical Dense Captioning (HDC) helps DeVi understand events at different scales, Temporal Contextualizing (TC) improves GQA by refining or correcting isolated captions according to the related context, and Self-consistency Checking (SC) is effective in correcting wrongly grounded segments.

Efficiency analysis.

To evaluate the efficiency of DeVi, we also analyze the time consumption of DeVi and LLoVi (experiments are performed on an NVIDIA A800 GPU). Specifically, we randomly sample one thousand samples from DeVE-QA and measure the response time of both models on the QA and GQA tasks; the results are shown in Table 15. DeVi and LLoVi reach comparable efficiency on both tasks, with DeVi taking about 0.4 s longer (1.83 s vs. 1.43 s on QA and 2.12 s vs. 1.68 s on GQA). We speculate that this results from the more comprehensive mechanism inside DeVi, especially the self-consistency checking, which may lead to multiple rounds of reasoning. Moreover, the GQA task costs more time than the QA task for both models.