video-SALMONN: Speech-Enhanced Audio-Visual Large Language Models

Sun, Guangzhi; Yu, Wenyi; Tang, Changli; Chen, Xianzhao; Tan, Tian; Li, Wei; Lu, Lu; Ma, Zejun; Wang, Yuxuan; Zhang, Chao

arXiv:2406.15704v1 (cs)

[Submitted on 22 Jun 2024]

Abstract: Speech understanding, as one element of the broader task of video understanding with audio-visual large language models (av-LLMs), is a crucial yet understudied aspect. This paper proposes video-SALMONN, a single end-to-end av-LLM for video processing that can understand not only visual frame sequences, audio events and music, but speech as well. To obtain the fine-grained temporal information required by speech understanding while remaining efficient for other video elements, this paper proposes a novel multi-resolution causal Q-Former (MRC Q-Former) structure to connect pre-trained audio-visual encoders and the backbone large language model. Moreover, dedicated training approaches, including a diversity loss and an unpaired audio-visual mixed training scheme, are proposed to avoid dominance by particular frames or a single modality. On the introduced speech-audio-visual evaluation benchmark, video-SALMONN achieves more than 25% absolute accuracy improvement on the video-QA task and over 30% absolute accuracy improvement on audio-visual QA tasks with human speech. In addition, video-SALMONN demonstrates remarkable video comprehension and reasoning abilities on tasks that other av-LLMs have not previously handled. Our training code and model checkpoints are available at this https URL.
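
As a rough illustration of two ideas named in the abstract, the Python sketch below shows (a) how a frame sequence might be split into windows at several temporal resolutions, and (b) one possible form of a diversity penalty over output vectors. The abstract gives no formulas, so everything here is an assumption made for illustration: the function names, the (window size, queries per window) pairs, and the choice of mean pairwise cosine similarity are hypothetical and are not taken from the paper or its released code. The causal part of the MRC Q-Former is not modeled.

```python
import math
from itertools import combinations


def split_windows(num_frames, window_size):
    """Return (start, end) index pairs that cover num_frames in consecutive chunks."""
    return [
        (start, min(start + window_size, num_frames))
        for start in range(0, num_frames, window_size)
    ]


def multi_resolution_plan(num_frames, levels):
    """Count windows and output tokens for each hypothetical resolution level.

    levels is a list of (window_size, queries_per_window) pairs; both numbers
    are illustrative assumptions, not values from the paper.
    """
    plan = []
    for window_size, queries_per_window in levels:
        windows = split_windows(num_frames, window_size)
        plan.append(
            {
                "window_size": window_size,
                "windows": len(windows),
                "tokens": len(windows) * queries_per_window,
            }
        )
    return plan


def cosine(u, v):
    """Cosine similarity of two non-zero vectors given as lists of floats."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def diversity_penalty(vectors):
    """Mean pairwise cosine similarity; lower values mean more diverse vectors.

    This is an assumed stand-in for a diversity loss, not the paper's definition.
    Requires at least two non-zero vectors.
    """
    pairs = list(combinations(vectors, 2))
    return sum(cosine(u, v) for u, v in pairs) / len(pairs)


if __name__ == "__main__":
    # Small windows give fine temporal detail; large windows summarize coarsely.
    for level in multi_resolution_plan(10, [(1, 2), (5, 10)]):
        print(level)
    print(diversity_penalty([[1.0, 0.0], [0.0, 1.0]]))
```

In this toy setup, a small window size yields many short windows (fine temporal detail, as speech would need), while a large one yields a few coarse windows; the penalty is high when output vectors point in the same direction and low when they differ.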

Comments:	Accepted at ICML 2024. arXiv admin note: substantial text overlap with arXiv:2310.05863
Subjects:	Computer Vision and Pattern Recognition (cs.CV)
Cite as:	arXiv:2406.15704 [cs.CV]
	(or arXiv:2406.15704v1 [cs.CV] for this version)
	https://doi.org/10.48550/arXiv.2406.15704

Submission history

From: Changli Tang
[v1] Sat, 22 Jun 2024 01:36:11 UTC (15,386 KB)
