Conversational Speech Recognition by Learning Audio-textual Cross-modal Contextual Representation

Wei, Kun; Li, Bei; Lv, Hang; Lu, Quan; Jiang, Ning; Xie, Lei

doi:10.1109/TASLP.2024.3389630

arXiv:2310.14278 (cs)

[Submitted on 22 Oct 2023 (v1), last revised 28 Apr 2024 (this version, v2)]

Abstract: Automatic Speech Recognition (ASR) in conversational settings presents unique challenges, including extracting relevant contextual information from previous conversational turns. Due to irrelevant content, error propagation, and redundancy, existing methods struggle to extract longer and more effective contexts. To address this issue, we introduce a novel conversational ASR system, extending the Conformer encoder-decoder model with cross-modal conversational representation. Our approach leverages a cross-modal extractor that combines pre-trained speech and text models through a specialized encoder and a modal-level mask input. This enables the extraction of richer historical speech context without explicit error propagation. We also incorporate conditional latent variational modules to learn conversational level attributes such as role preference and topic coherence. By introducing both cross-modal and conversational representations into the decoder, our model retains context over longer sentences without information loss, achieving relative accuracy improvements of 8.8% and 23% on Mandarin conversation datasets HKUST and MagicData-RAMC, respectively, compared to the standard Conformer model.
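
The abstract mentions a "modal-level mask input" but does not spell out the mechanism. As a rough illustration only, the sketch below shows one plausible reading: during training, an entire modality (all speech frames or all text tokens) is occasionally blanked so that a downstream extractor cannot rely on a single stream. The function name `modal_level_mask`, the masking probabilities, and the zero-fill choice are all assumptions made for this example and are not taken from the paper.

```python
import random
from typing import List, Sequence, Tuple

Vector = List[float]


def modal_level_mask(
    speech: Sequence[Vector],
    text: Sequence[Vector],
    p_mask_speech: float,
    p_mask_text: float,
    rng: random.Random,
) -> Tuple[List[Vector], List[Vector], str]:
    """Blank at most one whole modality and return both streams.

    Hypothetical helper: the probabilities and the zero-fill strategy
    are illustrative assumptions, not the authors' implementation.
    """
    if p_mask_speech < 0 or p_mask_text < 0 or p_mask_speech + p_mask_text > 1:
        raise ValueError("masking probabilities must be >= 0 and sum to <= 1")

    # A single draw decides which modality (if any) is masked,
    # so the two modalities are never masked at the same time.
    draw = rng.random()
    if draw < p_mask_speech:
        masked = "speech"
    elif draw < p_mask_speech + p_mask_text:
        masked = "text"
    else:
        masked = "none"

    # Masking replaces every vector of the chosen modality with zeros;
    # the other modality is copied through unchanged.
    speech_out = [[0.0] * len(v) if masked == "speech" else list(v) for v in speech]
    text_out = [[0.0] * len(v) if masked == "text" else list(v) for v in text]
    return speech_out, text_out, masked
```

Calling `modal_level_mask(speech, text, 0.15, 0.15, random.Random(0))` would, under these assumed settings, leave both streams intact about 70% of the time and blank one stream otherwise. Because the mask is applied to a whole modality rather than to individual frames or tokens, the sequence lengths are preserved.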

Comments:	TASLP
Subjects:	Sound (cs.SD); Computation and Language (cs.CL); Audio and Speech Processing (eess.AS)
Cite as:	arXiv:2310.14278 [cs.SD]
	(or arXiv:2310.14278v2 [cs.SD] for this version)
	https://doi.org/10.48550/arXiv.2310.14278
Journal reference:	IEEE/ACM Transactions on Audio, Speech, and Language Processing, 2024
Related DOI:	https://doi.org/10.1109/TASLP.2024.3389630

Submission history

From: Kun Wei
[v1] Sun, 22 Oct 2023 11:57:33 UTC (479 KB)
[v2] Sun, 28 Apr 2024 03:50:18 UTC (447 KB)
