Cross-modal fusion techniques for utterance-level emotion recognition from text and speech

Luo, Jiachen; Phan, Huy; Reiss, Joshua

arXiv:2302.02447 (eess)

[Submitted on 5 Feb 2023]

Abstract: Multimodal emotion recognition (MER) is a fundamental and complex research problem due to the uncertainty of human emotional expression and the heterogeneity gap between different modalities. Audio and text modalities are particularly important for a human participant in understanding emotions. Although many successful attempts have been made to design multimodal representations for MER, multiple challenges remain to be addressed: 1) bridging the heterogeneity gap between multimodal features and modelling the inter- and intra-modal interactions of multiple modalities; 2) effectively and efficiently modelling the contextual dynamics in the conversation sequence. In this paper, we propose the Cross-Modal RoBERTa (CM-RoBERTa) model for emotion detection from spoken audio and corresponding transcripts. As the core unit of CM-RoBERTa, parallel self- and cross-attention is designed to dynamically capture inter- and intra-modal interactions of audio and text. Specifically, mid-level fusion and a residual module are employed to model long-term contextual dependencies and learn modality-specific patterns. We evaluate the approach on the MELD dataset, and the experimental results show that the proposed approach achieves state-of-the-art performance on the dataset.
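The parallel self- and cross-attention idea described in the abstract can be illustrated with a minimal, dependency-free sketch. This is not the CM-RoBERTa implementation: the function names, the omission of learned query/key/value projections, the shared feature dimension for text and audio, and the residual sum used to merge the two branches are all assumptions made for illustration only.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention(queries, keys, values):
    """Scaled dot-product attention, without learned projections (assumption)."""
    scale = math.sqrt(len(keys[0]))
    outputs = []
    for q in queries:
        # Similarity of this query to every key, scaled by sqrt(dimension).
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / scale for k in keys]
        weights = softmax(scores)
        # Weighted average of the value vectors.
        outputs.append([
            sum(w * v[col] for w, v in zip(weights, values))
            for col in range(len(values[0]))
        ])
    return outputs


def parallel_self_cross(text, audio):
    """Run self- and cross-attention side by side on the text stream.

    Hypothetical merge step: the two branch outputs are added to the
    input as a residual sum. The paper's actual fusion may differ.
    """
    intra = attention(text, text, text)    # intra-modal: text attends to text
    inter = attention(text, audio, audio)  # inter-modal: text attends to audio
    return [
        [t + a + b for t, a, b in zip(t_row, a_row, b_row)]
        for t_row, a_row, b_row in zip(text, intra, inter)
    ]


# Toy inputs: 2 text tokens and 3 audio frames, both with 2 features.
text = [[1.0, 0.0], [0.0, 1.0]]
audio = [[0.5, 0.5], [0.5, 0.5], [0.5, 0.5]]
fused = parallel_self_cross(text, audio)
print(len(fused), len(fused[0]))
```

The fused output keeps one vector per text token, so it can be passed on to later layers in place of the original text features. A symmetric call with the arguments swapped would give the audio stream its own self- and cross-attended representation.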

Comments:	6 pages, 2 figures
Subjects:	Audio and Speech Processing (eess.AS); Sound (cs.SD)
Cite as:	arXiv:2302.02447 [eess.AS]
	(or arXiv:2302.02447v1 [eess.AS] for this version)
	https://doi.org/10.48550/arXiv.2302.02447

Submission history

From: Jiachen Luo
[v1] Sun, 5 Feb 2023 18:16:12 UTC (1,023 KB)
