Multistage linguistic conditioning of convolutional layers for speech emotion recognition

Triantafyllopoulos, Andreas; Reichel, Uwe; Liu, Shuo; Huber, Stephan; Eyben, Florian; Schuller, Björn W.

doi:10.3389/fcomp.2023.1072479

Computer Science > Machine Learning

arXiv:2110.06650v2 (cs)

[Submitted on 13 Oct 2021 (v1), last revised 4 May 2022 (this version, v2)]

Title: Multistage linguistic conditioning of convolutional layers for speech emotion recognition

Authors: Andreas Triantafyllopoulos, Uwe Reichel, Shuo Liu, Stephan Huber, Florian Eyben, Björn W. Schuller

Abstract: In this contribution, we investigate the effectiveness of deep fusion of text and audio features for categorical and dimensional speech emotion recognition (SER). We propose a novel, multistage fusion method where the two information streams are integrated in several layers of a deep neural network (DNN), and contrast it with a single-stage method where the streams are merged at a single point. Both methods extract summary linguistic embeddings from a pre-trained BERT model and use them to condition one or more intermediate representations of a convolutional model operating on log-Mel spectrograms. Experiments on the MSP-Podcast and IEMOCAP datasets demonstrate that the two fusion methods clearly outperform a shallow (late) fusion baseline and their unimodal constituents, both in terms of quantitative performance and qualitative behaviour. Overall, our multistage fusion shows better quantitative performance, surpassing the alternatives on most of our evaluations. This illustrates the potential of multistage fusion to better assimilate text and audio information.
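The difference between single-stage and multistage fusion can be illustrated with a small sketch. The abstract does not state how the conditioning is computed, so the code below assumes an additive per-channel shift obtained from a linear projection of the text embedding; this is an illustrative choice, not necessarily the mechanism used in the paper. All names (`linear`, `condition`, `audio_block`, `forward`, `fuse_at`) are hypothetical, and the "convolutional block" is reduced to a ReLU so that the example stays dependency-free.

```python
def linear(vec, weights):
    """Project vec (length d) with a weight matrix shaped [out][d]."""
    return [sum(w * v for w, v in zip(row, vec)) for row in weights]


def condition(feature_map, text_emb, weights):
    """Shift each channel of a [channels][time] feature map by a value
    derived from the text embedding (ASSUMED additive conditioning)."""
    shift = linear(text_emb, weights)
    return [[x + s for x in channel] for channel, s in zip(feature_map, shift)]


def audio_block(feature_map):
    """Stand-in for a convolutional block: applies only a ReLU.
    A real model would convolve over the log-Mel spectrogram here."""
    return [[max(0.0, x) for x in channel] for channel in feature_map]


def forward(spec, text_emb, proj_weights, fuse_at):
    """Run one audio block per entry in proj_weights and condition on the
    text embedding after every block whose index is in fuse_at."""
    h = spec
    for i, weights in enumerate(proj_weights):
        h = audio_block(h)
        if i in fuse_at:
            h = condition(h, text_emb, weights)
    return h


# Toy inputs: 2 channels x 2 time steps, and a 2-dimensional "text embedding".
spec = [[1.0, -2.0], [0.5, 3.0]]
text_emb = [1.0, 2.0]
W = [[0.1, 0.2], [0.0, -0.5]]  # hypothetical projection, one row per channel

single_stage = forward(spec, text_emb, [W, W], fuse_at={1})    # fuse once
multistage = forward(spec, text_emb, [W, W], fuse_at={0, 1})   # fuse in every block
```

In this sketch, single-stage fusion injects the text embedding after one chosen block only, whereas multistage fusion injects it after every block, so later audio layers operate on representations that have already been conditioned on the text.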

Subjects:	Machine Learning (cs.LG)
Cite as:	arXiv:2110.06650 [cs.LG]
	(or arXiv:2110.06650v2 [cs.LG] for this version)
	https://doi.org/10.48550/arXiv.2110.06650
Journal reference:	Frontiers in Computer Science, Volume 5, 2023
Related DOI:	https://doi.org/10.3389/fcomp.2023.1072479

Submission history

From: Andreas Triantafyllopoulos
[v1] Wed, 13 Oct 2021 11:28:04 UTC (3,632 KB)
[v2] Wed, 4 May 2022 13:33:20 UTC (340 KB)
