[2302.02447] Cross-Modal Fusion Techniques for Utterance-Level Emotion Recognition from Text and Speech