[2310.14278] Conversational Speech Recognition by Learning Audio-textual Cross-modal Contextual Representation