[2402.16021] TMT: Tri-Modal Translation between Speech, Image, and Text by Processing Different Modalities as Different Languages