[2203.04114] A study on joint modeling and data augmentation of multi-modalities for audio-visual scene classification