[2310.03456] Multi-Resolution Audio-Visual Feature Fusion for Temporal Action Localization