[2406.03447] FILS: Self-Supervised Video Feature Prediction In Semantic Language Space