[2401.11740] Multi-level Cross-modal Alignment for Image Clustering