[2407.06438] A Single Transformer for Scalable Vision-Language Modeling