SEACrowd: A Multilingual Multimodal Data Hub and Benchmark Suite for Southeast Asian Languages
Authors:
Holy Lovenia,
Rahmad Mahendra,
Salsabil Maulana Akbar,
Lester James V. Miranda,
Jennifer Santoso,
Elyanah Aco,
Akhdan Fadhilah,
Jonibek Mansurov,
Joseph Marvin Imperial,
Onno P. Kampman,
Joel Ruben Antony Moniz,
Muhammad Ravi Shulthan Habibi,
Frederikus Hudi,
Railey Montalan,
Ryan Ignatius,
Joanito Agili Lopo,
William Nixon,
Börje F. Karlsson,
James Jaya,
Ryandito Diandaru,
Yuze Gao,
Patrick Amadeus,
Bin Wang,
Jan Christian Blaise Cruz,
Chenxi Whitehouse,
et al. (36 additional authors not shown)
Abstract:
Southeast Asia (SEA) is a region rich in linguistic diversity and cultural variety, with over 1,300 indigenous languages and a population of 671 million people. However, text, image, and audio datasets from SEA are significantly under-represented in prevailing AI models, which compromises the quality of these models for SEA languages. Evaluating models for SEA languages is challenging because high-quality datasets are scarce, a problem compounded by the dominance of English training data, which raises concerns about potential cultural misrepresentation. To address these challenges, we introduce SEACrowd, a collaborative initiative that consolidates a comprehensive resource hub, filling the resource gap by providing standardized corpora in nearly 1,000 SEA languages across three modalities. Through our SEACrowd benchmarks, we assess the quality of AI models on 36 indigenous languages across 13 tasks, offering valuable insights into the current AI landscape in SEA. Furthermore, we propose strategies to facilitate greater AI advancements, maximizing potential utility and resource equity for the future of AI in SEA.
Submitted 8 July, 2024; v1 submitted 14 June, 2024;
originally announced June 2024.
NusaCrowd: Open Source Initiative for Indonesian NLP Resources
Authors:
Samuel Cahyawijaya,
Holy Lovenia,
Alham Fikri Aji,
Genta Indra Winata,
Bryan Wilie,
Rahmad Mahendra,
Christian Wibisono,
Ade Romadhony,
Karissa Vincentio,
Fajri Koto,
Jennifer Santoso,
David Moeljadi,
Cahya Wirawan,
Frederikus Hudi,
Ivan Halim Parmonangan,
Ika Alfina,
Muhammad Satrio Wicaksono,
Ilham Firdausi Putra,
Samsul Rahmadani,
Yulianti Oenang,
Ali Akbar Septiandri,
James Jaya,
Kaustubh D. Dhole,
Arie Ardiyanti Suryani,
Rifki Afina Putri,
et al. (22 additional authors not shown)
Abstract:
We present NusaCrowd, a collaborative initiative to collect and unify existing resources for Indonesian languages, including opening access to previously non-public resources. Through this initiative, we have brought together 137 datasets and 118 standardized data loaders. The quality of the datasets has been assessed manually and automatically, and their value is demonstrated through multiple experiments. NusaCrowd's data collection enables the creation of the first zero-shot benchmarks for natural language understanding and generation in Indonesian and the local languages of Indonesia. Furthermore, NusaCrowd enables the creation of the first multilingual automatic speech recognition benchmark in Indonesian and the local languages of Indonesia. Our work strives to advance natural language processing (NLP) research for languages that are under-represented despite being widely spoken.
Submitted 21 July, 2023; v1 submitted 19 December, 2022;
originally announced December 2022.