Search | arXiv e-print repository

Showing 1–50 of 73 results for author: Winata, G I

Searching in archive cs.
  1. arXiv:2406.10118  [pdf, other]

    cs.CL

    SEACrowd: A Multilingual Multimodal Data Hub and Benchmark Suite for Southeast Asian Languages

    Authors: Holy Lovenia, Rahmad Mahendra, Salsabil Maulana Akbar, Lester James V. Miranda, Jennifer Santoso, Elyanah Aco, Akhdan Fadhilah, Jonibek Mansurov, Joseph Marvin Imperial, Onno P. Kampman, Joel Ruben Antony Moniz, Muhammad Ravi Shulthan Habibi, Frederikus Hudi, Railey Montalan, Ryan Ignatius, Joanito Agili Lopo, William Nixon, Börje F. Karlsson, James Jaya, Ryandito Diandaru, Yuze Gao, Patrick Amadeus, Bin Wang, Jan Christian Blaise Cruz, Chenxi Whitehouse, et al. (36 additional authors not shown)

    Abstract: Southeast Asia (SEA) is a region rich in linguistic diversity and cultural variety, with over 1,300 indigenous languages and a population of 671 million people. However, prevailing AI models suffer from a significant lack of representation of texts, images, and audio datasets from SEA, compromising the quality of AI models for SEA languages. Evaluating models for SEA languages is challenging due t…

    Submitted 8 July, 2024; v1 submitted 14 June, 2024; originally announced June 2024.

    Comments: https://github.com/SEACrowd

  2. arXiv:2406.09334  [pdf, other]

    cs.CL

    ProxyLM: Predicting Language Model Performance on Multilingual Tasks via Proxy Models

    Authors: David Anugraha, Genta Indra Winata, Chenyue Li, Patrick Amadeus Irawan, En-Shiun Annie Lee

    Abstract: Performance prediction is a method to estimate the performance of Language Models (LMs) on various Natural Language Processing (NLP) tasks, mitigating computational costs associated with model capacity and data for fine-tuning. Our paper introduces ProxyLM, a scalable framework for predicting LM performance using proxy models in multilingual tasks. These proxy models act as surrogates, approximati…

    Submitted 14 June, 2024; v1 submitted 13 June, 2024; originally announced June 2024.

    Comments: Preprint

  3. arXiv:2406.07424  [pdf, other]

    cs.CL

    MINERS: Multilingual Language Models as Semantic Retrievers

    Authors: Genta Indra Winata, Ruochen Zhang, David Ifeoluwa Adelani

    Abstract: Words have been represented in a high-dimensional vector space that encodes their semantic similarities, enabling downstream applications such as retrieving synonyms, antonyms, and relevant contexts. However, despite recent advances in multilingual language models (LMs), the effectiveness of these models' representations in semantic retrieval contexts has not been comprehensively explored. To fill…

    Submitted 19 June, 2024; v1 submitted 11 June, 2024; originally announced June 2024.

    Comments: Preprint

  4. arXiv:2405.14782  [pdf, other]

    cs.CL

    Lessons from the Trenches on Reproducible Evaluation of Language Models

    Authors: Stella Biderman, Hailey Schoelkopf, Lintang Sutawika, Leo Gao, Jonathan Tow, Baber Abbasi, Alham Fikri Aji, Pawan Sasanka Ammanamanchi, Sidney Black, Jordan Clive, Anthony DiPofi, Julen Etxaniz, Benjamin Fattori, Jessica Zosa Forde, Charles Foster, Jeffrey Hsu, Mimansa Jaiswal, Wilson Y. Lee, Haonan Li, Charles Lovering, Niklas Muennighoff, Ellie Pavlick, Jason Phang, Aviya Skowron, Samson Tan, et al. (5 additional authors not shown)

    Abstract: Effective evaluation of language models remains an open challenge in NLP. Researchers and engineers face methodological issues such as the sensitivity of models to evaluation setup, difficulty of proper comparisons across methods, and the lack of reproducibility and transparency. In this paper we draw on three years of experience in evaluating large language models to provide guidance and lessons…

    Submitted 29 May, 2024; v1 submitted 23 May, 2024; originally announced May 2024.

  5. arXiv:2404.06138  [pdf, other]

    cs.CL

    Cendol: Open Instruction-tuned Generative Large Language Models for Indonesian Languages

    Authors: Samuel Cahyawijaya, Holy Lovenia, Fajri Koto, Rifki Afina Putri, Emmanuel Dave, Jhonson Lee, Nuur Shadieq, Wawan Cenggoro, Salsabil Maulana Akbar, Muhammad Ihza Mahendra, Dea Annisayanti Putri, Bryan Wilie, Genta Indra Winata, Alham Fikri Aji, Ayu Purwarianti, Pascale Fung

    Abstract: Large language models (LLMs) show remarkable human-like capability in various domains and languages. However, a notable quality gap arises in low-resource languages, e.g., Indonesian indigenous languages, rendering them ineffective and inefficient in such linguistic contexts. To bridge this quality gap, we introduce Cendol, a collection of Indonesian LLMs encompassing both decoder-only and encoder…

    Submitted 7 July, 2024; v1 submitted 9 April, 2024; originally announced April 2024.

    Comments: Cendol models are released under Apache 2.0 license and will be made publicly available soon

  6. arXiv:2401.06034  [pdf, other]

    cs.CL

    LinguAlchemy: Fusing Typological and Geographical Elements for Unseen Language Generalization

    Authors: Muhammad Farid Adilazuarda, Samuel Cahyawijaya, Alham Fikri Aji, Genta Indra Winata, Ayu Purwarianti

    Abstract: Pretrained language models (PLMs) have become remarkably adept at task and language generalization. Nonetheless, they often fail when faced with unseen languages. In this work, we present LinguAlchemy, a regularization method that incorporates various linguistic information covering typological, geographical, and phylogenetic features to align PLMs representation to the corresponding linguistic in…

    Submitted 10 June, 2024; v1 submitted 11 January, 2024; originally announced January 2024.

  7. arXiv:2311.12405  [pdf, other]

    cs.CL

    IndoRobusta: Towards Robustness Against Diverse Code-Mixed Indonesian Local Languages

    Authors: Muhammad Farid Adilazuarda, Samuel Cahyawijaya, Genta Indra Winata, Pascale Fung, Ayu Purwarianti

    Abstract: Significant progress has been made on Indonesian NLP. Nevertheless, exploration of the code-mixing phenomenon in Indonesian is limited, despite many languages being frequently mixed with Indonesian in daily conversation. In this work, we explore code-mixing in Indonesian with four embedded languages, i.e., English, Sundanese, Javanese, and Malay; and introduce IndoRobusta, a framework to evaluate…

    Submitted 21 November, 2023; originally announced November 2023.

  8. arXiv:2311.00958  [pdf, other]

    cs.CL cs.AI

    IndoToD: A Multi-Domain Indonesian Benchmark For End-to-End Task-Oriented Dialogue Systems

    Authors: Muhammad Dehan Al Kautsar, Rahmah Khoirussyifa' Nurdini, Samuel Cahyawijaya, Genta Indra Winata, Ayu Purwarianti

    Abstract: Task-oriented dialogue (ToD) systems have been mostly created for high-resource languages, such as English and Chinese. However, there is a need to develop ToD systems for other regional or local languages to broaden their ability to comprehend the dialogue contexts in various languages. This paper introduces IndoToD, an end-to-end multi domain ToD benchmark in Indonesian. We extend two English To…

    Submitted 1 November, 2023; originally announced November 2023.

    Comments: 2023 1st Workshop in South East Asian Language Processing (SEALP), Co-located with AACL 2023

  9. arXiv:2309.10661  [pdf, other]

    cs.CL cs.AI

    NusaWrites: Constructing High-Quality Corpora for Underrepresented and Extremely Low-Resource Languages

    Authors: Samuel Cahyawijaya, Holy Lovenia, Fajri Koto, Dea Adhista, Emmanuel Dave, Sarah Oktavianti, Salsabil Maulana Akbar, Jhonson Lee, Nuur Shadieq, Tjeng Wawan Cenggoro, Hanung Wahyuning Linuwih, Bryan Wilie, Galih Pradipta Muridan, Genta Indra Winata, David Moeljadi, Alham Fikri Aji, Ayu Purwarianti, Pascale Fung

    Abstract: Democratizing access to natural language processing (NLP) technology is crucial, especially for underrepresented and extremely low-resource languages. Previous research has focused on developing labeled and unlabeled corpora for these languages through online scraping and document translation. While these methods have proven effective and cost-efficient, we have identified limitations in the resul…

    Submitted 19 September, 2023; v1 submitted 19 September, 2023; originally announced September 2023.

  10. arXiv:2306.10964  [pdf, other]

    cs.CL

    Multilingual Few-Shot Learning via Language Model Retrieval

    Authors: Genta Indra Winata, Liang-Kang Huang, Soumya Vadlamannati, Yash Chandarana

    Abstract: Transformer-based language models have achieved remarkable success in few-shot in-context learning and drawn a lot of research interest. However, these models' performance greatly depends on the choice of the example prompts and also has high variability depending on how samples are chosen. In this paper, we conduct a comprehensive study of retrieving semantically similar few-shot samples and usin…

    Submitted 19 June, 2023; originally announced June 2023.

    Comments: 9 pages

  11. arXiv:2306.02870  [pdf, ps, other]

    cs.CL

    On "Scientific Debt" in NLP: A Case for More Rigour in Language Model Pre-Training Research

    Authors: Made Nindyatama Nityasya, Haryo Akbarianto Wibowo, Alham Fikri Aji, Genta Indra Winata, Radityo Eko Prasojo, Phil Blunsom, Adhiguna Kuncoro

    Abstract: This evidence-based position paper critiques current research practices within the language model pre-training literature. Despite rapid recent progress afforded by increasingly better pre-trained language models (PLMs), current PLM research practices often conflate different possible sources of model improvement, without conducting proper ablation studies and principled comparisons between differ…

    Submitted 5 June, 2023; originally announced June 2023.

    Comments: Accepted at ACL 2023

  12. arXiv:2305.16252  [pdf, other]

    cs.CL

    Overcoming Catastrophic Forgetting in Massively Multilingual Continual Learning

    Authors: Genta Indra Winata, Lingjue Xie, Karthik Radhakrishnan, Shijie Wu, Xisen Jin, Pengxiang Cheng, Mayank Kulkarni, Daniel Preotiuc-Pietro

    Abstract: Real-life multilingual systems should be able to efficiently incorporate new languages as data distributions fed to the system evolve and shift over time. To do this, systems need to handle the issue of catastrophic forgetting, where the model performance drops for languages or tasks seen further in its past. In this paper, we study catastrophic forgetting, as well as methods to minimize this, in…

    Submitted 25 May, 2023; originally announced May 2023.

    Comments: ACL 2023 Findings

  13. arXiv:2305.16171  [pdf]

    cs.CL

    Multi-lingual and Multi-cultural Figurative Language Understanding

    Authors: Anubha Kabra, Emmy Liu, Simran Khanuja, Alham Fikri Aji, Genta Indra Winata, Samuel Cahyawijaya, Anuoluwapo Aremu, Perez Ogayo, Graham Neubig

    Abstract: Figurative language permeates human communication, but at the same time is relatively understudied in NLP. Datasets have been created in English to accelerate progress towards measuring and improving figurative language processing in language models (LMs). However, the use of figurative language is an expression of our cultural and societal experiences, making it difficult for these phrases to be…

    Submitted 25 May, 2023; originally announced May 2023.

    Comments: ACL 2023 Findings

  14. arXiv:2305.14716  [pdf, other]

    cs.CL

    GlobalBench: A Benchmark for Global Progress in Natural Language Processing

    Authors: Yueqi Song, Catherine Cui, Simran Khanuja, Pengfei Liu, Fahim Faisal, Alissa Ostapenko, Genta Indra Winata, Alham Fikri Aji, Samuel Cahyawijaya, Yulia Tsvetkov, Antonios Anastasopoulos, Graham Neubig

    Abstract: Despite the major advances in NLP, significant disparities in NLP system performance across languages still exist. Arguably, these are due to uneven resource allocation and sub-optimal incentives to work on less resourced languages. To track and further incentivize the global development of equitable language technology, we introduce GlobalBench. Prior multilingual benchmarks are static and have f…

    Submitted 24 May, 2023; originally announced May 2023.

    Comments: Preprint, 9 pages

  15. arXiv:2305.14235  [pdf, other]

    cs.CL cs.AI

    Multilingual Large Language Models Are Not (Yet) Code-Switchers

    Authors: Ruochen Zhang, Samuel Cahyawijaya, Jan Christian Blaise Cruz, Genta Indra Winata, Alham Fikri Aji

    Abstract: Multilingual Large Language Models (LLMs) have recently shown great capabilities in a wide range of tasks, exhibiting state-of-the-art performance through zero-shot or few-shot prompting methods. While there have been extensive studies on their abilities in monolingual tasks, the investigation of their potential in the context of code-switching (CSW), the practice of alternating languages within a…

    Submitted 23 October, 2023; v1 submitted 23 May, 2023; originally announced May 2023.

    Comments: Accepted at EMNLP 2023

  16. arXiv:2303.13592  [pdf, other]

    cs.CL cs.AI

    Prompting Multilingual Large Language Models to Generate Code-Mixed Texts: The Case of South East Asian Languages

    Authors: Zheng-Xin Yong, Ruochen Zhang, Jessica Zosa Forde, Skyler Wang, Arjun Subramonian, Holy Lovenia, Samuel Cahyawijaya, Genta Indra Winata, Lintang Sutawika, Jan Christian Blaise Cruz, Yin Lin Tan, Long Phan, Rowena Garcia, Thamar Solorio, Alham Fikri Aji

    Abstract: While code-mixing is a common linguistic practice in many parts of the world, collecting high-quality and low-cost code-mixed data remains a challenge for natural language processing (NLP) research. The recent proliferation of Large Language Models (LLMs) compels one to ask: how capable are these systems in generating code-mixed data? In this paper, we explore prompting multilingual LLMs in a zero…

    Submitted 12 September, 2023; v1 submitted 23 March, 2023; originally announced March 2023.

    Comments: Updating Authors

  17. arXiv:2212.09660  [pdf, other]

    cs.CL

    The Decades Progress on Code-Switching Research in NLP: A Systematic Survey on Trends and Challenges

    Authors: Genta Indra Winata, Alham Fikri Aji, Zheng-Xin Yong, Thamar Solorio

    Abstract: Code-Switching, a common phenomenon in written text and conversation, has been studied over decades by the natural language processing (NLP) research community. Initially, code-switching is intensively explored by leveraging linguistic theories and, currently, more machine-learning oriented approaches to develop models. We introduce a comprehensive systematic survey on code-switching research in n…

    Submitted 24 May, 2023; v1 submitted 19 December, 2022; originally announced December 2022.

    Comments: ACL 2023 Findings

  18. arXiv:2212.09648  [pdf, other]

    cs.CL cs.AI

    NusaCrowd: Open Source Initiative for Indonesian NLP Resources

    Authors: Samuel Cahyawijaya, Holy Lovenia, Alham Fikri Aji, Genta Indra Winata, Bryan Wilie, Rahmad Mahendra, Christian Wibisono, Ade Romadhony, Karissa Vincentio, Fajri Koto, Jennifer Santoso, David Moeljadi, Cahya Wirawan, Frederikus Hudi, Ivan Halim Parmonangan, Ika Alfina, Muhammad Satrio Wicaksono, Ilham Firdausi Putra, Samsul Rahmadani, Yulianti Oenang, Ali Akbar Septiandri, James Jaya, Kaustubh D. Dhole, Arie Ardiyanti Suryani, Rifki Afina Putri, et al. (22 additional authors not shown)

    Abstract: We present NusaCrowd, a collaborative initiative to collect and unify existing resources for Indonesian languages, including opening access to previously non-public resources. Through this initiative, we have brought together 137 datasets and 118 standardized data loaders. The quality of the datasets has been assessed manually and automatically, and their value is demonstrated through multiple exp…

    Submitted 21 July, 2023; v1 submitted 19 December, 2022; originally announced December 2022.

  19. arXiv:2212.09535  [pdf, other]

    cs.CL cs.AI cs.LG

    BLOOM+1: Adding Language Support to BLOOM for Zero-Shot Prompting

    Authors: Zheng-Xin Yong, Hailey Schoelkopf, Niklas Muennighoff, Alham Fikri Aji, David Ifeoluwa Adelani, Khalid Almubarak, M Saiful Bari, Lintang Sutawika, Jungo Kasai, Ahmed Baruwa, Genta Indra Winata, Stella Biderman, Edward Raff, Dragomir Radev, Vassilina Nikoulina

    Abstract: The BLOOM model is a large publicly available multilingual language model, but its pretraining was limited to 46 languages. To extend the benefits of BLOOM to other languages without incurring prohibitively large costs, it is desirable to adapt BLOOM to new languages not seen during pretraining. In this work, we apply existing language adaptation strategies to BLOOM and benchmark its zero-shot pro…

    Submitted 27 May, 2023; v1 submitted 19 December, 2022; originally announced December 2022.

    Comments: ACL 2023

  20. arXiv:2211.05100  [pdf, other]

    cs.CL

    BLOOM: A 176B-Parameter Open-Access Multilingual Language Model

    Authors: BigScience Workshop, :, Teven Le Scao, Angela Fan, Christopher Akiki, Ellie Pavlick, Suzana Ilić, Daniel Hesslow, Roman Castagné, Alexandra Sasha Luccioni, François Yvon, Matthias Gallé, Jonathan Tow, Alexander M. Rush, Stella Biderman, Albert Webson, Pawan Sasanka Ammanamanchi, Thomas Wang, Benoît Sagot, Niklas Muennighoff, Albert Villanova del Moral, Olatunji Ruwase, Rachel Bawden, Stas Bekman, Angelina McMillan-Major, et al. (369 additional authors not shown)

    Abstract: Large language models (LLMs) have been shown to be able to perform new tasks based on a few demonstrations or natural language instructions. While these capabilities have led to widespread adoption, most LLMs are developed by resource-rich organizations and are frequently kept from the public. As a step towards democratizing this powerful technology, we present BLOOM, a 176B-parameter open-access…

    Submitted 27 June, 2023; v1 submitted 9 November, 2022; originally announced November 2022.

  21. arXiv:2208.10893  [pdf]

    physics.ins-det cs.LG physics.data-an

    Transfer Learning Application of Self-supervised Learning in ARPES

    Authors: Sandy Adhitia Ekahana, Genta Indra Winata, Y. Soh, Gabriel Aeppli, Radovic Milan, Ming Shi

    Abstract: Recent development in angle-resolved photoemission spectroscopy (ARPES) technique involves spatially resolving samples while maintaining the high-resolution feature of momentum space. This development easily expands the data size and its complexity for data analysis, where one of it is to label similar dispersion cuts and map them spatially. In this work, we demonstrate that the recent development…

    Submitted 23 August, 2022; originally announced August 2022.

  22. arXiv:2207.10524  [pdf, other]

    cs.CL cs.AI

    NusaCrowd: A Call for Open and Reproducible NLP Research in Indonesian Languages

    Authors: Samuel Cahyawijaya, Alham Fikri Aji, Holy Lovenia, Genta Indra Winata, Bryan Wilie, Rahmad Mahendra, Fajri Koto, David Moeljadi, Karissa Vincentio, Ade Romadhony, Ayu Purwarianti

    Abstract: At the center of the underlying issues that halt Indonesian natural language processing (NLP) research advancement, we find data scarcity. Resources in Indonesian languages, especially the local ones, are extremely scarce and underrepresented. Many Indonesian researchers do not publish their dataset. Furthermore, the few public datasets that we have are scattered across different platforms, thus m…

    Submitted 1 August, 2022; v1 submitted 21 July, 2022; originally announced July 2022.

  23. arXiv:2206.11249  [pdf, other]

    cs.CL cs.AI cs.LG

    GEMv2: Multilingual NLG Benchmarking in a Single Line of Code

    Authors: Sebastian Gehrmann, Abhik Bhattacharjee, Abinaya Mahendiran, Alex Wang, Alexandros Papangelis, Aman Madaan, Angelina McMillan-Major, Anna Shvets, Ashish Upadhyay, Bingsheng Yao, Bryan Wilie, Chandra Bhagavatula, Chaobin You, Craig Thomson, Cristina Garbacea, Dakuo Wang, Daniel Deutsch, Deyi Xiong, Di Jin, Dimitra Gkatzia, Dragomir Radev, Elizabeth Clark, Esin Durmus, Faisal Ladhak, Filip Ginter, et al. (52 additional authors not shown)

    Abstract: Evaluation in machine learning is usually informed by past choices, for example which datasets or metrics to use. This standardization enables the comparison on equal footing using leaderboards, but the evaluation choices become sub-optimal as better alternatives arise. This problem is especially pertinent in natural language generation which requires ever-improving suites of datasets, metrics, an…

    Submitted 24 June, 2022; v1 submitted 22 June, 2022; originally announced June 2022.

  24. arXiv:2206.04615  [pdf, other]

    cs.CL cs.AI cs.CY cs.LG stat.ML

    Beyond the Imitation Game: Quantifying and extrapolating the capabilities of language models

    Authors: Aarohi Srivastava, Abhinav Rastogi, Abhishek Rao, Abu Awal Md Shoeb, Abubakar Abid, Adam Fisch, Adam R. Brown, Adam Santoro, Aditya Gupta, Adrià Garriga-Alonso, Agnieszka Kluska, Aitor Lewkowycz, Akshat Agarwal, Alethea Power, Alex Ray, Alex Warstadt, Alexander W. Kocurek, Ali Safaya, Ali Tazarv, Alice Xiang, Alicia Parrish, Allen Nie, Aman Hussain, Amanda Askell, Amanda Dsouza, et al. (426 additional authors not shown)

    Abstract: Language models demonstrate both quantitative improvement and new qualitative capabilities with increasing scale. Despite their potentially transformative impact, these new capabilities are as yet poorly characterized. In order to inform future research, prepare for disruptive new model capabilities, and ameliorate socially harmful effects, it is vital that we understand the present and near-futur…

    Submitted 12 June, 2023; v1 submitted 9 June, 2022; originally announced June 2022.

    Comments: 27 pages, 17 figures + references and appendices, repo: https://github.com/google/BIG-bench

    Journal ref: Transactions on Machine Learning Research, May/2022, https://openreview.net/forum?id=uyTL5Bvosj

  25. arXiv:2205.15960  [pdf, other]

    cs.CL

    NusaX: Multilingual Parallel Sentiment Dataset for 10 Indonesian Local Languages

    Authors: Genta Indra Winata, Alham Fikri Aji, Samuel Cahyawijaya, Rahmad Mahendra, Fajri Koto, Ade Romadhony, Kemal Kurniawan, David Moeljadi, Radityo Eko Prasojo, Pascale Fung, Timothy Baldwin, Jey Han Lau, Rico Sennrich, Sebastian Ruder

    Abstract: Natural language processing (NLP) has a significant impact on society via technologies such as machine translation and search engines. Despite its success, NLP technology is only widely available for high-resource languages such as English and Chinese, while it remains inaccessible to many languages due to the unavailability of data resources and benchmarks. In this work, we focus on developing re…

    Submitted 12 April, 2023; v1 submitted 31 May, 2022; originally announced May 2022.

    Comments: EACL 2023

  26. arXiv:2203.13357  [pdf, other]

    cs.CL

    One Country, 700+ Languages: NLP Challenges for Underrepresented Languages and Dialects in Indonesia

    Authors: Alham Fikri Aji, Genta Indra Winata, Fajri Koto, Samuel Cahyawijaya, Ade Romadhony, Rahmad Mahendra, Kemal Kurniawan, David Moeljadi, Radityo Eko Prasojo, Timothy Baldwin, Jey Han Lau, Sebastian Ruder

    Abstract: NLP research is impeded by a lack of resources and awareness of the challenges presented by underrepresented languages and dialects. Focusing on the languages spoken in Indonesia, the second most linguistically diverse and the fourth most populous nation of the world, we provide an overview of the current state of NLP research for Indonesia's 700+ languages. We highlight challenges in Indonesian N…

    Submitted 24 March, 2022; originally announced March 2022.

    Comments: Accepted in ACL 2022

  27. arXiv:2201.08687  [pdf, other]

    cs.CL

    A Comparative Study on Language Models for Task-Oriented Dialogue Systems

    Authors: Vinsen Marselino Andreas, Genta Indra Winata, Ayu Purwarianti

    Abstract: The recent development of language models has shown promising results by achieving state-of-the-art performance on various natural language tasks by fine-tuning pretrained models. In task-oriented dialogue (ToD) systems, language models can be used for end-to-end training without relying on dialogue state tracking to track the dialogue history but allowing the language models to generate responses…

    Submitted 21 January, 2022; originally announced January 2022.

    Comments: 5 pages, 1 figure

    ACM Class: I.2.7

    Journal ref: 2021 8th International Conference on Advanced Informatics: Concepts, Theory and Applications (ICAICTA) (pp. 1-5). IEEE

  28. arXiv:2201.03804  [pdf, other]

    cs.CL cs.AI

    CI-AVSR: A Cantonese Audio-Visual Speech Dataset for In-car Command Recognition

    Authors: Wenliang Dai, Samuel Cahyawijaya, Tiezheng Yu, Elham J. Barezi, Peng Xu, Cheuk Tung Shadow Yiu, Rita Frieske, Holy Lovenia, Genta Indra Winata, Qifeng Chen, Xiaojuan Ma, Bertram E. Shi, Pascale Fung

    Abstract: With the rise of deep learning and intelligent vehicle, the smart assistant has become an essential in-car component to facilitate driving and provide extra functionalities. In-car smart assistants should be able to process general as well as car-related commands and perform corresponding actions, which eases driving and improves safety. However, there is a data scarcity issue for low resource lan…

    Submitted 14 March, 2022; v1 submitted 11 January, 2022; originally announced January 2022.

    Comments: 6 pages

  29. arXiv:2112.06223  [pdf, other]

    cs.CL

    ASCEND: A Spontaneous Chinese-English Dataset for Code-switching in Multi-turn Conversation

    Authors: Holy Lovenia, Samuel Cahyawijaya, Genta Indra Winata, Peng Xu, Xu Yan, Zihan Liu, Rita Frieske, Tiezheng Yu, Wenliang Dai, Elham J. Barezi, Qifeng Chen, Xiaojuan Ma, Bertram E. Shi, Pascale Fung

    Abstract: Code-switching is a speech phenomenon occurring when a speaker switches language during a conversation. Despite the spontaneous nature of code-switching in conversational spoken language, most existing works collect code-switching data from read speech instead of spontaneous speech. ASCEND (A Spontaneous Chinese-English Dataset) is a high-quality Mandarin Chinese-English code-switching corpus buil…

    Submitted 3 May, 2022; v1 submitted 12 December, 2021; originally announced December 2021.

    Journal ref: Proceedings of the 13th Conference on Language Resources and Evaluation (LREC 2022)

  30. arXiv:2112.02721  [pdf, other]

    cs.CL cs.AI cs.LG

    NL-Augmenter: A Framework for Task-Sensitive Natural Language Augmentation

    Authors: Kaustubh D. Dhole, Varun Gangal, Sebastian Gehrmann, Aadesh Gupta, Zhenhao Li, Saad Mahamood, Abinaya Mahendiran, Simon Mille, Ashish Shrivastava, Samson Tan, Tongshuang Wu, Jascha Sohl-Dickstein, Jinho D. Choi, Eduard Hovy, Ondrej Dusek, Sebastian Ruder, Sajant Anand, Nagender Aneja, Rabin Banjade, Lisa Barthe, Hanna Behnke, Ian Berlot-Attwell, Connor Boyle, Caroline Brun, Marco Antonio Sobrevilla Cabezudo, et al. (101 additional authors not shown)

    Abstract: Data augmentation is an important component in the robustness evaluation of models in natural language processing (NLP) and in enhancing the diversity of the data they are trained on. In this paper, we present NL-Augmenter, a new participatory Python-based natural language augmentation framework which supports the creation of both transformations (modifications to the data) and filters (data split…

    Submitted 11 October, 2022; v1 submitted 5 December, 2021; originally announced December 2021.

    Comments: 39 pages, repository at https://github.com/GEM-benchmark/NL-Augmenter

  31. arXiv:2110.08118  [pdf, other]

    cs.CL cs.AI

    Few-Shot Bot: Prompt-Based Learning for Dialogue Systems

    Authors: Andrea Madotto, Zhaojiang Lin, Genta Indra Winata, Pascale Fung

    Abstract: Learning to converse using only a few examples is a great challenge in conversational AI. The current best conversational models, which are either good chit-chatters (e.g., BlenderBot) or goal-oriented systems (e.g., MinTL), are language models (LMs) fine-tuned on large conversational datasets. Training these models is expensive, both in terms of computational resources and time, and it is hard to…

    Submitted 15 October, 2021; originally announced October 2021.

  32. arXiv:2109.07684  [pdf, other]

    cs.CL cs.AI

    Language Models are Few-shot Multilingual Learners

    Authors: Genta Indra Winata, Andrea Madotto, Zhaojiang Lin, Rosanne Liu, Jason Yosinski, Pascale Fung

    Abstract: General-purpose language models have demonstrated impressive capabilities, performing on par with state-of-the-art approaches on a range of downstream natural language processing (NLP) tasks and benchmarks when inferring instructions from very few examples. Here, we evaluate the multilingual skills of the GPT and T5 models in conducting multi-class classification on non-English languages without a…

    Submitted 15 September, 2021; originally announced September 2021.

    Comments: 14 pages

  33. arXiv:2109.06762  [pdf, other]

    cs.LG cs.AI

    Greenformer: Factorization Toolkit for Efficient Deep Neural Networks

    Authors: Samuel Cahyawijaya, Genta Indra Winata, Holy Lovenia, Bryan Wilie, Wenliang Dai, Etsuko Ishii, Pascale Fung

    Abstract: While the recent advances in deep neural networks (DNN) bring remarkable success, the computational cost also increases considerably. In this paper, we introduce Greenformer, a toolkit to accelerate the computation of neural networks through matrix factorization while maintaining performance. Greenformer can be easily applied with a single line of code to any DNN model. Our experimental results sh…

    Submitted 9 October, 2021; v1 submitted 14 September, 2021; originally announced September 2021.

  34. arXiv:2106.03777  [pdf, other]

    cs.CL cs.AI

    X2Parser: Cross-Lingual and Cross-Domain Framework for Task-Oriented Compositional Semantic Parsing

    Authors: Zihan Liu, Genta Indra Winata, Peng Xu, Pascale Fung

    Abstract: Task-oriented compositional semantic parsing (TCSP) handles complex nested user queries and serves as an essential component of virtual assistants. Current TCSP models rely on numerous training data to achieve decent performance but fail to generalize to low-resource target languages or domains. In this paper, we present X2Parser, a transferable Cross-lingual and Cross-domain Parser for TCSP. Unli…

    Submitted 7 June, 2021; originally announced June 2021.

    Comments: Accepted in RepL4NLP 2021

  35. arXiv:2106.03530  [pdf, ps, other]

    cs.CL cs.AI

    CAiRE in DialDoc21: Data Augmentation for Information-Seeking Dialogue System

    Authors: Etsuko Ishii, Yan Xu, Genta Indra Winata, Zhaojiang Lin, Andrea Madotto, Zihan Liu, Peng Xu, Pascale Fung

    Abstract: Information-seeking dialogue systems, including knowledge identification and response generation, aim to respond to users with fluent, coherent, and informative responses based on users' needs. To tackle this challenge, we utilize data augmentation methods and several training techniques with the pre-trained language models to learn a general pattern of the task and thus achieve promising p…

    Submitted 7 June, 2021; v1 submitted 7 June, 2021; originally announced June 2021.

    Comments: Accepted in DialDoc21 Workshop in ACL 2021. Etsuko Ishii and Yan Xu contributed equally to this work

  36. arXiv:2106.02787  [pdf, other]

    cs.CL

    BiToD: A Bilingual Multi-Domain Dataset For Task-Oriented Dialogue Modeling

    Authors: Zhaojiang Lin, Andrea Madotto, Genta Indra Winata, Peng Xu, Feijun Jiang, Yuxiang Hu, Chen Shi, Pascale Fung

    Abstract: Task-oriented dialogue (ToD) benchmarks provide an important avenue to measure progress and develop better conversational agents. However, existing datasets for end-to-end ToD modeling are limited to a single language, hindering the development of robust end-to-end ToD systems for multilingual countries and regions. Here we introduce BiToD, the first bilingual multi-domain dataset for end-to-end t…

    Submitted 4 June, 2021; originally announced June 2021.

    Comments: 22 pages

  37. arXiv:2106.02325  [pdf, other]

    cs.CL cs.HC

    ERICA: An Empathetic Android Companion for Covid-19 Quarantine

    Authors: Etsuko Ishii, Genta Indra Winata, Samuel Cahyawijaya, Divesh Lala, Tatsuya Kawahara, Pascale Fung

    Abstract: Over the past year, research in various domains, including Natural Language Processing (NLP), has been accelerated to fight against the COVID-19 pandemic, yet such research has just started on dialogue systems. In this paper, we introduce an end-to-end dialogue system which aims to ease the isolation of people under self-quarantine. We conduct a control simulation experiment to assess the effects…

    Submitted 4 June, 2021; originally announced June 2021.

    Comments: Accepted in SIGDIAL 2021

  38. arXiv:2106.00410  [pdf, other]

    cs.CL cs.HC cs.SD eess.AS

    Nora: The Well-Being Coach

    Authors: Genta Indra Winata, Holy Lovenia, Etsuko Ishii, Farhad Bin Siddique, Yongsheng Yang, Pascale Fung

    Abstract: The current pandemic has forced people globally to remain in isolation and practice social distancing, which creates the need for a system to combat the resulting loneliness and negative emotions. In this paper we propose Nora, a virtual coaching platform designed to utilize natural language understanding in its dialogue system and suggest other recommendations based on user interactions. It is in…

    Submitted 1 June, 2021; originally announced June 2021.

    Comments: 7 pages

  39. arXiv:2105.06232  [pdf, other]

    cs.CL cs.AI

    Retrieval-Free Knowledge-Grounded Dialogue Response Generation with Adapters

    Authors: Yan Xu, Etsuko Ishii, Samuel Cahyawijaya, Zihan Liu, Genta Indra Winata, Andrea Madotto, Dan Su, Pascale Fung

    Abstract: To diversify and enrich generated dialogue responses, knowledge-grounded dialogue has been investigated in recent years. The existing methods tackle the knowledge grounding challenge by retrieving the relevant sentences over a large corpus and augmenting the dialogues with explicit extra information. Despite their success, however, the existing works have drawbacks in inference efficiency. This pa…

    Submitted 25 April, 2022; v1 submitted 13 May, 2021; originally announced May 2021.

    Comments: The first two authors contribute equally; Accepted in ACL 2022 DialDoc Workshop (Best Student Paper Award)

  40. arXiv:2105.03953  [pdf, other]

    cs.CL cs.AI

    Continual Mixed-Language Pre-Training for Extremely Low-Resource Neural Machine Translation

    Authors: Zihan Liu, Genta Indra Winata, Pascale Fung

    Abstract: The data scarcity in low-resource languages has become a bottleneck to building robust neural machine translation systems. Fine-tuning a multilingual pre-trained model (e.g., mBART (Liu et al., 2020)) on the translation task is a good approach for low-resource languages; however, its performance will be greatly limited when there are unseen languages in the translation pairs. In this paper, we pre…

    Submitted 9 May, 2021; originally announced May 2021.

    Comments: Accepted in Findings of ACL 2021

  41. arXiv:2104.08200  [pdf, other]

    cs.CL

    IndoNLG: Benchmark and Resources for Evaluating Indonesian Natural Language Generation

    Authors: Samuel Cahyawijaya, Genta Indra Winata, Bryan Wilie, Karissa Vincentio, Xiaohong Li, Adhiguna Kuncoro, Sebastian Ruder, Zhi Yuan Lim, Syafri Bahar, Masayu Leylia Khodra, Ayu Purwarianti, Pascale Fung

    Abstract: Natural language generation (NLG) benchmarks provide an important avenue to measure progress and develop better NLG systems. Unfortunately, the lack of publicly available NLG benchmarks for low-resource languages poses a challenging barrier for building NLG systems that work well for languages with limited amounts of data. Here we introduce IndoNLG, the first benchmark to measure natural language…

    Submitted 9 October, 2021; v1 submitted 16 April, 2021; originally announced April 2021.

    Comments: Accepted in EMNLP 2021, 10 pages

  42. arXiv:2104.06268  [pdf, other]

    cs.CL cs.LG eess.AS

    Multilingual Transfer Learning for Code-Switched Language and Speech Neural Modeling

    Authors: Genta Indra Winata

    Abstract: In this thesis, we address the data scarcity and limitations of linguistic theory by proposing language-agnostic multi-task training methods. First, we introduce a meta-learning-based approach, meta-transfer learning, in which information is judiciously extracted from high-resource monolingual speech data to the code-switching domain. The meta-transfer learning quickly adapts the model to the code…

    Submitted 13 April, 2021; originally announced April 2021.

    Comments: HKUST PhD Thesis. 120 pages

  43. arXiv:2103.13309  [pdf, other]

    cs.CL cs.LG

    Are Multilingual Models Effective in Code-Switching?

    Authors: Genta Indra Winata, Samuel Cahyawijaya, Zihan Liu, Zhaojiang Lin, Andrea Madotto, Pascale Fung

    Abstract: Multilingual language models have shown decent performance in multilingual and cross-lingual natural language understanding tasks. However, the power of these multilingual models in code-switching tasks has not been fully explored. In this paper, we study the effectiveness of multilingual language models to understand their capability and adaptability to the mixed-language setting by considering t…

    Submitted 24 March, 2021; originally announced March 2021.

  44. arXiv:2012.01687  [pdf, other]

    cs.CL cs.AI cs.LG

    Adapt-and-Adjust: Overcoming the Long-Tail Problem of Multilingual Speech Recognition

    Authors: Genta Indra Winata, Guangsen Wang, Caiming Xiong, Steven Hoi

    Abstract: One crucial challenge of real-world multilingual speech recognition is the long-tailed distribution problem, where some resource-rich languages like English have abundant training data, but a long tail of low-resource languages have varying amounts of limited training data. To overcome the long-tail problem, in this paper, we propose Adapt-and-Adjust (A2), a transformer-based multi-task learning f…

    Submitted 2 December, 2020; originally announced December 2020.

  45. arXiv:2009.14510  [pdf, other]

    cs.CL cs.LG

    Cross-lingual Spoken Language Understanding with Regularized Representation Alignment

    Authors: Zihan Liu, Genta Indra Winata, Peng Xu, Zhaojiang Lin, Pascale Fung

    Abstract: Despite the promising results of current cross-lingual models for spoken language understanding systems, they still suffer from imperfect cross-lingual representation alignments between the source and target languages, which makes the performance sub-optimal. To cope with this issue, we propose a regularization approach to further align word-level and sentence-level representations across language…

    Submitted 30 September, 2020; originally announced September 2020.

    Comments: EMNLP-2020 Long Paper

  46. arXiv:2009.13656  [pdf, other]

    cs.CL cs.AI

    Learning Knowledge Bases with Parameters for Task-Oriented Dialogue Systems

    Authors: Andrea Madotto, Samuel Cahyawijaya, Genta Indra Winata, Yan Xu, Zihan Liu, Zhaojiang Lin, Pascale Fung

    Abstract: Task-oriented dialogue systems are either modularized with separate dialogue state tracking (DST) and management steps or end-to-end trainable. In either case, the knowledge base (KB) plays an essential role in fulfilling user requests. Modularized systems rely on DST to interact with the KB, which is expensive in terms of annotation and inference time. End-to-end systems use the KB directly as in…

    Submitted 28 September, 2020; originally announced September 2020.

    Comments: Accepted EMNLP findings

  47. arXiv:2009.12005  [pdf, other]

    cs.CL cs.AI

    MinTL: Minimalist Transfer Learning for Task-Oriented Dialogue Systems

    Authors: Zhaojiang Lin, Andrea Madotto, Genta Indra Winata, Pascale Fung

    Abstract: In this paper, we propose Minimalist Transfer Learning (MinTL) to simplify the system design process of task-oriented dialogue systems and alleviate the over-dependency on annotated data. MinTL is a simple yet effective transfer learning framework, which allows us to plug-and-play pre-trained seq2seq models, and jointly learn dialogue state tracking and dialogue response generation. Unlike previou…

    Submitted 28 September, 2020; v1 submitted 24 September, 2020; originally announced September 2020.

    Comments: EMNLP 2020 camera ready

  48. arXiv:2009.05387  [pdf, other]

    cs.CL

    IndoNLU: Benchmark and Resources for Evaluating Indonesian Natural Language Understanding

    Authors: Bryan Wilie, Karissa Vincentio, Genta Indra Winata, Samuel Cahyawijaya, Xiaohong Li, Zhi Yuan Lim, Sidik Soleman, Rahmad Mahendra, Pascale Fung, Syafri Bahar, Ayu Purwarianti

    Abstract: Although Indonesian is known to be the fourth most frequently used language on the internet, research progress on this language in natural language processing (NLP) is slow due to a lack of available resources. In response, we introduce the first-ever vast resource for training, evaluating, and benchmarking on Indonesian natural language understanding (IndoNLU) tasks. IndoNLU…

    Submitted 8 October, 2020; v1 submitted 11 September, 2020; originally announced September 2020.

    Comments: This paper will be presented in AACL-IJCNLP 2020 (with new results and acknowledgment)

  49. arXiv:2008.09378  [pdf, other]

    cs.CL cs.LG

    EmoGraph: Capturing Emotion Correlations using Graph Networks

    Authors: Peng Xu, Zihan Liu, Genta Indra Winata, Zhaojiang Lin, Pascale Fung

    Abstract: Most emotion recognition methods tackle the emotion understanding task by considering each emotion independently while ignoring their fuzzy nature and the interconnections among them. In this paper, we explore how emotion correlations can be captured and used to help different classification tasks. We propose EmoGraph that captures the dependencies among different emotions through graph networks…

    Submitted 21 August, 2020; originally announced August 2020.

    Comments: The first two authors contributed equally

  50. arXiv:2004.14228  [pdf, other]

    cs.CL cs.SD eess.AS

    Meta-Transfer Learning for Code-Switched Speech Recognition

    Authors: Genta Indra Winata, Samuel Cahyawijaya, Zhaojiang Lin, Zihan Liu, Peng Xu, Pascale Fung

    Abstract: An increasing number of people in the world today speak a mixed language as a result of being multilingual. However, building a speech recognition system for code-switching remains difficult due to the availability of limited resources and the expense and significant effort required to collect mixed-language data. We therefore propose a new learning method, meta-transfer learning, to transfer lear…

    Submitted 29 April, 2020; originally announced April 2020.

    Comments: Accepted in ACL 2020. The first two authors contributed equally to this work