(Translated by https://www.hiragana.jp/)
Search | arXiv e-print repository
Skip to main content

Showing 1–50 of 300 results for author: Neubig, G

.
  1. arXiv:2407.16741  [pdf, other

    cs.SE cs.AI cs.CL

    OpenDevin: An Open Platform for AI Software Developers as Generalist Agents

    Authors: Xingyao Wang, Boxuan Li, Yufan Song, Frank F. Xu, Xiangru Tang, Mingchen Zhuge, Jiayi Pan, Yueqi Song, Bowen Li, Jaskirat Singh, Hoang H. Tran, Fuqiang Li, Ren Ma, Mingzhang Zheng, Bill Qian, Yanjun Shao, Niklas Muennighoff, Yizhe Zhang, Binyuan Hui, Junyang Lin, Robert Brennan, Hao Peng, Heng Ji, Graham Neubig

    Abstract: Software is one of the most powerful tools that we humans have at our disposal; it allows a skilled programmer to interact with the world in complex and profound ways. At the same time, thanks to improvements in large language models (LLMs), there has also been a rapid development in AI agents that interact with and affect change in their surrounding environments. In this paper, we introduce OpenD… ▽ More

    Submitted 23 July, 2024; originally announced July 2024.

    Comments: Code: https://github.com/OpenDevin/OpenDevin

  2. arXiv:2407.12874  [pdf, other

    cs.CL cs.AI

    SELF-GUIDE: Better Task-Specific Instruction Following via Self-Synthetic Finetuning

    Authors: Chenyang Zhao, Xueying Jia, Vijay Viswanathan, Tongshuang Wu, Graham Neubig

    Abstract: Large language models (LLMs) hold the promise of solving diverse tasks when provided with appropriate natural language prompts. However, prompting often leads models to make predictions with lower accuracy compared to finetuning a model with ample training data. On the other hand, while finetuning LLMs on task-specific data generally improves their performance, abundant annotated datasets are not… ▽ More

    Submitted 16 July, 2024; originally announced July 2024.

    Comments: Accepted by COLM 2024

  3. arXiv:2407.06304  [pdf, other

    cs.CV cs.AI cs.CL

    VIMI: Grounding Video Generation through Multi-modal Instruction

    Authors: Yuwei Fang, Willi Menapace, Aliaksandr Siarohin, Tsai-Shien Chen, Kuan-Chien Wang, Ivan Skorokhodov, Graham Neubig, Sergey Tulyakov

    Abstract: Existing text-to-video diffusion models rely solely on text-only encoders for their pretraining. This limitation stems from the absence of large-scale multimodal prompt video datasets, resulting in a lack of visual grounding and restricting their versatility and application in multimodal integration. To address this, we construct a large-scale multimodal prompt dataset by employing retrieval metho… ▽ More

    Submitted 8 July, 2024; originally announced July 2024.

  4. arXiv:2407.05463  [pdf, other

    cs.CL

    Training Task Experts through Retrieval Based Distillation

    Authors: Jiaxin Ge, Xueying Jia, Vijay Viswanathan, Hongyin Luo, Graham Neubig

    Abstract: One of the most reliable ways to create deployable models for specialized tasks is to obtain an adequate amount of high-quality task-specific data. However, for specialized tasks, often such datasets do not exist. Existing methods address this by creating such data from large language models (LLMs) and then distilling such knowledge into smaller models. However, these methods are limited by the qu… ▽ More

    Submitted 7 July, 2024; originally announced July 2024.

  5. arXiv:2407.02233  [pdf, other

    cs.CL cs.AI cs.LG

    Synthetic Multimodal Question Generation

    Authors: Ian Wu, Sravan Jayanthi, Vijay Viswanathan, Simon Rosenberg, Sina Pakazad, Tongshuang Wu, Graham Neubig

    Abstract: Multimodal Retrieval Augmented Generation (MMRAG) is a powerful approach to question-answering over multimodal documents. A key challenge with evaluating MMRAG is the paucity of high-quality datasets matching the question styles and modalities of interest. In light of this, we propose SMMQG, a synthetic data generation framework. SMMQG leverages interplay between a retriever, large language model… ▽ More

    Submitted 2 July, 2024; originally announced July 2024.

    Comments: Submitted to ARR June 2024

  6. arXiv:2406.16838  [pdf, other

    cs.CL cs.LG

    From Decoding to Meta-Generation: Inference-time Algorithms for Large Language Models

    Authors: Sean Welleck, Amanda Bertsch, Matthew Finlayson, Hailey Schoelkopf, Alex Xie, Graham Neubig, Ilia Kulikov, Zaid Harchaoui

    Abstract: One of the most striking findings in modern research on large language models (LLMs) is that scaling up compute during training leads to better results. However, less attention has been given to the benefits of scaling compute during inference. This survey focuses on these inference-time approaches. We explore three areas under a unified mathematical formalism: token-level generation algorithms, m… ▽ More

    Submitted 24 June, 2024; originally announced June 2024.

  7. arXiv:2406.14497  [pdf, other

    cs.SE cs.CL

    CodeRAG-Bench: Can Retrieval Augment Code Generation?

    Authors: Zora Zhiruo Wang, Akari Asai, Xinyan Velocity Yu, Frank F. Xu, Yiqing Xie, Graham Neubig, Daniel Fried

    Abstract: While language models (LMs) have proven remarkably adept at generating code, many programs are challenging for LMs to generate using their parametric knowledge alone. Providing external contexts such as library documentation can facilitate generating accurate and functional code. Despite the success of retrieval-augmented generation (RAG) in various text-oriented tasks, its potential for improving… ▽ More

    Submitted 20 June, 2024; originally announced June 2024.

  8. arXiv:2406.13743  [pdf, other

    cs.CV cs.AI cs.CL cs.LG cs.MM

    GenAI-Bench: Evaluating and Improving Compositional Text-to-Visual Generation

    Authors: Baiqi Li, Zhiqiu Lin, Deepak Pathak, Jiayao Li, Yixin Fei, Kewen Wu, Tiffany Ling, Xide Xia, Pengchuan Zhang, Graham Neubig, Deva Ramanan

    Abstract: While text-to-visual models now produce photo-realistic images and videos, they struggle with compositional text prompts involving attributes, relationships, and higher-order reasoning such as logic and comparison. In this work, we conduct an extensive human study on GenAI-Bench to evaluate the performance of leading image and video generation models in various aspects of compositional text-to-vis… ▽ More

    Submitted 21 June, 2024; v1 submitted 19 June, 2024; originally announced June 2024.

    Comments: We open-source our dataset, model, and code at: https://linzhiqiu.github.io/papers/genai_bench ; Project page: https://linzhiqiu.github.io/papers/genai_bench ; GenAI-Bench was first introduced in arxiv:2404.01291. This article extends it with an additional GenAI-Rank benchmark.

  9. arXiv:2406.11830  [pdf, other

    cs.CL cs.AI

    Language Modeling with Editable External Knowledge

    Authors: Belinda Z. Li, Emmy Liu, Alexis Ross, Abbas Zeitoun, Graham Neubig, Jacob Andreas

    Abstract: When the world changes, so does the text that humans write about it. How do we build language models that can be easily updated to reflect these changes? One popular approach is retrieval-augmented generation, in which new documents are inserted into a knowledge base and retrieved during prediction for downstream tasks. Most prior work on these systems have focused on improving behavior during pre… ▽ More

    Submitted 17 June, 2024; originally announced June 2024.

  10. arXiv:2406.06565  [pdf, other

    cs.CL cs.AI cs.LG

    MixEval: Deriving Wisdom of the Crowd from LLM Benchmark Mixtures

    Authors: Jinjie Ni, Fuzhao Xue, Xiang Yue, Yuntian Deng, Mahir Shah, Kabir Jain, Graham Neubig, Yang You

    Abstract: Evaluating large language models (LLMs) is challenging. Traditional ground-truth-based benchmarks fail to capture the comprehensiveness and nuance of real-world queries, while LLM-as-judge benchmarks suffer from grading biases and limited query quantity. Both of them may also become contaminated over time. User-facing evaluation, such as Chatbot Arena, provides reliable signals but is costly and s… ▽ More

    Submitted 3 June, 2024; originally announced June 2024.

  11. arXiv:2406.05761  [pdf, other

    cs.CL

    The BiGGen Bench: A Principled Benchmark for Fine-grained Evaluation of Language Models with Language Models

    Authors: Seungone Kim, Juyoung Suk, Ji Yong Cho, Shayne Longpre, Chaeeun Kim, Dongkeun Yoon, Guijin Son, Yejin Cho, Sheikh Shafayat, Jinheon Baek, Sue Hyun Park, Hyeonbin Hwang, Jinkyung Jo, Hyowon Cho, Haebin Shin, Seongyun Lee, Hanseok Oh, Noah Lee, Namgyu Ho, Se June Joo, Miyoung Ko, Yoonjoo Lee, Hyungjoo Chae, Jamin Shin, Joel Jang , et al. (7 additional authors not shown)

    Abstract: As language models (LMs) become capable of handling a wide range of tasks, their evaluation is becoming as challenging as their development. Most generation benchmarks currently assess LMs using abstract evaluation criteria like helpfulness and harmlessness, which often lack the flexibility and granularity of human assessment. Additionally, these benchmarks tend to focus disproportionately on spec… ▽ More

    Submitted 9 June, 2024; originally announced June 2024.

    Comments: Work in Progress

  12. arXiv:2405.01535  [pdf, other

    cs.CL

    Prometheus 2: An Open Source Language Model Specialized in Evaluating Other Language Models

    Authors: Seungone Kim, Juyoung Suk, Shayne Longpre, Bill Yuchen Lin, Jamin Shin, Sean Welleck, Graham Neubig, Moontae Lee, Kyungjae Lee, Minjoon Seo

    Abstract: Proprietary LMs such as GPT-4 are often employed to assess the quality of responses from various LMs. However, concerns including transparency, controllability, and affordability strongly motivate the development of open-source LMs specialized in evaluations. On the other hand, existing open evaluator LMs exhibit critical shortcomings: 1) they issue scores that significantly diverge from those ass… ▽ More

    Submitted 2 May, 2024; originally announced May 2024.

    Comments: Work in Progress

  13. arXiv:2405.00200  [pdf, other

    cs.CL

    In-Context Learning with Long-Context Models: An In-Depth Exploration

    Authors: Amanda Bertsch, Maor Ivgi, Uri Alon, Jonathan Berant, Matthew R. Gormley, Graham Neubig

    Abstract: As model context lengths continue to increase, the number of demonstrations that can be provided in-context approaches the size of entire training datasets. We study the behavior of in-context learning (ICL) at this extreme scale on multiple datasets and models. We show that, for many datasets with large label spaces, performance continues to increase with hundreds or thousands of demonstrations.… ▽ More

    Submitted 30 April, 2024; originally announced May 2024.

    Comments: 27 pages; preprint

  14. arXiv:2404.14361  [pdf, other

    cs.CL

    Better Synthetic Data by Retrieving and Transforming Existing Datasets

    Authors: Saumya Gandhi, Ritu Gala, Vijay Viswanathan, Tongshuang Wu, Graham Neubig

    Abstract: Despite recent advances in large language models, building dependable and deployable NLP models typically requires abundant, high-quality training data. However, task-specific data is not available for many use cases, and manually curating task-specific data is labor-intensive. Recent work has studied prompt-driven synthetic data generation using large language models, but these generated datasets… ▽ More

    Submitted 26 April, 2024; v1 submitted 22 April, 2024; originally announced April 2024.

    Comments: PDF fixed in v3

  15. arXiv:2404.05955  [pdf, other

    cs.CL cs.AI

    VisualWebBench: How Far Have Multimodal LLMs Evolved in Web Page Understanding and Grounding?

    Authors: Junpeng Liu, Yifan Song, Bill Yuchen Lin, Wai Lam, Graham Neubig, Yuanzhi Li, Xiang Yue

    Abstract: Multimodal Large Language models (MLLMs) have shown promise in web-related tasks, but evaluating their performance in the web domain remains a challenge due to the lack of comprehensive benchmarks. Existing benchmarks are either designed for general multimodal tasks, failing to capture the unique characteristics of web pages, or focus on end-to-end web agent tasks, unable to measure fine-grained a… ▽ More

    Submitted 8 April, 2024; originally announced April 2024.

  16. arXiv:2404.03028  [pdf, other

    cs.CL

    An Incomplete Loop: Deductive, Inductive, and Abductive Learning in Large Language Models

    Authors: Emmy Liu, Graham Neubig, Jacob Andreas

    Abstract: Modern language models (LMs) can learn to perform new tasks in different ways: in instruction following, the target task is described explicitly in natural language; in few-shot prompting, the task is specified implicitly with a small number of examples; in instruction inference, LMs are presented with in-context examples and are then prompted to generate a natural language task description before… ▽ More

    Submitted 10 April, 2024; v1 submitted 3 April, 2024; originally announced April 2024.

  17. arXiv:2404.02408  [pdf, other

    cs.CL

    CMULAB: An Open-Source Framework for Training and Deployment of Natural Language Processing Models

    Authors: Zaid Sheikh, Antonios Anastasopoulos, Shruti Rijhwani, Lindia Tjuatja, Robbie Jimerson, Graham Neubig

    Abstract: Effectively using Natural Language Processing (NLP) tools in under-resourced languages requires a thorough understanding of the language itself, familiarity with the latest models and training methodologies, and technical expertise to deploy these models. This could present a significant obstacle for language community members and linguists to use NLP tools. This paper introduces the CMU Linguisti… ▽ More

    Submitted 2 April, 2024; originally announced April 2024.

    Comments: Live demo at https://cmulab.dev

  18. arXiv:2404.01291  [pdf, other

    cs.CV cs.AI cs.CL cs.LG cs.MM

    Evaluating Text-to-Visual Generation with Image-to-Text Generation

    Authors: Zhiqiu Lin, Deepak Pathak, Baiqi Li, Jiayao Li, Xide Xia, Graham Neubig, Pengchuan Zhang, Deva Ramanan

    Abstract: Despite significant progress in generative AI, comprehensive evaluation remains challenging because of the lack of effective metrics and standardized benchmarks. For instance, the widely-used CLIPScore measures the alignment between a (generated) image and text prompt, but it fails to produce reliable scores for complex prompts involving compositions of objects, attributes, and relations. One reas… ▽ More

    Submitted 18 June, 2024; v1 submitted 1 April, 2024; originally announced April 2024.

    Comments: We open-source our data, model, and code at: https://github.com/linzhiqiu/t2v_metrics ; Project page: https://linzhiqiu.github.io/papers/vqascore

  19. arXiv:2404.01247  [pdf, other

    cs.CL cs.CV

    An image speaks a thousand words, but can everyone listen? On image transcreation for cultural relevance

    Authors: Simran Khanuja, Sathyanarayanan Ramamoorthy, Yueqi Song, Graham Neubig

    Abstract: Given the rise of multimedia content, human translators increasingly focus on culturally adapting not only words but also other modalities such as images to convey the same meaning. While several applications stand to benefit from this, machine translation systems remain confined to dealing with language in speech and text. In this work, we take a first step towards translating images to make them… ▽ More

    Submitted 19 June, 2024; v1 submitted 1 April, 2024; originally announced April 2024.

  20. arXiv:2403.15452  [pdf, other

    cs.CL cs.AI

    What Are Tools Anyway? A Survey from the Language Model Perspective

    Authors: Zhiruo Wang, Zhoujun Cheng, Hao Zhu, Daniel Fried, Graham Neubig

    Abstract: Language models (LMs) are powerful yet mostly for text generation tasks. Tools have substantially enhanced their performance for tasks that require complex skills. However, many works adopt the term "tool" in different ways, raising the question: What is a tool anyway? Subsequently, where and how do tools help LMs? In this survey, we provide a unified definition of tools as external programs used… ▽ More

    Submitted 18 March, 2024; originally announced March 2024.

  21. arXiv:2403.13169  [pdf, other

    cs.CL

    Wav2Gloss: Generating Interlinear Glossed Text from Speech

    Authors: Taiqi He, Kwanghee Choi, Lindia Tjuatja, Nathaniel R. Robinson, Jiatong Shi, Shinji Watanabe, Graham Neubig, David R. Mortensen, Lori Levin

    Abstract: Thousands of the world's languages are in danger of extinction--a tremendous threat to cultural identities and human language diversity. Interlinear Glossed Text (IGT) is a form of linguistic annotation that can support documentation and resource creation for these languages' communities. IGT typically consists of (1) transcriptions, (2) morphological segmentation, (3) glosses, and (4) free transl… ▽ More

    Submitted 5 June, 2024; v1 submitted 19 March, 2024; originally announced March 2024.

    Comments: ACL 2024 camera ready version

  22. arXiv:2403.09040  [pdf, other

    cs.CL

    RAGGED: Towards Informed Design of Retrieval Augmented Generation Systems

    Authors: Jennifer Hsia, Afreen Shaikh, Zhiruo Wang, Graham Neubig

    Abstract: Retrieval-augmented generation (RAG) greatly benefits language models (LMs) by providing additional context for tasks such as document-based question answering (DBQA). Despite its potential, the power of RAG is highly dependent on its configuration, raising the question: What is the optimal RAG configuration? To answer this, we introduce the RAGGED framework to analyze and optimize RAG systems. On… ▽ More

    Submitted 13 March, 2024; originally announced March 2024.

  23. arXiv:2403.08715  [pdf, other

    cs.CL

    SOTOPIA-$πぱい$: Interactive Learning of Socially Intelligent Language Agents

    Authors: Ruiyi Wang, Haofei Yu, Wenxin Zhang, Zhengyang Qi, Maarten Sap, Graham Neubig, Yonatan Bisk, Hao Zhu

    Abstract: Humans learn social skills through both imitation and social interaction. This social learning process is largely understudied by existing research on building language agents. Motivated by this gap, we propose an interactive learning method, SOTOPIA-$πぱい$, improving the social intelligence of language agents. This method leverages behavior cloning and self-reinforcement training on filtered social… ▽ More

    Submitted 25 April, 2024; v1 submitted 13 March, 2024; originally announced March 2024.

  24. arXiv:2403.06399  [pdf, other

    cs.CL

    GlossLM: Multilingual Pretraining for Low-Resource Interlinear Glossing

    Authors: Michael Ginn, Lindia Tjuatja, Taiqi He, Enora Rice, Graham Neubig, Alexis Palmer, Lori Levin

    Abstract: Language documentation projects often involve the creation of annotated text in a format such as interlinear glossed text (IGT), which captures fine-grained morphosyntactic analyses in a morpheme-by-morpheme format. However, there are few existing resources providing large amounts of standardized, easily accessible IGT data, limiting their applicability to linguistic research, and making it diffic… ▽ More

    Submitted 27 June, 2024; v1 submitted 10 March, 2024; originally announced March 2024.

    Comments: 19 pages, 7 figures Submitted to ACL ARR June 2024. First two authors are equal contribution

  25. arXiv:2403.01404  [pdf, other

    cs.CL

    What Is Missing in Multilingual Visual Reasoning and How to Fix It

    Authors: Yueqi Song, Simran Khanuja, Graham Neubig

    Abstract: NLP models today strive for supporting multiple languages and modalities, improving accessibility for diverse users. In this paper, we evaluate their multilingual, multimodal capabilities by testing on a visual reasoning task. We observe that proprietary systems like GPT-4V obtain the best performance on this task now, but open models lag in comparison. Surprisingly, GPT-4V exhibits similar perfor… ▽ More

    Submitted 3 March, 2024; originally announced March 2024.

  26. arXiv:2402.15449  [pdf, other

    cs.CL cs.LG

    Repetition Improves Language Model Embeddings

    Authors: Jacob Mitchell Springer, Suhas Kotha, Daniel Fried, Graham Neubig, Aditi Raghunathan

    Abstract: Recent approaches to improving the extraction of text embeddings from autoregressive large language models (LLMs) have largely focused on improvements to data, backbone pretrained language models, or improving task-differentiation via instructions. In this work, we address an architectural limitation of autoregressive models: token embeddings cannot contain information from tokens that appear late… ▽ More

    Submitted 23 February, 2024; originally announced February 2024.

    Comments: 36 pages, 11 figures, 16 tables

  27. arXiv:2402.12847  [pdf, other

    cs.CL cs.AI cs.LG

    Instruction-tuned Language Models are Better Knowledge Learners

    Authors: Zhengbao Jiang, Zhiqing Sun, Weijia Shi, Pedro Rodriguez, Chunting Zhou, Graham Neubig, Xi Victoria Lin, Wen-tau Yih, Srinivasan Iyer

    Abstract: In order for large language model (LLM)-based assistants to effectively adapt to evolving information needs, it must be possible to update their factual knowledge through continued training on new data. The standard recipe for doing so involves continued pre-training on new documents followed by instruction-tuning on question-answer (QA) pairs. However, we find that LLMs trained with this recipe s… ▽ More

    Submitted 25 May, 2024; v1 submitted 20 February, 2024; originally announced February 2024.

    Comments: ACL 2024. The reproduced data for this paper is available at https://github.com/Edward-Sun/PIT

  28. arXiv:2402.05406  [pdf, other

    cs.LG cs.CL

    Everybody Prune Now: Structured Pruning of LLMs with only Forward Passes

    Authors: Lucio Dery, Steven Kolawole, Jean-François Kagy, Virginia Smith, Graham Neubig, Ameet Talwalkar

    Abstract: Given the generational gap in available hardware between lay practitioners and the most endowed institutions, LLMs are becoming increasingly inaccessible as they grow in size. Whilst many approaches have been proposed to compress LLMs to make their resource consumption manageable, these methods themselves tend to be resource intensive, putting them out of the reach of the very user groups they tar… ▽ More

    Submitted 9 February, 2024; v1 submitted 7 February, 2024; originally announced February 2024.

    Comments: 15 pages, 4 fiigures, 15 tables

  29. arXiv:2401.16788  [pdf, other

    cs.CL cs.AI

    Can Large Language Models be Trusted for Evaluation? Scalable Meta-Evaluation of LLMs as Evaluators via Agent Debate

    Authors: Steffi Chern, Ethan Chern, Graham Neubig, Pengfei Liu

    Abstract: Despite the utility of Large Language Models (LLMs) across a wide range of tasks and scenarios, developing a method for reliably evaluating LLMs across varied contexts continues to be challenging. Modern evaluation approaches often use LLMs to assess responses generated by LLMs. However, the meta-evaluation conducted to assess the effectiveness of these LLMs as evaluators is typically constrained… ▽ More

    Submitted 30 January, 2024; originally announced January 2024.

  30. arXiv:2401.13649  [pdf, other

    cs.LG cs.CL cs.CV

    VisualWebArena: Evaluating Multimodal Agents on Realistic Visual Web Tasks

    Authors: Jing Yu Koh, Robert Lo, Lawrence Jang, Vikram Duvvur, Ming Chong Lim, Po-Yu Huang, Graham Neubig, Shuyan Zhou, Ruslan Salakhutdinov, Daniel Fried

    Abstract: Autonomous agents capable of planning, reasoning, and executing actions on the web offer a promising avenue for automating computer tasks. However, the majority of existing benchmarks primarily focus on text-based agents, neglecting many natural tasks that require visual information to effectively solve. Given that most computer interfaces cater to human perception, visual information often augmen… ▽ More

    Submitted 5 June, 2024; v1 submitted 24 January, 2024; originally announced January 2024.

    Comments: Accepted to ACL 2024. 24 pages. Project page: https://jykoh.com/vwa

  31. arXiv:2401.12869  [pdf, other

    cs.AI

    TroVE: Inducing Verifiable and Efficient Toolboxes for Solving Programmatic Tasks

    Authors: Zhiruo Wang, Daniel Fried, Graham Neubig

    Abstract: Language models (LMs) can solve tasks such as answering questions about tables or images by writing programs. However, using primitive functions often leads to verbose and error-prone programs, and higher-level functions require expert design. To enable better solutions without human labor, we ask code LMs to curate reusable high-level functions, and use them to write solutions. We present TROVE,… ▽ More

    Submitted 23 January, 2024; originally announced January 2024.

  32. arXiv:2401.06855  [pdf, other

    cs.CL

    Fine-grained Hallucination Detection and Editing for Language Models

    Authors: Abhika Mishra, Akari Asai, Vidhisha Balachandran, Yizhong Wang, Graham Neubig, Yulia Tsvetkov, Hannaneh Hajishirzi

    Abstract: Large language models (LMs) are prone to generate factual errors, which are often called hallucinations. In this paper, we introduce a comprehensive taxonomy of hallucinations and argue that hallucinations manifest in diverse forms, each requiring varying degrees of careful assessments to verify factuality. We propose a novel task of automatic fine-grained hallucination detection and construct a n… ▽ More

    Submitted 21 February, 2024; v1 submitted 12 January, 2024; originally announced January 2024.

    Comments: Our code, data, and demo are available at https://fine-grained-hallucination.github.io. Expanded human annotations adding a new LM, as well as included more baselines for comparison

  33. arXiv:2312.11444  [pdf, other

    cs.CL cs.AI

    An In-depth Look at Gemini's Language Abilities

    Authors: Syeda Nahida Akter, Zichun Yu, Aashiq Muhamed, Tianyue Ou, Alex Bäuerle, Ángel Alexander Cabrera, Krish Dholakia, Chenyan Xiong, Graham Neubig

    Abstract: The recently released Google Gemini class of models are the first to comprehensively report results that rival the OpenAI GPT series across a wide variety of tasks. In this paper, we do an in-depth exploration of Gemini's language abilities, making two contributions. First, we provide a third-party, objective comparison of the abilities of the OpenAI GPT and Google Gemini models with reproducible… ▽ More

    Submitted 24 December, 2023; v1 submitted 18 December, 2023; originally announced December 2023.

  34. arXiv:2312.07000  [pdf, other

    cs.CL cs.AI

    Alignment for Honesty

    Authors: Yuqing Yang, Ethan Chern, Xipeng Qiu, Graham Neubig, Pengfei Liu

    Abstract: Recent research has made significant strides in applying alignment techniques to enhance the helpfulness and harmlessness of large language models (LLMs) in accordance with human intentions. In this paper, we argue for the importance of alignment for honesty, ensuring that LLMs proactively refuse to answer questions when they lack knowledge, while still not being overly conservative. However, a pi… ▽ More

    Submitted 12 December, 2023; originally announced December 2023.

  35. arXiv:2312.03151  [pdf, other

    cs.LG

    Multitask Learning Can Improve Worst-Group Outcomes

    Authors: Atharva Kulkarni, Lucio Dery, Amrith Setlur, Aditi Raghunathan, Ameet Talwalkar, Graham Neubig

    Abstract: In order to create machine learning systems that serve a variety of users well, it is vital to not only achieve high average performance but also ensure equitable outcomes across diverse groups. However, most machine learning methods are designed to improve a model's average performance on a chosen end task without consideration for their impact on worst group error. Multitask learning (MTL) is on… ▽ More

    Submitted 28 February, 2024; v1 submitted 5 December, 2023; originally announced December 2023.

    Comments: 20 pages, 7 tables, 6 Figures

  36. arXiv:2311.09553  [pdf, other

    cs.AI

    Program-Aided Reasoners (better) Know What They Know

    Authors: Anubha Kabra, Sanketh Rangreji, Yash Mathur, Aman Madaan, Emmy Liu, Graham Neubig

    Abstract: Prior work shows that program-aided reasoning, in which large language models (LLMs) are combined with programs written in programming languages such as Python, can significantly improve accuracy on various reasoning tasks. However, while accuracy is essential, it is also important for such reasoners to "know what they know", which can be quantified through the calibration of the model. In this pa… ▽ More

    Submitted 15 November, 2023; originally announced November 2023.

  37. arXiv:2311.09308  [pdf, other

    cs.CL cs.AI cs.LG q-bio.NC

    Divergences between Language Models and Human Brains

    Authors: Yuchen Zhou, Emmy Liu, Graham Neubig, Michael J. Tarr, Leila Wehbe

    Abstract: Do machines and humans process language in similar ways? Recent research has hinted in the affirmative, finding that brain signals can be effectively predicted using the internal representations of language models (LMs). Although such results are thought to reflect shared computational principles between LMs and human brains, there are also clear differences in how LMs and humans represent and use… ▽ More

    Submitted 4 February, 2024; v1 submitted 15 November, 2023; originally announced November 2023.

  38. arXiv:2311.08377  [pdf, other

    cs.CL cs.AI

    Learning to Filter Context for Retrieval-Augmented Generation

    Authors: Zhiruo Wang, Jun Araki, Zhengbao Jiang, Md Rizwan Parvez, Graham Neubig

    Abstract: On-the-fly retrieval of relevant knowledge has proven an essential element of reliable systems for tasks such as open-domain question answering and fact verification. However, because retrieval systems are not perfect, generation models are required to generate outputs given partially or entirely irrelevant passages. This can cause over- or under-reliance on context, and result in problems in the… ▽ More

    Submitted 14 November, 2023; originally announced November 2023.

  39. arXiv:2311.06379  [pdf, other

    cs.CL

    DeMuX: Data-efficient Multilingual Learning

    Authors: Simran Khanuja, Srinivas Gowriraj, Lucio Dery, Graham Neubig

    Abstract: We consider the task of optimally fine-tuning pre-trained multilingual models, given small amounts of unlabelled target data and an annotation budget. In this paper, we introduce DEMUX, a framework that prescribes the exact data-points to label from vast amounts of unlabelled multilingual data, having unknown degrees of overlap with the target set. Unlike most prior works, our end-to-end framework… ▽ More

    Submitted 10 November, 2023; originally announced November 2023.

  40. arXiv:2311.04076  [pdf, other

    cs.CL

    Do LLMs exhibit human-like response biases? A case study in survey design

    Authors: Lindia Tjuatja, Valerie Chen, Sherry Tongshuang Wu, Ameet Talwalkar, Graham Neubig

    Abstract: As large language models (LLMs) become more capable, there is growing excitement about the possibility of using LLMs as proxies for humans in real-world tasks where subjective labels are desired, such as in surveys and opinion polling. One widely-cited barrier to the adoption of LLMs as proxies for humans in subjective tasks is their sensitivity to prompt wording - but interestingly, humans also d… ▽ More

    Submitted 5 February, 2024; v1 submitted 7 November, 2023; originally announced November 2023.

  41. arXiv:2310.18417  [pdf, other

    cs.CL

    Teacher Perception of Automatically Extracted Grammar Concepts for L2 Language Learning

    Authors: Aditi Chaudhary, Arun Sampath, Ashwin Sheshadri, Antonios Anastasopoulos, Graham Neubig

    Abstract: One of the challenges in language teaching is how best to organize rules regarding syntax, semantics, or phonology in a meaningful manner. This not only requires content creators to have pedagogical skills, but also have that language's deep understanding. While comprehensive materials to develop such curricula are available in English and some broadly spoken languages, for many other languages, t… ▽ More

    Submitted 27 October, 2023; originally announced October 2023.

    Comments: Accepted at EMNLP Findings 2023. arXiv admin note: substantial text overlap with arXiv:2206.05154

  42. arXiv:2310.11667  [pdf, other

    cs.AI cs.CL cs.LG

    SOTOPIA: Interactive Evaluation for Social Intelligence in Language Agents

    Authors: Xuhui Zhou, Hao Zhu, Leena Mathur, Ruohong Zhang, Haofei Yu, Zhengyang Qi, Louis-Philippe Morency, Yonatan Bisk, Daniel Fried, Graham Neubig, Maarten Sap

    Abstract: Humans are social beings; we pursue social goals in our daily interactions, which is a crucial aspect of social intelligence. Yet, AI systems' abilities in this realm remain elusive. We present SOTOPIA, an open-ended environment to simulate complex social interactions between artificial agents and evaluate their social intelligence. In our environment, agents role-play and interact under a wide va… ▽ More

    Submitted 22 March, 2024; v1 submitted 17 October, 2023; originally announced October 2023.

    Comments: Preprint, 43 pages. The first two authors contribute equally

  43. arXiv:2310.07081  [pdf, other

    cs.CL

    Crossing the Threshold: Idiomatic Machine Translation through Retrieval Augmentation and Loss Weighting

    Authors: Emmy Liu, Aditi Chaudhary, Graham Neubig

    Abstract: Idioms are common in everyday language, but often pose a challenge to translators because their meanings do not follow from the meanings of their parts. Despite significant advances, machine translation systems still struggle to translate idiomatic expressions. We provide a simple characterization of idiomatic translation and related issues. This allows us to conduct a synthetic experiment reveali… ▽ More

    Submitted 20 October, 2023; v1 submitted 10 October, 2023; originally announced October 2023.

    Comments: EMNLP 2023

  44. arXiv:2310.01387  [pdf, other

    cs.CL

    It's MBR All the Way Down: Modern Generation Techniques Through the Lens of Minimum Bayes Risk

    Authors: Amanda Bertsch, Alex Xie, Graham Neubig, Matthew R. Gormley

    Abstract: Minimum Bayes Risk (MBR) decoding is a method for choosing the outputs of a machine learning system based not on the output with the highest probability, but the output with the lowest risk (expected error) among multiple candidates. It is a simple but powerful method: for an additional cost at inference time, MBR provides reliable several-point improvements across metrics for a wide variety of ta… ▽ More

    Submitted 2 October, 2023; originally announced October 2023.

    Comments: Under submission

  45. arXiv:2309.07423  [pdf, other

    cs.CL

    ChatGPT MT: Competitive for High- (but not Low-) Resource Languages

    Authors: Nathaniel R. Robinson, Perez Ogayo, David R. Mortensen, Graham Neubig

    Abstract: Large language models (LLMs) implicitly learn to perform a range of language tasks, including machine translation (MT). Previous studies explore aspects of LLMs' MT capabilities. However, there exist a wide variety of languages for which recent LLM MT performance has never before been evaluated. Without published experimental evidence on the matter, it is difficult for speakers of the world's dive… ▽ More

    Submitted 14 September, 2023; originally announced September 2023.

    Comments: 27 pages, 9 figures, 14 tables

  46. arXiv:2308.12261  [pdf, other

    cs.CL

    Prompt2Model: Generating Deployable Models from Natural Language Instructions

    Authors: Vijay Viswanathan, Chenyang Zhao, Amanda Bertsch, Tongshuang Wu, Graham Neubig

    Abstract: Large language models (LLMs) enable system builders today to create competent NLP systems through prompting, where they only need to describe the task in natural language and provide a few examples. However, in other ways, LLMs are a step backward from traditional special-purpose NLP models; they require extensive computational resources for deployment and can be gated behind APIs. In this paper,… ▽ More

    Submitted 23 August, 2023; originally announced August 2023.

    Comments: 8 pages

  47. arXiv:2308.07286  [pdf, other

    cs.CL cs.LG

    The Devil is in the Errors: Leveraging Large Language Models for Fine-grained Machine Translation Evaluation

    Authors: Patrick Fernandes, Daniel Deutsch, Mara Finkelstein, Parker Riley, André F. T. Martins, Graham Neubig, Ankush Garg, Jonathan H. Clark, Markus Freitag, Orhan Firat

    Abstract: Automatic evaluation of machine translation (MT) is a critical tool driving the rapid iterative development of MT systems. While considerable progress has been made on estimating a single scalar quality score, current metrics lack the informativeness of more detailed schemes that annotate individual errors, such as Multidimensional Quality Metrics (MQM). In this paper, we help fill this gap by pro… ▽ More

    Submitted 14 August, 2023; originally announced August 2023.

    Comments: 19 pages

  48. arXiv:2307.13854  [pdf, other

    cs.AI cs.CL cs.LG

    WebArena: A Realistic Web Environment for Building Autonomous Agents

    Authors: Shuyan Zhou, Frank F. Xu, Hao Zhu, Xuhui Zhou, Robert Lo, Abishek Sridhar, Xianyi Cheng, Tianyue Ou, Yonatan Bisk, Daniel Fried, Uri Alon, Graham Neubig

    Abstract: With advances in generative AI, there is now potential for autonomous agents to manage daily tasks via natural language commands. However, current agents are primarily created and tested in simplified synthetic environments, leading to a disconnect with real-world scenarios. In this paper, we build an environment for language-guided agents that is highly realistic and reproducible. Specifically, w… ▽ More

    Submitted 16 April, 2024; v1 submitted 25 July, 2023; originally announced July 2023.

    Comments: Our code, data, environment reproduction resources, and video demonstrations are publicly available at https://webarena.dev/

  49. arXiv:2307.13528  [pdf, other

    cs.CL cs.AI

    FacTool: Factuality Detection in Generative AI -- A Tool Augmented Framework for Multi-Task and Multi-Domain Scenarios

    Authors: I-Chun Chern, Steffi Chern, Shiqi Chen, Weizhe Yuan, Kehua Feng, Chunting Zhou, Junxian He, Graham Neubig, Pengfei Liu

    Abstract: The emergence of generative pre-trained models has facilitated the synthesis of high-quality text, but it has also posed challenges in identifying factual errors in the generated text. In particular: (1) A wider range of tasks now face an increasing risk of containing factual errors when handled by generative models. (2) Generated texts tend to be lengthy and lack a clearly defined granularity for… ▽ More

    Submitted 26 July, 2023; v1 submitted 25 July, 2023; originally announced July 2023.

  50. arXiv:2307.04507  [pdf, other

    cs.CL cs.AI

    Improving Factuality of Abstractive Summarization via Contrastive Reward Learning

    Authors: I-Chun Chern, Zhiruo Wang, Sanjan Das, Bhavuk Sharma, Pengfei Liu, Graham Neubig

    Abstract: Modern abstractive summarization models often generate summaries that contain hallucinated or contradictory information. In this paper, we propose a simple but effective contrastive learning framework that incorporates recent developments in reward learning and factuality metrics. Empirical studies demonstrate that the proposed framework enables summarization models to learn from feedback of factu… ▽ More

    Submitted 10 July, 2023; originally announced July 2023.

    Comments: TrustNLP @ ACL 2023