Search | arXiv e-print repository

Showing 1–50 of 67 results for author: Boyd-Graber, J

  1. arXiv:2406.16342

    cs.CL

    ADVSCORE: A Metric for the Evaluation and Creation of Adversarial Benchmarks

    Authors: Yoo Yeon Sung, Eve Fleisig, Ishani Mondal, Jordan Lee Boyd-Graber

    Abstract: Adversarial benchmarks validate model abilities by providing samples that fool models but not humans. However, despite the proliferation of datasets that claim to be adversarial, there does not exist an established metric to evaluate how adversarial these datasets are. To address this lacuna, we introduce ADVSCORE, a metric which quantifies how adversarial and discriminative an adversarial dataset…

    Submitted 24 June, 2024; originally announced June 2024.

    Comments: arXiv admin note: substantial text overlap with arXiv:2401.11185

  2. arXiv:2406.15352

    cs.CL

    A SMART Mnemonic Sounds like "Glue Tonic": Mixing LLMs with Student Feedback to Make Mnemonic Learning Stick

    Authors: Nishant Balepur, Matthew Shu, Alexander Hoyle, Alison Robey, Shi Feng, Seraphina Goldfarb-Tarrant, Jordan Boyd-Graber

    Abstract: Keyword mnemonics are memorable explanations that link new terms to simpler keywords. Prior works generate mnemonics for students, but they do not guide models toward mnemonics that students prefer and that aid learning. We build SMART, a mnemonic generator trained on feedback from real students learning new terms. To train SMART, we first fine-tune LLaMA-2 on a curated set of user-written mnemonics. We the…

    Submitted 21 June, 2024; originally announced June 2024.

    Comments: In-Progress Preprint

  3. arXiv:2406.10900

    cs.CV cs.CL

    AUTOHALLUSION: Automatic Generation of Hallucination Benchmarks for Vision-Language Models

    Authors: Xiyang Wu, Tianrui Guan, Dianqi Li, Shuaiyi Huang, Xiaoyu Liu, Xijun Wang, Ruiqi Xian, Abhinav Shrivastava, Furong Huang, Jordan Lee Boyd-Graber, Tianyi Zhou, Dinesh Manocha

    Abstract: Large vision-language models (LVLMs) hallucinate: certain context cues in an image may trigger the language module's overconfident and incorrect reasoning on abnormal or hypothetical objects. Though a few benchmarks have been developed to investigate LVLM hallucinations, they mainly rely on hand-crafted corner cases whose fail patterns may hardly generalize, and finetuning on them could undermine…

    Submitted 16 June, 2024; originally announced June 2024.

  4. arXiv:2406.04643

    cs.CL

    More Victories, Less Cooperation: Assessing Cicero's Diplomacy Play

    Authors: Wichayaporn Wongkamjan, Feng Gu, Yanze Wang, Ulf Hermjakob, Jonathan May, Brandon M. Stewart, Jonathan K. Kummerfeld, Denis Peskoff, Jordan Lee Boyd-Graber

    Abstract: The boardgame Diplomacy is a challenging setting for communicative and cooperative artificial intelligence. The most prominent communicative Diplomacy AI, Cicero, has excellent strategic abilities, exceeding human players. However, the best Diplomacy players master communication, not just tactics, which is why the game has received attention as an AI challenge. This work seeks to understand the de…

    Submitted 7 June, 2024; originally announced June 2024.

  5. arXiv:2402.12291

    cs.CL

    KARL: Knowledge-Aware Retrieval and Representations aid Retention and Learning in Students

    Authors: Matthew Shu, Nishant Balepur, Shi Feng, Jordan Boyd-Graber

    Abstract: Flashcard schedulers are tools that rely on 1) student models to predict the flashcards a student knows; and 2) teaching policies to schedule cards based on these predictions. Existing student models, however, only use flashcard-level features, like the student's past responses, ignoring the semantic ties of flashcards. Deep Knowledge Tracing (DKT) models can capture semantic relations with langua…

    Submitted 19 February, 2024; originally announced February 2024.

    Comments: In-progress preprint

  6. arXiv:2402.11161

    cs.CL cs.AI

    PEDANTS (Precise Evaluations of Diverse Answer Nominee Text for Skinflints): Efficient Evaluation Analysis and Benchmarking for Open-Domain Question Answering

    Authors: Zongxia Li, Ishani Mondal, Yijun Liang, Huy Nghiem, Jordan Lee Boyd-Graber

    Abstract: Question answering (QA) can only make progress if we know if an answer is correct, but for many of the most challenging and interesting QA examples, current efficient answer correctness (AC) metrics do not align with human judgments, particularly verbose, free-form answers from large language models (LLMs). There are two challenges: a lack of diverse evaluation data and that models are too big and…

    Submitted 6 July, 2024; v1 submitted 16 February, 2024; originally announced February 2024.

    Comments: Efficient PEDANTS Classifier for short-form QA in github: https://github.com/zli12321/qa_metrics. arXiv admin note: text overlap with arXiv:2401.13170

  7. arXiv:2401.16348

    cs.CL cs.CY cs.HC

    Improving the TENOR of Labeling: Re-evaluating Topic Models for Content Analysis

    Authors: Zongxia Li, Andrew Mao, Daniel Stephens, Pranav Goel, Emily Walpole, Alden Dima, Juan Fung, Jordan Boyd-Graber

    Abstract: Topic models are a popular tool for understanding text collections, but their evaluation has been a point of contention. Automated evaluation metrics such as coherence are often used; however, their validity has been questioned for neural topic models (NTMs), and they can overlook a model's benefits in real-world applications. To this end, we conduct the first evaluation of neural, supervised and classic…

    Submitted 19 February, 2024; v1 submitted 29 January, 2024; originally announced January 2024.

    Comments: 19 pages, 5 tables, 6 figures, Accepted to EACL Main Conference 2024

  8. arXiv:2401.13170   

    cs.CL

    CFMatch: Aligning Automated Answer Equivalence Evaluation with Expert Judgments For Open-Domain Question Answering

    Authors: Zongxia Li, Ishani Mondal, Yijun Liang, Huy Nghiem, Jordan Boyd-Graber

    Abstract: Question answering (QA) can only make progress if we know if an answer is correct, but for many of the most challenging and interesting QA examples, current evaluation metrics to determine answer equivalence (AE) often do not align with human judgments, particularly more verbose, free-form answers from large language models (LLM). There are two challenges: a lack of data and that models are too bi…

    Submitted 29 June, 2024; v1 submitted 23 January, 2024; originally announced January 2024.

    Comments: A duplicate and polished version is in arXiv:2402.11161

  9. arXiv:2401.11185

    cs.CL cs.HC

    How the Advent of Ubiquitous Large Language Models both Stymie and Turbocharge Dynamic Adversarial Question Generation

    Authors: Yoo Yeon Sung, Ishani Mondal, Jordan Boyd-Graber

    Abstract: Dynamic adversarial question generation, where humans write examples to stump a model, aims to create examples that are realistic and informative. However, the advent of large language models (LLMs) has been a double-edged sword for human authors: more people are interested in seeing and pushing the limits of these models, but because the models are so much stronger an opponent, they are harder to…

    Submitted 20 January, 2024; originally announced January 2024.

  10. arXiv:2312.01308

    cs.CL

    Bridging Background Knowledge Gaps in Translation with Automatic Explicitation

    Authors: HyoJung Han, Jordan Lee Boyd-Graber, Marine Carpuat

    Abstract: Translations help people understand content written in another language. However, even correct literal translations do not fulfill that goal when people lack the necessary background to understand them. Professional translators incorporate explicitations to explain the missing context by considering cultural differences between source and target audiences. Despite its potential to help users, NLP…

    Submitted 3 December, 2023; originally announced December 2023.

    Comments: EMNLP2023

  11. arXiv:2311.16119

    cs.CR cs.AI cs.CL

    Ignore This Title and HackAPrompt: Exposing Systemic Vulnerabilities of LLMs through a Global Scale Prompt Hacking Competition

    Authors: Sander Schulhoff, Jeremy Pinto, Anaum Khan, Louis-François Bouchard, Chenglei Si, Svetlina Anati, Valen Tagliabue, Anson Liu Kost, Christopher Carnahan, Jordan Boyd-Graber

    Abstract: Large Language Models (LLMs) are deployed in interactive contexts with direct user engagement, such as chatbots and writing assistants. These deployments are vulnerable to prompt injection and jailbreaking (collectively, prompt hacking), in which models are manipulated to ignore their original instructions and follow potentially malicious ones. Although widely acknowledged as a significant securit…

    Submitted 2 March, 2024; v1 submitted 24 October, 2023; originally announced November 2023.

    Comments: 34 pages, 8 figures Codebase: https://github.com/PromptLabs/hackaprompt Dataset: https://huggingface.co/datasets/hackaprompt/hackaprompt-dataset/blob/main/README.md Playground: https://huggingface.co/spaces/hackaprompt/playground

  12. arXiv:2311.09542

    cs.CL

    Pregnant Questions: The Importance of Pragmatic Awareness in Maternal Health Question Answering

    Authors: Neha Srikanth, Rupak Sarkar, Heran Mane, Elizabeth M. Aparicio, Quynh C. Nguyen, Rachel Rudinger, Jordan Boyd-Graber

    Abstract: Questions posed by information-seeking users often contain implicit false or potentially harmful assumptions. In a high-risk domain such as maternal and infant health, a question-answering system must recognize these pragmatic constraints and go beyond simply answering user questions, examining them in context to respond helpfully. To achieve this, we study assumptions and implications, or pragmat…

    Submitted 2 April, 2024; v1 submitted 15 November, 2023; originally announced November 2023.

    Comments: Accepted to NAACL 2024

  13. arXiv:2311.09438

    cs.LG cs.CL cs.HC cs.IR

    Labeled Interactive Topic Models

    Authors: Kyle Seelman, Mozhi Zhang, Jordan Boyd-Graber

    Abstract: Topic models are valuable for understanding extensive document collections, but they don't always identify the most relevant topics. Classical probabilistic and anchor-based topic models offer interactive versions that allow users to guide the models towards more pertinent topics. However, such interactive features have been lacking in neural topic models. To correct this lacuna, we introduce a us…

    Submitted 7 February, 2024; v1 submitted 15 November, 2023; originally announced November 2023.

  14. arXiv:2310.13859

    cs.CL cs.CV

    Not all Fake News is Written: A Dataset and Analysis of Misleading Video Headlines

    Authors: Yoo Yeon Sung, Jordan Boyd-Graber, Naeemul Hassan

    Abstract: Polarization and the marketplace for impressions have conspired to make navigating information online difficult for users, and while there has been a significant effort to detect false or misleading text, multimodal datasets have received considerably less attention. To complement existing resources, we present multimodal Video Misleading Headline (VMH), a dataset that consists of videos and wheth…

    Submitted 14 December, 2023; v1 submitted 20 October, 2023; originally announced October 2023.

    Comments: EMNLP 2023 Main Paper

  15. arXiv:2310.12558

    cs.CL cs.HC

    Large Language Models Help Humans Verify Truthfulness -- Except When They Are Convincingly Wrong

    Authors: Chenglei Si, Navita Goyal, Sherry Tongshuang Wu, Chen Zhao, Shi Feng, Hal Daumé III, Jordan Boyd-Graber

    Abstract: Large Language Models (LLMs) are increasingly used for accessing information on the web. Their truthfulness and factuality are thus of great interest. To help users make the right decisions about the information they get, LLMs should not only provide information but also help users fact-check it. Our experiments with 80 crowdworkers compare language models with search engines (information retrieva…

    Submitted 1 April, 2024; v1 submitted 19 October, 2023; originally announced October 2023.

    Comments: NAACL 2024

  16. arXiv:2307.07049

    cs.CL

    MegaWika: Millions of reports and their sources across 50 diverse languages

    Authors: Samuel Barham, Orion Weller, Michelle Yuan, Kenton Murray, Mahsa Yarmohammadi, Zhengping Jiang, Siddharth Vashishtha, Alexander Martin, Anqi Liu, Aaron Steven White, Jordan Boyd-Graber, Benjamin Van Durme

    Abstract: To foster the development of new models for collaborative AI-assisted report generation, we introduce MegaWika, consisting of 13 million Wikipedia articles in 50 diverse languages, along with their 71 million referenced source materials. We process this dataset for a myriad of applications, going beyond the initial Wikipedia citation extraction and web scraping of content, including translating no…

    Submitted 13 July, 2023; originally announced July 2023.

    Comments: Submitted to ACL, 2023

    ACM Class: I.2.7

  17. arXiv:2305.14659

    cs.CL

    InteractiveIE: Towards Assessing the Strength of Human-AI Collaboration in Improving the Performance of Information Extraction

    Authors: Ishani Mondal, Michelle Yuan, Anandhavelu N, Aparna Garimella, Francis Ferraro, Andrew Blair-Stanek, Benjamin Van Durme, Jordan Boyd-Graber

    Abstract: Learning template-based information extraction from documents is a crucial yet difficult task. Prior template-based IE approaches assume foreknowledge of the domain templates; however, real-world IE does not have pre-defined schemas and is a figure-it-out-as-you-go phenomenon. To quickly bootstrap templates in a real-world setting, we need to induce template slots from documents with zero or minimal…

    Submitted 17 November, 2023; v1 submitted 23 May, 2023; originally announced May 2023.

    Comments: Version 2

  18. arXiv:2305.14628

    cs.CL cs.AI

    Getting MoRE out of Mixture of Language Model Reasoning Experts

    Authors: Chenglei Si, Weijia Shi, Chen Zhao, Luke Zettlemoyer, Jordan Boyd-Graber

    Abstract: While recent large language models (LLMs) improve on various question answering (QA) datasets, it remains difficult for a single model to generalize across question types that require distinct reasoning abilities. We provide empirical evidence that state-of-the-art LLMs suffer from poor generalizability on reasoning types beyond those seen in the prompt. To remedy this, we propose a Mixture-of-Rea…

    Submitted 20 October, 2023; v1 submitted 23 May, 2023; originally announced May 2023.

    Comments: EMNLP 2023 Findings

  19. arXiv:2212.03296

    cs.CL cs.AI

    Cheater's Bowl: Human vs. Computer Search Strategies for Open-Domain Question Answering

    Authors: Wanrong He, Andrew Mao, Jordan Boyd-Graber

    Abstract: For humans and computers, the first step in answering an open-domain question is retrieving a set of relevant documents from a large corpus. However, the strategies that computers use fundamentally differ from those of humans. To better understand these differences, we design a gamified interface for data collection -- Cheater's Bowl -- where a human answers complex questions with access to both t…

    Submitted 15 November, 2022; originally announced December 2022.

    Comments: Findings of EMNLP 2022

  20. arXiv:2210.09150

    cs.CL

    Prompting GPT-3 To Be Reliable

    Authors: Chenglei Si, Zhe Gan, Zhengyuan Yang, Shuohang Wang, Jianfeng Wang, Jordan Boyd-Graber, Lijuan Wang

    Abstract: Large language models (LLMs) show impressive abilities via few-shot prompting. Commercialized APIs such as OpenAI GPT-3 further increase their use in real-world language applications. However, the crucial problem of how to improve the reliability of GPT-3 is still under-explored. While reliability is a broad and vaguely defined term, we decompose reliability into four main facets that correspond t…

    Submitted 14 February, 2023; v1 submitted 17 October, 2022; originally announced October 2022.

    Comments: ICLR 2023

  21. arXiv:2210.06599

    cs.CL

    Improving Question Answering with Generation of NQ-like Questions

    Authors: Saptarashmi Bandyopadhyay, Shraman Pal, Hao Zou, Abhranil Chandra, Jordan Boyd-Graber

    Abstract: Question Answering (QA) systems require a large amount of annotated data which is costly and time-consuming to gather. Converting datasets of existing QA benchmarks is challenging due to different formats and complexities. To address these issues, we propose an algorithm to automatically generate shorter questions resembling day-to-day human communication in the Natural Questions (NQ) dataset fro…

    Submitted 12 October, 2022; originally announced October 2022.

  22. arXiv:2205.12507

    cs.CL

    Re-Examining Calibration: The Case of Question Answering

    Authors: Chenglei Si, Chen Zhao, Sewon Min, Jordan Boyd-Graber

    Abstract: For users to trust model predictions, they need to understand model outputs, particularly their confidence - calibration aims to adjust (calibrate) models' confidence to match expected accuracy. We argue that the traditional calibration evaluation does not promote effective calibrations: for example, it can encourage always assigning a mediocre confidence score to all predictions, which does not h…

    Submitted 23 October, 2022; v1 submitted 25 May, 2022; originally announced May 2022.

    Comments: EMNLP 2022 Findings

  23. arXiv:2203.13420

    cs.CL cs.AI cs.SD eess.AS

    Automatic Song Translation for Tonal Languages

    Authors: Fenfei Guo, Chen Zhang, Zhirui Zhang, Qixin He, Kejun Zhang, Jun Xie, Jordan Boyd-Graber

    Abstract: This paper develops automatic song translation (AST) for tonal languages and addresses the unique challenge of aligning words' tones with melody of a song in addition to conveying the original meaning. We propose three criteria for effective AST -- preserving meaning, singability and intelligibility -- and design metrics for these criteria. We develop a new benchmark for English--Mandarin song tra…

    Submitted 24 March, 2022; originally announced March 2022.

    Comments: Accepted at Findings of ACL 2022, 15 pages, 4 Tables and 10 Figures

  24. arXiv:2203.10753

    cs.CL

    Match the Script, Adapt if Multilingual: Analyzing the Effect of Multilingual Pretraining on Cross-lingual Transferability

    Authors: Yoshinari Fujinuma, Jordan Boyd-Graber, Katharina Kann

    Abstract: Pretrained multilingual models enable zero-shot learning even for unseen languages, and that performance can be further improved via adaptation prior to finetuning. However, it is unclear how the number of pretraining languages influences a model's zero-shot learning for languages unseen during pretraining. To fill this gap, we ask the following research questions: (1) How does the number of pretr…

    Submitted 21 March, 2022; originally announced March 2022.

    Comments: ACL 2022 camera ready

  25. arXiv:2110.04889

    cs.CL

    Distantly-Supervised Evidence Retrieval Enables Question Answering without Evidence Annotation

    Authors: Chen Zhao, Chenyan Xiong, Jordan Boyd-Graber, Hal Daumé III

    Abstract: Open-domain question answering answers a question based on evidence retrieved from a large corpus. State-of-the-art neural approaches require intermediate evidence annotations for training. However, such intermediate annotations are expensive, and methods that rely on them cannot transfer to the more common setting, where only question-answer pairs are available. This paper investigates whether mo…

    Submitted 10 October, 2021; originally announced October 2021.

    Comments: EMNLP 2021

  26. arXiv:2109.05289

    cs.CL

    What's in a Name? Answer Equivalence For Open-Domain Question Answering

    Authors: Chenglei Si, Chen Zhao, Jordan Boyd-Graber

    Abstract: A flaw in QA evaluation is that annotations often only provide one gold answer. Thus, model predictions semantically equivalent to the answer but superficially different are considered incorrect. This work explores mining alias entities from knowledge bases and using them as additional gold answers (i.e., equivalent answers). We incorporate answers for two settings: evaluation with additional answ…

    Submitted 11 September, 2021; originally announced September 2021.

    Comments: EMNLP 2021 main conference

  27. arXiv:2107.08146

    cs.CL

    Picard understanding Darmok: A Dataset and Model for Metaphor-Rich Translation in a Constructed Language

    Authors: Peter Jansen, Jordan Boyd-Graber

    Abstract: Tamarian, a fictional language introduced in the Star Trek episode Darmok, communicates meaning through utterances of metaphorical references, such as "Darmok and Jalad at Tanagra" instead of "We should work together." This work assembles a Tamarian-English dictionary of utterances from the original episode and several follow-on novels, and uses this to construct a parallel corpus of 456 English-T…

    Submitted 14 October, 2022; v1 submitted 16 July, 2021; originally announced July 2021.

    Comments: Accepted to the 2022 Workshop on Figurative Language Processing (at EMNLP 2022)

  28. arXiv:2107.02173

    cs.CL cs.LG

    Is Automated Topic Model Evaluation Broken?: The Incoherence of Coherence

    Authors: Alexander Hoyle, Pranav Goel, Denis Peskov, Andrew Hian-Cheong, Jordan Boyd-Graber, Philip Resnik

    Abstract: Topic model evaluation, like evaluation of other unsupervised methods, can be contentious. However, the field has coalesced around automated estimates of topic coherence, which rely on the frequency of word co-occurrences in a reference corpus. Contemporary neural topic models surpass classical ones according to these metrics. At the same time, topic model evaluation suffers from a validation gap:…

    Submitted 27 October, 2021; v1 submitted 5 July, 2021; originally announced July 2021.

    Comments: Accepted to NeurIPS 2021 (spotlight presentation). CR version

  29. arXiv:2104.07611

    cs.CL

    Adapting Coreference Resolution Models through Active Learning

    Authors: Michelle Yuan, Patrick Xia, Chandler May, Benjamin Van Durme, Jordan Boyd-Graber

    Abstract: Neural coreference resolution models trained on one dataset may not transfer to new, low-resource domains. Active learning mitigates this problem by sampling a small subset of data for annotators to label. While active learning is well-defined for classification tasks, its application to coreference resolution is neither well-defined nor fully understood. This paper explores how to actively label…

    Submitted 28 March, 2022; v1 submitted 15 April, 2021; originally announced April 2021.

    Comments: Accepted at ACL 2022 Main Conference

  30. arXiv:2104.07571

    cs.CL

    Toward Deconfounding the Influence of Entity Demographics for Question Answering Accuracy

    Authors: Maharshi Gor, Kellie Webster, Jordan Boyd-Graber

    Abstract: The goal of question answering (QA) is to answer any question. However, major QA datasets have skewed distributions over gender, profession, and nationality. Despite that skew, model accuracy analysis reveals little evidence that accuracy is lower for people based on gender or nationality; instead, there is more variation on professions (question topic). But QA's lack of representation could itsel…

    Submitted 10 September, 2021; v1 submitted 15 April, 2021; originally announced April 2021.

    Comments: Accepted at EMNLP 2021

  31. arXiv:2104.05883

    cs.CL

    Multi-Step Reasoning Over Unstructured Text with Beam Dense Retrieval

    Authors: Chen Zhao, Chenyan Xiong, Jordan Boyd-Graber, Hal Daumé III

    Abstract: Complex question answering often requires finding a reasoning chain that consists of multiple evidence pieces. Current approaches incorporate the strengths of structured knowledge and unstructured text, assuming text corpora are semi-structured. Building on dense retrieval methods, we propose a new multi-step retrieval approach (BeamDR) that iteratively forms an evidence chain through beam search i…

    Submitted 12 April, 2021; originally announced April 2021.

    Comments: NAACL 2021

  32. arXiv:2104.04725

    cs.CL

    Fool Me Twice: Entailment from Wikipedia Gamification

    Authors: Julian Martin Eisenschlos, Bhuwan Dhingra, Jannis Bulian, Benjamin Börschinger, Jordan Boyd-Graber

    Abstract: We release FoolMeTwice (FM2 for short), a large dataset of challenging entailment pairs collected through a fun multi-player game. Gamification encourages adversarial examples, drastically lowering the number of examples that can be solved using "shortcuts" compared to other popular entailment datasets. Players are presented with two tasks. The first task asks the player to write a plausible claim…

    Submitted 10 April, 2021; originally announced April 2021.

    Comments: Published in NAACL 2021

  33. Complex Factoid Question Answering with a Free-Text Knowledge Graph

    Authors: Chen Zhao, Chenyan Xiong, Xin Qian, Jordan Boyd-Graber

    Abstract: We introduce DELFT, a factoid question answering system which combines the nuance and depth of knowledge graph question answering approaches with the broader coverage of free-text. DELFT builds a free-text knowledge graph from Wikipedia, with entities as nodes and sentences in which entities co-occur as edges. For each question, DELFT finds the subgraph linking question entity nodes to candidates…

    Submitted 23 March, 2021; originally announced March 2021.

    Comments: WWW2020

  34. arXiv:2101.00133

    cs.CL cs.AI

    NeurIPS 2020 EfficientQA Competition: Systems, Analyses and Lessons Learned

    Authors: Sewon Min, Jordan Boyd-Graber, Chris Alberti, Danqi Chen, Eunsol Choi, Michael Collins, Kelvin Guu, Hannaneh Hajishirzi, Kenton Lee, Jennimaria Palomaki, Colin Raffel, Adam Roberts, Tom Kwiatkowski, Patrick Lewis, Yuxiang Wu, Heinrich Küttler, Linqing Liu, Pasquale Minervini, Pontus Stenetorp, Sebastian Riedel, Sohee Yang, Minjoon Seo, Gautier Izacard, Fabio Petroni, Lucas Hosseini, et al. (28 additional authors not shown)

    Abstract: We review the EfficientQA competition from NeurIPS 2020. The competition focused on open-domain question answering (QA), where systems take natural language questions as input and return natural language answers. The aim of the competition was to build systems that can predict correct answers while also satisfying strict on-disk memory budgets. These memory budgets were designed to encourage conte…

    Submitted 19 September, 2021; v1 submitted 31 December, 2020; originally announced January 2021.

    Comments: 26 pages; Published in Proceedings of Machine Learning Research (PMLR), NeurIPS 2020 Competition and Demonstration Track

  35. arXiv:2012.00614

    cs.CL cs.AI

    CLIMATE-FEVER: A Dataset for Verification of Real-World Climate Claims

    Authors: Thomas Diggelmann, Jordan Boyd-Graber, Jannis Bulian, Massimiliano Ciaramita, Markus Leippold

    Abstract: We introduce CLIMATE-FEVER, a new publicly available dataset for verification of climate change-related claims. By providing a dataset for the research community, we aim to facilitate and encourage work on improving algorithms for retrieving evidential support for climate-specific claims, addressing the underlying language understanding challenges, and ultimately help alleviate the impact of misin…

    Submitted 2 January, 2021; v1 submitted 1 December, 2020; originally announced December 2020.

    Comments: Accepted for the Tackling Climate Change with Machine Learning Workshop at NeurIPS 2020

  36. arXiv:2012.00483

    cs.CL cs.AI

    ClimaText: A Dataset for Climate Change Topic Detection

    Authors: Francesco S. Varini, Jordan Boyd-Graber, Massimiliano Ciaramita, Markus Leippold

    Abstract: Climate change communication in the mass media and other textual sources may affect and shape public perception. Extracting climate change information from these sources is an important task, e.g., for filtering content and e-discovery, sentiment analysis, automatic summarization, question-answering, and fact-checking. However, automating this process is a challenge, as climate change is a complex…

    Submitted 2 January, 2021; v1 submitted 1 December, 2020; originally announced December 2020.

    Comments: Accepted for the Tackling Climate Change with Machine Learning Workshop at NeurIPS 2020

  37. arXiv:2010.11246

    cs.CL cs.AI

    On the Potential of Lexico-logical Alignments for Semantic Parsing to SQL Queries

    Authors: Tianze Shi, Chen Zhao, Jordan Boyd-Graber, Hal Daumé III, Lillian Lee

    Abstract: Large-scale semantic parsing datasets annotated with logical forms have enabled major advances in supervised approaches. But can richer supervision help even more? To explore the utility of fine-grained, lexical-level supervision, we introduce Squall, a dataset that enriches 11,276 WikiTableQuestions English-language questions with manually created SQL equivalents plus alignments between SQL and q…

    Submitted 21 October, 2020; originally announced October 2020.

    Comments: Findings of ACL: EMNLP 2020

    ACM Class: I.2.7

    Journal ref: Findings of ACL: EMNLP 2020

  38. arXiv:2010.09535

    cs.CL cs.LG

    Cold-start Active Learning through Self-supervised Language Modeling

    Authors: Michelle Yuan, Hsuan-Tien Lin, Jordan Boyd-Graber

    Abstract: Active learning strives to reduce annotation costs by choosing the most critical examples to label. Typically, the active learning strategy is contingent on the classification model. For instance, uncertainty sampling depends on poorly calibrated model confidence scores. In the cold-start setting, active learning is impractical because of model instability and data scarcity. Fortunately, modern NL…

    Submitted 22 October, 2020; v1 submitted 19 October, 2020; originally announced October 2020.

    Comments: Published in EMNLP 2020

  39. arXiv:2005.00524

    cs.CL cs.LG

    Why Overfitting Isn't Always Bad: Retrofitting Cross-Lingual Word Embeddings to Dictionaries

    Authors: Mozhi Zhang, Yoshinari Fujinuma, Michael J. Paul, Jordan Boyd-Graber

    Abstract: Cross-lingual word embeddings (CLWE) are often evaluated on bilingual lexicon induction (BLI). Recent CLWE methods use linear projections, which underfit the training dictionary, to generalize on BLI. However, underfitting can hinder generalization to other downstream tasks that rely on words from the training dictionary. We address this limitation by retrofitting CLWE to the training dictionary,…

    Submitted 1 May, 2020; originally announced May 2020.

    Comments: ACL 2020

  40. arXiv:1911.04156

    cs.CL cs.AI

    Meta Answering for Machine Reading

    Authors: Benjamin Borschinger, Jordan Boyd-Graber, Christian Buck, Jannis Bulian, Massimiliano Ciaramita, Michelle Chen Huebscher, Wojciech Gajewski, Yannic Kilcher, Rodrigo Nogueira, Lierni Sestorain Saralegu

    Abstract: We investigate a framework for machine reading, inspired by real world information-seeking problems, where a meta question answering system interacts with a black box environment. The environment encapsulates a competitive machine reader based on BERT, providing candidate answers to questions, and possibly some context. To validate the realism of our formulation, we ask humans to play the role of…

    Submitted 30 April, 2020; v1 submitted 11 November, 2019; originally announced November 2019.

  41. arXiv:1911.03070  [pdf, other]

    cs.CL cs.LG

    Interactive Refinement of Cross-Lingual Word Embeddings

    Authors: Michelle Yuan, Mozhi Zhang, Benjamin Van Durme, Leah Findlater, Jordan Boyd-Graber

    Abstract: Cross-lingual word embeddings transfer knowledge between languages: models trained on high-resource languages can predict in low-resource languages. We introduce CLIME, an interactive system to quickly refine cross-lingual word embeddings for a given classification problem. First, CLIME ranks words by their salience to the downstream task. Then, users mark similarity between keywords and their nea…

    Submitted 3 June, 2021; v1 submitted 8 November, 2019; originally announced November 2019.

    Comments: EMNLP 2020; first two authors contribute equally

  42. arXiv:1910.14464  [pdf, other]

    cs.CL cs.AI cs.LG

    What Question Answering can Learn from Trivia Nerds

    Authors: Jordan Boyd-Graber, Benjamin Börschinger

    Abstract: In addition to the traditional task of getting machines to answer questions, a major research question in question answering is to create interesting, challenging questions that can help systems learn how to answer questions and also reveal which systems are the best at answering questions. We argue that creating a question answering dataset -- and the ubiquitous leaderboard that goes with it -- c…

    Submitted 21 April, 2020; v1 submitted 31 October, 2019; originally announced October 2019.

    Journal ref: ACL 2020

  43. arXiv:1908.02914  [pdf, other]

    cs.CL

    Mitigating Noisy Inputs for Question Answering

    Authors: Denis Peskov, Joe Barrow, Pedro Rodriguez, Graham Neubig, Jordan Boyd-Graber

    Abstract: Natural language processing systems are often downstream of unreliable inputs: machine translation, optical character recognition, or speech recognition. For instance, virtual assistants can only answer your questions after understanding your speech. We investigate and mitigate the effects of noise from Automatic Speech Recognition systems on two factoid Question Answering (QA) tasks. Integrating…

    Submitted 7 August, 2019; originally announced August 2019.

  44. A Resource-Free Evaluation Metric for Cross-Lingual Word Embeddings Based on Graph Modularity

    Authors: Yoshinari Fujinuma, Jordan Boyd-Graber, Michael J. Paul

    Abstract: Cross-lingual word embeddings encode the meaning of words from different languages into a shared low-dimensional space. An important requirement for many downstream tasks is that word similarity should be independent of language - i.e., word vectors within one language should not be more similar to each other than to words in another language. We measure this characteristic using modularity, a net…

    Submitted 5 June, 2019; originally announced June 2019.

    Comments: Accepted to ACL 2019, camera-ready

  45. arXiv:1906.01622  [pdf, other]

    cs.CL cs.AI cs.LG

    Are Girls Neko or Shōjo? Cross-Lingual Alignment of Non-Isomorphic Embeddings with Iterative Normalization

    Authors: Mozhi Zhang, Keyulu Xu, Ken-ichi Kawarabayashi, Stefanie Jegelka, Jordan Boyd-Graber

    Abstract: Cross-lingual word embeddings (CLWE) underlie many multilingual natural language processing systems, often through orthogonal transformations of pre-trained monolingual embeddings. However, orthogonal mapping only works on language pairs whose embeddings are naturally isomorphic. For non-isomorphic pairs, our method (Iterative Normalization) transforms monolingual embeddings to make orthogonal ali…

    Submitted 11 November, 2019; v1 submitted 4 June, 2019; originally announced June 2019.

    Comments: ACL 2019

  46. arXiv:1905.13126  [pdf, other]

    cs.IR cs.CL cs.LG stat.ML

    Automatic Evaluation of Local Topic Quality

    Authors: Jeffrey Lund, Piper Armstrong, Wilson Fearn, Stephen Cowley, Courtni Byun, Jordan Boyd-Graber, Kevin Seppi

    Abstract: Topic models are typically evaluated with respect to the global topic distributions that they generate, using metrics such as coherence, but without regard to local (token-level) topic assignments. Token-level assignments are important for downstream tasks such as classification. Even recent models, which aim to improve the quality of these token-level topic assignments, have been evaluated only w…

    Submitted 17 May, 2019; originally announced May 2019.

    Comments: 8 pages 4 figures 3 tables

  47. arXiv:1905.09864  [pdf, ps, other]

    cs.CL cs.HC cs.IR cs.LG

    Why Didn't You Listen to Me? Comparing User Control of Human-in-the-Loop Topic Models

    Authors: Varun Kumar, Alison Smith-Renner, Leah Findlater, Kevin Seppi, Jordan Boyd-Graber

    Abstract: To address the lack of comparative evaluation of Human-in-the-Loop Topic Modeling (HLTM) systems, we implement and evaluate three contrasting HLTM modeling approaches using simulation experiments. These approaches extend previously proposed frameworks, including constraints and informed prior-based methods. Users should have a sense of control in HLTM systems, so we propose a control metric to mea…

    Submitted 3 June, 2019; v1 submitted 23 May, 2019; originally announced May 2019.

    Comments: In proceedings of ACL 2019

  48. arXiv:1905.05778  [pdf, ps, other]

    cs.LG cs.AI cs.CL stat.ML

    Misleading Failures of Partial-input Baselines

    Authors: Shi Feng, Eric Wallace, Jordan Boyd-Graber

    Abstract: Recent work establishes dataset difficulty and removes annotation artifacts via partial-input baselines (e.g., hypothesis-only models for SNLI or question-only models for VQA). When a partial-input baseline gets high accuracy, a dataset is cheatable. However, the converse is not necessarily true: the failure of a partial-input baseline does not mean a dataset is free of artifacts. To illustrate th…

    Submitted 18 June, 2019; v1 submitted 14 May, 2019; originally announced May 2019.

    Comments: ACL 2019

  49. arXiv:1904.04792  [pdf, other]

    cs.CL

    Quizbowl: The Case for Incremental Question Answering

    Authors: Pedro Rodriguez, Shi Feng, Mohit Iyyer, He He, Jordan Boyd-Graber

    Abstract: Scholastic trivia competitions test knowledge and intelligence through mastery of question answering. Modern question answering benchmarks are one variant of the Turing test. Specifically, answering a set of questions as well as a human is a minimum bar towards demonstrating human-like intelligence. This paper makes the case that the format of one competition -- where participants can answer in th…

    Submitted 11 February, 2021; v1 submitted 9 April, 2019; originally announced April 2019.

  50. arXiv:1901.10991  [pdf, other]

    math.OC

    Tensor Robust Principal Component Analysis: Better recovery with atomic norm regularization

    Authors: Derek Driggs, Stephen Becker, Jordan Boyd-Graber

    Abstract: This paper studies tensor-based Robust Principal Component Analysis (RPCA) using atomic-norm regularization. Given the superposition of a sparse and a low-rank tensor, we present conditions under which it is possible to exactly recover the sparse and low-rank components. Our results improve on existing performance guarantees for tensor-RPCA, including those for matrix RPCA. Our guarantees also sho…

    Submitted 30 January, 2019; originally announced January 2019.

    Comments: 39 pages, 3 figures, 3 tables

    MSC Class: 90C25; 15A69; 15A83