(Translated by https://www.hiragana.jp/)
Search | arXiv e-print repository
Skip to main content

Showing 1–10 of 10 results for author: Boratko, M

.
  1. arXiv:2406.13121  [pdf, other

    cs.CL cs.AI cs.IR

    Can Long-Context Language Models Subsume Retrieval, RAG, SQL, and More?

    Authors: Jinhyuk Lee, Anthony Chen, Zhuyun Dai, Dheeru Dua, Devendra Singh Sachan, Michael Boratko, Yi Luan, Sébastien M. R. Arnold, Vincent Perot, Siddharth Dalmia, Hexiang Hu, Xudong Lin, Panupong Pasupat, Aida Amini, Jeremy R. Cole, Sebastian Riedel, Iftekhar Naim, Ming-Wei Chang, Kelvin Guu

    Abstract: Long-context language models (LCLMs) have the potential to revolutionize our approach to tasks traditionally reliant on external tools like retrieval systems or databases. Leveraging LCLMs' ability to natively ingest and process entire corpora of information offers numerous advantages. It enhances user-friendliness by eliminating the need for specialized knowledge of tools, provides robust end-to-… ▽ More

    Submitted 18 June, 2024; originally announced June 2024.

    Comments: 29 pages. Dataset available at https://github.com/google-deepmind/loft

  2. arXiv:2406.04145  [pdf, other

    cs.CL cs.AI

    Every Answer Matters: Evaluating Commonsense with Probabilistic Measures

    Authors: Qi Cheng, Michael Boratko, Pranay Kumar Yelugam, Tim O'Gorman, Nalini Singh, Andrew McCallum, Xiang Lorraine Li

    Abstract: Large language models have demonstrated impressive performance on commonsense tasks; however, these tasks are often posed as multiple-choice questions, allowing models to exploit systematic biases. Commonsense is also inherently probabilistic with multiple correct answers. The purpose of "boiling water" could be making tea and cooking, but it also could be killing germs. Existing tasks do not capt… ▽ More

    Submitted 6 June, 2024; originally announced June 2024.

    Comments: ACL 2024 Camera Ready

  3. arXiv:2403.20327  [pdf, other

    cs.CL cs.AI

    Gecko: Versatile Text Embeddings Distilled from Large Language Models

    Authors: Jinhyuk Lee, Zhuyun Dai, Xiaoqi Ren, Blair Chen, Daniel Cer, Jeremy R. Cole, Kai Hui, Michael Boratko, Rajvi Kapadia, Wen Ding, Yi Luan, Sai Meher Karthik Duddu, Gustavo Hernandez Abrego, Weiqiang Shi, Nithi Gupta, Aditya Kusupati, Prateek Jain, Siddhartha Reddy Jonnalagadda, Ming-Wei Chang, Iftekhar Naim

    Abstract: We present Gecko, a compact and versatile text embedding model. Gecko achieves strong retrieval performance by leveraging a key idea: distilling knowledge from large language models (LLMs) into a retriever. Our two-step distillation process begins with generating diverse, synthetic paired data using an LLM. Next, we further refine the data quality by retrieving a set of candidate passages for each… ▽ More

    Submitted 29 March, 2024; originally announced March 2024.

    Comments: 18 pages

  4. arXiv:2109.04997  [pdf, other

    cs.CL cs.LG

    Box Embeddings: An open-source library for representation learning using geometric structures

    Authors: Tejas Chheda, Purujit Goyal, Trang Tran, Dhruvesh Patel, Michael Boratko, Shib Sankar Dasgupta, Andrew McCallum

    Abstract: A major factor contributing to the success of modern representation learning is the ease of performing various vector operations. Recently, objects with geometric structures (eg. distributions, complex or hyperbolic vectors, or regions such as cones, disks, or boxes) have been explored for their alternative inductive biases and additional representational capacities. In this work, we introduce Box… ▽ More

    Submitted 10 September, 2021; originally announced September 2021.

    Comments: The source code and the usage and API documentation for the library is available at https://github.com/iesl/box-embeddings and https://www.iesl.cs.umass.edu/box-embeddings/main/index.html

  5. arXiv:2106.14361  [pdf, other

    cs.CL cs.AI

    Word2Box: Capturing Set-Theoretic Semantics of Words using Box Embeddings

    Authors: Shib Sankar Dasgupta, Michael Boratko, Siddhartha Mishra, Shriya Atmakuri, Dhruvesh Patel, Xiang Lorraine Li, Andrew McCallum

    Abstract: Learning representations of words in a continuous space is perhaps the most fundamental task in NLP, however words interact in ways much richer than vector dot product similarity can provide. Many relationships between words can be expressed set-theoretically, for example, adjective-noun compounds (eg. "red cars"$\subseteq$"cars") and homographs (eg. "tongue"$\cap$"body" should be similar to "mout… ▽ More

    Submitted 8 June, 2022; v1 submitted 27 June, 2021; originally announced June 2021.

    Journal ref: Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) 2022

  6. arXiv:2104.04597  [pdf, other

    cs.AI cs.CL cs.LG

    Probabilistic Box Embeddings for Uncertain Knowledge Graph Reasoning

    Authors: Xuelu Chen, Michael Boratko, Muhao Chen, Shib Sankar Dasgupta, Xiang Lorraine Li, Andrew McCallum

    Abstract: Knowledge bases often consist of facts which are harvested from a variety of sources, many of which are noisy and some of which conflict, resulting in a level of uncertainty for each triple. Knowledge bases are also often incomplete, prompting the use of embedding methods to generalize from known facts, however, existing embedding methods only model triple-level uncertainty, and reasoning results… ▽ More

    Submitted 9 April, 2021; originally announced April 2021.

    Comments: NAACL-HLT 2021

  7. arXiv:2101.00345  [pdf, other

    cs.CL cs.AI cs.LG

    Modeling Fine-Grained Entity Types with Box Embeddings

    Authors: Yasumasa Onoe, Michael Boratko, Andrew McCallum, Greg Durrett

    Abstract: Neural entity typing models typically represent fine-grained entity types as vectors in a high-dimensional space, but such spaces are not well-suited to modeling these types' complex interdependencies. We study the ability of box embeddings, which embed concepts as d-dimensional hyperrectangles, to capture hierarchies of types even when these relationships are not defined explicitly in the ontolog… ▽ More

    Submitted 3 June, 2021; v1 submitted 1 January, 2021; originally announced January 2021.

    Comments: ACL 2021

  8. arXiv:2010.04831  [pdf, other

    cs.LG cs.AI stat.ML

    Improving Local Identifiability in Probabilistic Box Embeddings

    Authors: Shib Sankar Dasgupta, Michael Boratko, Dongxu Zhang, Luke Vilnis, Xiang Lorraine Li, Andrew McCallum

    Abstract: Geometric embeddings have recently received attention for their natural ability to represent transitive asymmetric relations via containment. Box embeddings, where objects are represented by n-dimensional hyperrectangles, are a particularly promising example of such an embedding as they are closed under intersection and their volume can be calculated easily, allowing them to naturally represent ca… ▽ More

    Submitted 28 October, 2020; v1 submitted 9 October, 2020; originally announced October 2020.

    Comments: Accepted at NeurIPS2020

  9. arXiv:2005.00771  [pdf, other

    cs.CL

    ProtoQA: A Question Answering Dataset for Prototypical Common-Sense Reasoning

    Authors: Michael Boratko, Xiang Lorraine Li, Rajarshi Das, Tim O'Gorman, Dan Le, Andrew McCallum

    Abstract: Given questions regarding some prototypical situation such as Name something that people usually do before they leave the house for work? a human can easily answer them via acquired experiences. There can be multiple right answers for such questions, with some more common for a situation than others. This paper introduces a new question answering dataset for training and evaluating common sense re… ▽ More

    Submitted 27 October, 2020; v1 submitted 2 May, 2020; originally announced May 2020.

    Comments: First four authors contribute equally

  10. arXiv:1806.00358  [pdf, other

    cs.AI cs.CL cs.IR

    A Systematic Classification of Knowledge, Reasoning, and Context within the ARC Dataset

    Authors: Michael Boratko, Harshit Padigela, Divyendra Mikkilineni, Pritish Yuvraj, Rajarshi Das, Andrew McCallum, Maria Chang, Achille Fokoue-Nkoutche, Pavan Kapanipathi, Nicholas Mattei, Ryan Musa, Kartik Talamadupula, Michael Witbrock

    Abstract: The recent work of Clark et al. introduces the AI2 Reasoning Challenge (ARC) and the associated ARC dataset that partitions open domain, complex science questions into an Easy Set and a Challenge Set. That paper includes an analysis of 100 questions with respect to the types of knowledge and reasoning required to answer them; however, it does not include clear definitions of these types, nor does… ▽ More

    Submitted 4 February, 2019; v1 submitted 1 June, 2018; originally announced June 2018.

    Comments: Presented at the Machine Reading for Question Answering (MRQA 2018) Workshop at the 55th Annual Meeting of the Association for Computational Linguistics (ACL 2018). 11 pages, 5 tables, 4 figures. Added missing citations in the latest draft