Showing 1–50 of 264 results for author: Bansal, M

Searching in archive cs.
  1. arXiv:2407.21783  [pdf, other]

    cs.AI cs.CL cs.CV

    The Llama 3 Herd of Models

    Authors: Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al-Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, Amy Yang, Angela Fan, Anirudh Goyal, Anthony Hartshorn, Aobo Yang, Archi Mitra, Archie Sravankumar, Artem Korenev, Arthur Hinsvark, Arun Rao, Aston Zhang, Aurelien Rodriguez, Austen Gregerson, Ava Spataru, Baptiste Roziere, Bethany Biron, Binh Tang, et al. (508 additional authors not shown)

    Abstract: Modern artificial intelligence (AI) systems are powered by foundation models. This paper presents a new set of foundation models, called Llama 3. It is a herd of language models that natively support multilinguality, coding, reasoning, and tool usage. Our largest model is a dense Transformer with 405B parameters and a context window of up to 128K tokens. This paper presents an extensive empirical…

    Submitted 31 July, 2024; originally announced July 2024.

  2. arXiv:2407.14414  [pdf, other]

    cs.AI cs.CL cs.LG

    System-1.x: Learning to Balance Fast and Slow Planning with Language Models

    Authors: Swarnadeep Saha, Archiki Prasad, Justin Chih-Yao Chen, Peter Hase, Elias Stengel-Eskin, Mohit Bansal

    Abstract: Language models can be used to solve long-horizon planning problems in two distinct modes: a fast 'System-1' mode, directly generating plans without any explicit search or backtracking, and a slow 'System-2' mode, planning step-by-step by explicitly searching over possible actions. While System-2 is typically more effective, it is also more computationally expensive, making it infeasible for long…

    Submitted 19 July, 2024; originally announced July 2024.

    Comments: 29 pages (10 tables)

  3. arXiv:2407.07035  [pdf, other]

    cs.CL cs.CV

    Vision-and-Language Navigation Today and Tomorrow: A Survey in the Era of Foundation Models

    Authors: Yue Zhang, Ziqiao Ma, Jialu Li, Yanyuan Qiao, Zun Wang, Joyce Chai, Qi Wu, Mohit Bansal, Parisa Kordjamshidi

    Abstract: Vision-and-Language Navigation (VLN) has gained increasing attention over recent years and many approaches have emerged to advance their development. The remarkable achievements of foundation models have shaped the challenges and proposed methods for VLN research. In this survey, we provide a top-down review that adopts a principled framework for embodied planning and reasoning, and emphasizes the…

    Submitted 9 July, 2024; originally announced July 2024.

    Comments: Authors contributed equally to this work, and supervisors contributed equal advising to this work

  4. arXiv:2406.19354  [pdf, other]

    cs.CL cs.AI

    Fundamental Problems With Model Editing: How Should Rational Belief Revision Work in LLMs?

    Authors: Peter Hase, Thomas Hofweber, Xiang Zhou, Elias Stengel-Eskin, Mohit Bansal

    Abstract: The model editing problem concerns how language models should learn new facts about the world over time. While empirical research on model editing has drawn widespread attention, the conceptual foundations of model editing remain shaky -- perhaps unsurprisingly, since model editing is essentially belief revision, a storied problem in philosophy that has eluded succinct solutions for decades. Model…

    Submitted 27 June, 2024; originally announced June 2024.

    Comments: 23 pages, 4 figures

  5. arXiv:2406.13023  [pdf]

    math.OC cs.LG

    Stackelberg Games with $k$-Submodular Function under Distributional Risk-Receptiveness and Robustness

    Authors: Seonghun Park, Manish Bansal

    Abstract: We study submodular optimization in adversarial context, applicable to machine learning problems such as feature selection using data susceptible to uncertainties and attacks. We focus on Stackelberg games between an attacker (or interdictor) and a defender where the attacker aims to minimize the defender's objective of maximizing a $k$-submodular function. We allow uncertainties arising from the…

    Submitted 28 June, 2024; v1 submitted 18 June, 2024; originally announced June 2024.

  6. arXiv:2406.11665  [pdf, other]

    cs.CL cs.AI cs.CV

    See It from My Perspective: Diagnosing the Western Cultural Bias of Large Vision-Language Models in Image Understanding

    Authors: Amith Ananthram, Elias Stengel-Eskin, Carl Vondrick, Mohit Bansal, Kathleen McKeown

    Abstract: Vision-language models (VLMs) can respond to queries about images in many languages. However, beyond language, culture affects how we see things. For example, individuals from Western cultures focus more on the central figure in an image while individuals from Eastern cultures attend more to scene context. In this work, we present a novel investigation that demonstrates and localizes VLMs' Western…

    Submitted 17 June, 2024; originally announced June 2024.

    Comments: 17 pages, 7 figures. Code/models: https://github.com/amith-ananthram/see-it-from-my-perspective

  7. arXiv:2406.07735  [pdf, other]

    cs.CL cs.LG

    REAL Sampling: Boosting Factuality and Diversity of Open-Ended Generation via Asymptotic Entropy

    Authors: Haw-Shiuan Chang, Nanyun Peng, Mohit Bansal, Anil Ramakrishna, Tagyoung Chung

    Abstract: Decoding methods for large language models (LLMs) usually struggle with the tradeoff between ensuring factuality and maintaining diversity. For example, a higher p threshold in the nucleus (top-p) sampling increases the diversity but decreases the factuality, and vice versa. In this paper, we propose REAL (Residual Entropy from Asymptotic Line) sampling, a decoding method that achieves improved fa…

    Submitted 11 June, 2024; originally announced June 2024.

  8. arXiv:2406.03442  [pdf, ps, other]

    cs.CL cs.AI

    Are language models rational? The case of coherence norms and belief revision

    Authors: Thomas Hofweber, Peter Hase, Elias Stengel-Eskin, Mohit Bansal

    Abstract: Do norms of rationality apply to machine learning models, in particular language models? In this paper we investigate this question by focusing on a special subset of rational norms: coherence norms. We consider both logical coherence norms as well as coherence norms tied to the strength of belief. To make sense of the latter, we introduce the Minimal Assent Connection (MAC) and propose a new acco…

    Submitted 10 August, 2024; v1 submitted 5 June, 2024; originally announced June 2024.

    Comments: added discussion and cross reference of new empirical work by the authors, updated references, fixed typos

  9. arXiv:2406.00842  [pdf, other]

    cs.CL

    The Power of Summary-Source Alignments

    Authors: Ori Ernst, Ori Shapira, Aviv Slobodkin, Sharon Adar, Mohit Bansal, Jacob Goldberger, Ran Levy, Ido Dagan

    Abstract: Multi-document summarization (MDS) is a challenging task, often decomposed to subtasks of salience and redundancy detection, followed by text generation. In this context, alignment of corresponding sentences between a reference summary and its source documents has been leveraged to generate training data for some of the component tasks. Yet, this enabling alignment step has usually been applied he…

    Submitted 2 June, 2024; originally announced June 2024.

    Comments: Accepted to ACL-Findings 2024

  10. arXiv:2405.21028  [pdf, other]

    cs.CL cs.AI

    LACIE: Listener-Aware Finetuning for Confidence Calibration in Large Language Models

    Authors: Elias Stengel-Eskin, Peter Hase, Mohit Bansal

    Abstract: When answering questions, LLMs can convey not only an answer, but a level of confidence about the answer being correct. This includes explicit confidence markers (e.g. giving a numeric score) as well as implicit markers, like an authoritative tone or elaborating with additional knowledge. For LLMs to be trustworthy knowledge sources, the confidence they convey should match their actual expertise;…

    Submitted 3 July, 2024; v1 submitted 31 May, 2024; originally announced May 2024.

    Comments: 18 pages. Code: https://github.com/esteng/pragmatic_calibration

  11. arXiv:2405.19209  [pdf, other]

    cs.CV cs.AI cs.CL

    VideoTree: Adaptive Tree-based Video Representation for LLM Reasoning on Long Videos

    Authors: Ziyang Wang, Shoubin Yu, Elias Stengel-Eskin, Jaehong Yoon, Feng Cheng, Gedas Bertasius, Mohit Bansal

    Abstract: Video-language understanding tasks have focused on short video clips, often struggling with long-form video understanding tasks. Recently, many long video-language understanding approaches have leveraged the reasoning capabilities of Large Language Models (LLMs) to perform long video QA, transforming videos into densely sampled frame captions, and asking LLMs to respond to text queries over captio…

    Submitted 29 May, 2024; originally announced May 2024.

    Comments: 20 pages, first three authors contributed equally; Project page: https://videotree2024.github.io/

  12. arXiv:2405.18406  [pdf, other]

    cs.CV cs.AI cs.CL

    RACCooN: Remove, Add, and Change Video Content with Auto-Generated Narratives

    Authors: Jaehong Yoon, Shoubin Yu, Mohit Bansal

    Abstract: Recent video generative models primarily rely on carefully written text prompts for specific tasks, like inpainting or style editing. They require labor-intensive textual descriptions for input videos, hindering their flexibility to adapt personal/raw videos to user specifications. This paper proposes RACCooN, a versatile and user-friendly video-to-paragraph-to-video generative framework that supp…

    Submitted 28 May, 2024; originally announced May 2024.

    Comments: The first two authors contribute equally. Project Page: https://raccoon-mllm-gen.github.io/

  13. arXiv:2405.04834  [pdf, other]

    cs.CV

    FlexEControl: Flexible and Efficient Multimodal Control for Text-to-Image Generation

    Authors: Xuehai He, Jian Zheng, Jacob Zhiyuan Fang, Robinson Piramuthu, Mohit Bansal, Vicente Ordonez, Gunnar A Sigurdsson, Nanyun Peng, Xin Eric Wang

    Abstract: Controllable text-to-image (T2I) diffusion models generate images conditioned on both text prompts and semantic inputs of other modalities like edge maps. Nevertheless, current controllable T2I methods commonly face challenges related to efficiency and faithfulness, especially when conditioning on multiple inputs from either the same or diverse modalities. In this paper, we propose a novel Flexibl…

    Submitted 21 May, 2024; v1 submitted 8 May, 2024; originally announced May 2024.

  14. arXiv:2404.09967  [pdf, other]

    cs.CV cs.AI cs.LG

    Ctrl-Adapter: An Efficient and Versatile Framework for Adapting Diverse Controls to Any Diffusion Model

    Authors: Han Lin, Jaemin Cho, Abhay Zala, Mohit Bansal

    Abstract: ControlNets are widely used for adding spatial control to text-to-image diffusion models with different conditions, such as depth maps, scribbles/sketches, and human poses. However, when it comes to controllable video generation, ControlNets cannot be directly integrated into new backbones due to feature space mismatches, and training ControlNets for new backbones can be a significant burden for m…

    Submitted 24 May, 2024; v1 submitted 15 April, 2024; originally announced April 2024.

    Comments: First two authors contributed equally; Project page: https://ctrl-adapter.github.io/

  15. arXiv:2404.00741  [pdf, other]

    cs.CV

    Rethinking Interactive Image Segmentation with Low Latency, High Quality, and Diverse Prompts

    Authors: Qin Liu, Jaemin Cho, Mohit Bansal, Marc Niethammer

    Abstract: The goal of interactive image segmentation is to delineate specific regions within an image via visual or language prompts. Low-latency and high-quality interactive segmentation with diverse prompts remain challenging for existing specialist and generalist models. Specialist models, with their limited prompts and task-specific designs, experience high latency because the image must be recomputed e…

    Submitted 31 March, 2024; originally announced April 2024.

    Comments: CVPR 2024 https://github.com/uncbiag/SegNext

  16. arXiv:2404.00399  [pdf, other]

    cs.CL cs.AI cs.LG

    Aurora-M: The First Open Source Multilingual Language Model Red-teamed according to the U.S. Executive Order

    Authors: Taishi Nakamura, Mayank Mishra, Simone Tedeschi, Yekun Chai, Jason T Stillerman, Felix Friedrich, Prateek Yadav, Tanmay Laud, Vu Minh Chien, Terry Yue Zhuo, Diganta Misra, Ben Bogin, Xuan-Son Vu, Marzena Karpinska, Arnav Varma Dantuluri, Wojciech Kusa, Tommaso Furlanello, Rio Yokota, Niklas Muennighoff, Suhas Pai, Tosin Adewumi, Veronika Laippala, Xiaozhe Yao, Adalberto Junior, Alpay Ariyak, et al. (20 additional authors not shown)

    Abstract: Pretrained language models underpin several AI applications, but their high computational cost for training limits accessibility. Initiatives such as BLOOM and StarCoder aim to democratize access to pretrained models for collaborative community development. However, such existing models face challenges: limited multilingual capabilities, continual pretraining causing catastrophic forgetting, where…

    Submitted 23 April, 2024; v1 submitted 30 March, 2024; originally announced April 2024.

    Comments: Preprint

  17. arXiv:2403.12014  [pdf, other]

    cs.CL cs.AI cs.LG

    EnvGen: Generating and Adapting Environments via LLMs for Training Embodied Agents

    Authors: Abhay Zala, Jaemin Cho, Han Lin, Jaehong Yoon, Mohit Bansal

    Abstract: Recent SOTA approaches for embodied learning via interaction directly employ large language models (LLMs) as agents to determine the next steps in an environment. Due to their world knowledge and reasoning capabilities, LLM agents achieve stronger performance than previous smaller agents based on reinforcement learning (RL); however, frequently calling LLMs is slow and expensive. Instead of direct…

    Submitted 12 July, 2024; v1 submitted 18 March, 2024; originally announced March 2024.

    Comments: COLM 2024; First two authors contributed equally; Project website: https://envgen-llm.github.io/

  18. arXiv:2403.08755  [pdf, other]

    cs.CV cs.AI cs.CL cs.LG

    DAM: Dynamic Adapter Merging for Continual Video QA Learning

    Authors: Feng Cheng, Ziyang Wang, Yi-Lin Sung, Yan-Bo Lin, Mohit Bansal, Gedas Bertasius

    Abstract: We present a parameter-efficient method for continual video question-answering (VidQA) learning. Our method, named DAM, uses the proposed Dynamic Adapter Merging to (i) mitigate catastrophic forgetting, (ii) enable efficient adaptation to continually arriving datasets, (iii) handle inputs from unknown datasets during inference, and (iv) enable knowledge sharing across similar dataset domains. Give…

    Submitted 22 April, 2024; v1 submitted 13 March, 2024; originally announced March 2024.

    Comments: The first two authors contribute equally

  19. arXiv:2403.06952  [pdf, other]

    cs.CV cs.AI cs.CL cs.LG

    SELMA: Learning and Merging Skill-Specific Text-to-Image Experts with Auto-Generated Data

    Authors: Jialu Li, Jaemin Cho, Yi-Lin Sung, Jaehong Yoon, Mohit Bansal

    Abstract: Recent text-to-image (T2I) generation models have demonstrated impressive capabilities in creating images from text descriptions. However, these T2I generation models often fall short of generating images that precisely match the details of the text inputs, such as incorrect spatial relationship or missing objects. In this paper, we introduce SELMA: Skill-Specific Expert Learning and Merging with…

    Submitted 11 March, 2024; originally announced March 2024.

    Comments: First two authors contributed equally; Project website: https://selma-t2i.github.io/

  20. arXiv:2403.02325  [pdf, other]

    cs.CV cs.AI cs.CL cs.LG

    Contrastive Region Guidance: Improving Grounding in Vision-Language Models without Training

    Authors: David Wan, Jaemin Cho, Elias Stengel-Eskin, Mohit Bansal

    Abstract: Highlighting particularly relevant regions of an image can improve the performance of vision-language models (VLMs) on various vision-language (VL) tasks by guiding the model to attend more closely to these regions of interest. For example, VLMs can be given a "visual prompt", where visual markers such as bounding boxes delineate key image regions. However, current VLMs that can incorporate visual…

    Submitted 4 March, 2024; originally announced March 2024.

    Comments: Project website: https://contrastive-region-guidance.github.io/

  21. arXiv:2402.18479  [pdf, other]

    cs.CL

    NewsQs: Multi-Source Question Generation for the Inquiring Mind

    Authors: Alyssa Hwang, Kalpit Dixit, Miguel Ballesteros, Yassine Benajiba, Vittorio Castelli, Markus Dreyer, Mohit Bansal, Kathleen McKeown

    Abstract: We present NewsQs (news-cues), a dataset that provides question-answer pairs for multiple news documents. To create NewsQs, we augment a traditional multi-document summarization dataset with questions automatically generated by a T5-Large model fine-tuned on FAQ-style news articles from the News On the Web corpus. We show that fine-tuning a model with control codes produces questions that are judg…

    Submitted 15 June, 2024; v1 submitted 28 February, 2024; originally announced February 2024.

    Comments: minor wording change

  22. arXiv:2402.17753  [pdf, other]

    cs.CL cs.AI cs.LG

    Evaluating Very Long-Term Conversational Memory of LLM Agents

    Authors: Adyasha Maharana, Dong-Ho Lee, Sergey Tulyakov, Mohit Bansal, Francesco Barbieri, Yuwei Fang

    Abstract: Existing works on long-term open-domain dialogues focus on evaluating model responses within contexts spanning no more than five chat sessions. Despite advancements in long-context large language models (LLMs) and retrieval augmented generation (RAG) techniques, their efficacy in very long-term dialogues remains unexplored. To address this research gap, we introduce a machine-human pipeline to gen…

    Submitted 27 February, 2024; originally announced February 2024.

    Comments: 19 pages; Project page: https://snap-research.github.io/locomo/

  23. arXiv:2402.13212  [pdf, other]

    cs.CL cs.AI cs.LG

    Soft Self-Consistency Improves Language Model Agents

    Authors: Han Wang, Archiki Prasad, Elias Stengel-Eskin, Mohit Bansal

    Abstract: Generations from large language models (LLMs) can be improved by sampling and scoring multiple solutions to select a final answer. Current "sample and select" methods such as self-consistency (SC) rely on majority voting to score answers. However, when tasks have many distinct and valid answers, selection by voting requires a large number of samples. This makes SC prohibitively expensive for inter…

    Submitted 5 June, 2024; v1 submitted 20 February, 2024; originally announced February 2024.

    Comments: ACL 2024 Camera-Ready, the first three authors contributed equally; Code: https://github.com/HanNight/soft_self_consistency

  24. arXiv:2402.12348  [pdf, other]

    cs.CL cs.AI cs.LG

    GTBench: Uncovering the Strategic Reasoning Limitations of LLMs via Game-Theoretic Evaluations

    Authors: Jinhao Duan, Renming Zhang, James Diffenderfer, Bhavya Kailkhura, Lichao Sun, Elias Stengel-Eskin, Mohit Bansal, Tianlong Chen, Kaidi Xu

    Abstract: As Large Language Models (LLMs) are integrated into critical real-world applications, their strategic and logical reasoning abilities are increasingly crucial. This paper evaluates LLMs' reasoning abilities in competitive environments through game-theoretic tasks, e.g., board and card games that require pure logic and strategic reasoning to compete with opponents. We first propose GTBench, a langu…

    Submitted 10 June, 2024; v1 submitted 19 February, 2024; originally announced February 2024.

    Comments: 26 pages; the first two authors contributed equally; GTBench HF Leaderboard: https://huggingface.co/spaces/GTBench/GTBench

  25. arXiv:2402.08787  [pdf, other]

    cs.LG cs.CL

    Rethinking Machine Unlearning for Large Language Models

    Authors: Sijia Liu, Yuanshun Yao, Jinghan Jia, Stephen Casper, Nathalie Baracaldo, Peter Hase, Yuguang Yao, Chris Yuhao Liu, Xiaojun Xu, Hang Li, Kush R. Varshney, Mohit Bansal, Sanmi Koyejo, Yang Liu

    Abstract: We explore machine unlearning (MU) in the domain of large language models (LLMs), referred to as LLM unlearning. This initiative aims to eliminate undesirable data influence (e.g., sensitive or illegal information) and the associated model capabilities, while maintaining the integrity of essential knowledge generation and not affecting causally unrelated information. We envision LLM unlearning bec…

    Submitted 14 July, 2024; v1 submitted 13 February, 2024; originally announced February 2024.

  26. arXiv:2402.06492  [pdf, other]

    cs.CL cs.AI cs.LG

    Inducing Systematicity in Transformers by Attending to Structurally Quantized Embeddings

    Authors: Yichen Jiang, Xiang Zhou, Mohit Bansal

    Abstract: Transformers generalize to novel compositions of structures and entities after being trained on a complex dataset, but easily overfit on datasets of insufficient complexity. We observe that when the training set is sufficiently complex, the model encodes sentences that have a common syntactic structure using a systematic attention pattern. Inspired by this observation, we propose SQ-Transformer (S…

    Submitted 9 February, 2024; originally announced February 2024.

    Comments: 22 pages, code: https://github.com/jiangycTarheel/SQ-Transformer

  27. arXiv:2402.05889  [pdf, other]

    cs.CV cs.AI cs.CL

    CREMA: Generalizable and Efficient Video-Language Reasoning via Multimodal Modular Fusion

    Authors: Shoubin Yu, Jaehong Yoon, Mohit Bansal

    Abstract: Despite impressive advancements in recent multimodal reasoning approaches, they are still limited in flexibility and efficiency, as these models typically process only a few fixed modality inputs and require updates to numerous parameters. This paper tackles these critical challenges and proposes CREMA, a generalizable, highly efficient, and modular modality-fusion framework that can incorporate a…

    Submitted 12 June, 2024; v1 submitted 8 February, 2024; originally announced February 2024.

    Comments: first two authors contributed equally. Project page: https://CREMA-VideoLLM.github.io/

  28. arXiv:2402.03702  [pdf, ps, other]

    cs.IT cs.NI

    On Learning Spatial Provenance in Privacy-Constrained Wireless Networks

    Authors: Manish Bansal, Pramsu Srivastava, J. Harshan

    Abstract: In Vehicle-to-Everything networks that involve multi-hop communication, the Road Side Units (RSUs) typically aim to collect location information from the participating vehicles to provide security and network diagnostics features. While the vehicles commonly use the Global Positioning System (GPS) for navigation, they may refrain from sharing their precise GPS coordinates with the RSUs due to priv…

    Submitted 5 February, 2024; originally announced February 2024.

    Comments: To be presented in IEEE WCNC 2024

  29. arXiv:2402.03561  [pdf, other]

    cs.CV cs.AI cs.CL

    VLN-Video: Utilizing Driving Videos for Outdoor Vision-and-Language Navigation

    Authors: Jialu Li, Aishwarya Padmakumar, Gaurav Sukhatme, Mohit Bansal

    Abstract: Outdoor Vision-and-Language Navigation (VLN) requires an agent to navigate through realistic 3D outdoor environments based on natural language instructions. The performance of existing VLN methods is limited by insufficient diversity in navigation environments and limited training data. To address these issues, we propose VLN-Video, which utilizes the diverse outdoor environments present in drivin…

    Submitted 7 February, 2024; v1 submitted 5 February, 2024; originally announced February 2024.

    Comments: AAAI 2024

  30. arXiv:2402.01620  [pdf, other]

    cs.CL

    MAGDi: Structured Distillation of Multi-Agent Interaction Graphs Improves Reasoning in Smaller Language Models

    Authors: Justin Chih-Yao Chen, Swarnadeep Saha, Elias Stengel-Eskin, Mohit Bansal

    Abstract: Multi-agent interactions between Large Language Model (LLM) agents have shown major improvements on diverse reasoning tasks. However, these involve long generations from multiple models across several rounds, making them expensive. Moreover, these multi-agent approaches fail to provide a final, single model for efficient inference. To address this, we introduce MAGDi, a new method for structured d…

    Submitted 7 June, 2024; v1 submitted 2 February, 2024; originally announced February 2024.

    Comments: ICML 2024 (Camera-ready); First two authors contributed equally; GitHub: https://github.com/dinobby/MAGDi

  31. arXiv:2401.16467  [pdf, other]

    cs.SE cs.AI cs.CL cs.LG cs.PL

    ReGAL: Refactoring Programs to Discover Generalizable Abstractions

    Authors: Elias Stengel-Eskin, Archiki Prasad, Mohit Bansal

    Abstract: While large language models (LLMs) are increasingly being used for program synthesis, they lack the global view needed to develop useful abstractions; they generally predict programs one at a time, often repeating the same functionality. Generating redundant code from scratch is both inefficient and error-prone. To address this, we propose Refactoring for Generalizable Abstraction Learning (ReGAL)…

    Submitted 6 June, 2024; v1 submitted 29 January, 2024; originally announced January 2024.

    Comments: ICML 2024 Camera-Ready; First two authors contributed equally; Code: https://github.com/esteng/regal_program_learning

  32. arXiv:2401.15900  [pdf, other]

    cs.CV

    MV2MAE: Multi-View Video Masked Autoencoders

    Authors: Ketul Shah, Robert Crandall, Jie Xu, Peng Zhou, Marian George, Mayank Bansal, Rama Chellappa

    Abstract: Videos captured from multiple viewpoints can help in perceiving the 3D structure of the world and benefit computer vision tasks such as action recognition, tracking, etc. In this paper, we present a method for self-supervised learning from synchronized multi-view videos. We use a cross-view reconstruction task to inject geometry information in the model. Our approach is based on the masked autoenc…

    Submitted 29 January, 2024; originally announced January 2024.

  33. arXiv:2401.10529  [pdf, other]

    cs.CV cs.AI cs.CL cs.LG

    Mementos: A Comprehensive Benchmark for Multimodal Large Language Model Reasoning over Image Sequences

    Authors: Xiyao Wang, Yuhang Zhou, Xiaoyu Liu, Hongjin Lu, Yuancheng Xu, Feihong He, Jaehong Yoon, Taixi Lu, Gedas Bertasius, Mohit Bansal, Huaxiu Yao, Furong Huang

    Abstract: Multimodal Large Language Models (MLLMs) have demonstrated proficiency in handling a variety of visual-language tasks. However, current MLLM benchmarks are predominantly designed to evaluate reasoning based on static information about a single image, and the ability of modern MLLMs to extrapolate from image sequences, which is essential for understanding our ever-changing world, has been less inve…

    Submitted 24 January, 2024; v1 submitted 19 January, 2024; originally announced January 2024.

    Comments: 27 pages, 23 figures

  34. arXiv:2401.06751  [pdf, other]

    cs.CL cs.AI cs.LG

    The Unreasonable Effectiveness of Easy Training Data for Hard Tasks

    Authors: Peter Hase, Mohit Bansal, Peter Clark, Sarah Wiegreffe

    Abstract: How can we train models to perform well on hard test data when hard training data is by definition difficult to label correctly? This question has been termed the scalable oversight problem and has drawn increasing attention as language models have continually improved. In this paper, we present the surprising conclusion that current pretrained language models often generalize relatively well from…

    Submitted 5 June, 2024; v1 submitted 12 January, 2024; originally announced January 2024.

    Comments: ACL 2024. 23 pages, 20 figures

  35. arXiv:2401.06638  [pdf, ps, other]

    cs.NI cs.CR

    A Prototype on the Feasibility of Learning Spatial Provenance in XBee and LoRa Networks

    Authors: Manish Bansal, Pramsu Shrivastava, J. Harshan

    Abstract: In Vehicle-to-Everything (V2X) networks that involve multi-hop communication, the Road Side Units (RSUs) typically desire to gather the location information of the participating vehicles to provide security and network-diagnostics features. Although Global Positioning System (GPS) based localization is widely used by vehicles for navigation; they may not forward their exact GPS coordinates to the…

    Submitted 12 January, 2024; originally announced January 2024.

    Comments: Short paper on prototype demonstration

  36. arXiv:2401.05561  [pdf, other]

    cs.CL

    TrustLLM: Trustworthiness in Large Language Models

    Authors: Lichao Sun, Yue Huang, Haoran Wang, Siyuan Wu, Qihui Zhang, Yuan Li, Chujie Gao, Yixin Huang, Wenhan Lyu, Yixuan Zhang, Xiner Li, Zhengliang Liu, Yixin Liu, Yijue Wang, Zhikun Zhang, Bertie Vidgen, Bhavya Kailkhura, Caiming Xiong, Chaowei Xiao, Chunyuan Li, Eric Xing, Furong Huang, Hao Liu, Heng Ji, Hongyi Wang, et al. (45 additional authors not shown)

    Abstract: Large language models (LLMs), exemplified by ChatGPT, have gained considerable attention for their excellent natural language processing capabilities. Nonetheless, these LLMs present many challenges, particularly in the realm of trustworthiness. Therefore, ensuring the trustworthiness of LLMs emerges as an important topic. This paper introduces TrustLLM, a comprehensive study of trustworthiness in…

    Submitted 17 March, 2024; v1 submitted 10 January, 2024; originally announced January 2024.

    Comments: This work is still under work and we welcome your contribution

  37. arXiv:2312.17235  [pdf, other]

    cs.CV

    A Simple LLM Framework for Long-Range Video Question-Answering

    Authors: Ce Zhang, Taixi Lu, Md Mohaiminul Islam, Ziyang Wang, Shoubin Yu, Mohit Bansal, Gedas Bertasius

    Abstract: We present LLoVi, a language-based framework for long-range video question-answering (LVQA). Unlike prior long-range video understanding methods, which are often costly and require specialized long-range video modeling design (e.g., memory queues, state-space layers, etc.), our approach uses a frame/clip-level visual captioner (e.g., BLIP2, LaViLa, LLaVA) coupled with a Large Language Model (GPT-3…

    Submitted 26 February, 2024; v1 submitted 28 December, 2023; originally announced December 2023.

  38. arXiv:2312.11805  [pdf, other]

    cs.CL cs.AI cs.CV

    Gemini: A Family of Highly Capable Multimodal Models

    Authors: Gemini Team, Rohan Anil, Sebastian Borgeaud, Jean-Baptiste Alayrac, Jiahui Yu, Radu Soricut, Johan Schalkwyk, Andrew M. Dai, Anja Hauth, Katie Millican, David Silver, Melvin Johnson, Ioannis Antonoglou, Julian Schrittwieser, Amelia Glaese, Jilin Chen, Emily Pitler, Timothy Lillicrap, Angeliki Lazaridou, Orhan Firat, James Molloy, Michael Isard, Paul R. Barham, Tom Hennigan, Benjamin Lee, et al. (1325 additional authors not shown)

    Abstract: This report introduces a new family of multimodal models, Gemini, that exhibit remarkable capabilities across image, audio, video, and text understanding. The Gemini family consists of Ultra, Pro, and Nano sizes, suitable for applications ranging from complex reasoning tasks to on-device memory-constrained use-cases. Evaluation on a broad range of benchmarks shows that our most-capable Gemini Ultr…

    Submitted 17 June, 2024; v1 submitted 18 December, 2023; originally announced December 2023.

  39. arXiv:2312.04339  [pdf, other]

    cs.LG cs.CL

    Merging by Matching Models in Task Parameter Subspaces

    Authors: Derek Tam, Mohit Bansal, Colin Raffel

    Abstract: Model merging aims to cheaply combine individual task-specific models into a single multitask model. In this work, we view past merging methods as leveraging different notions of a "task parameter subspace" in which models are matched before being merged. We connect the task parameter subspace of a given model to its loss landscape and formalize how this approach to model merging can be seen as…

    Submitted 13 April, 2024; v1 submitted 7 December, 2023; originally announced December 2023.

    Comments: TMLR

  40. arXiv:2311.18775  [pdf, other]

    cs.CV cs.AI cs.CL cs.LG cs.SD eess.AS

    CoDi-2: In-Context, Interleaved, and Interactive Any-to-Any Generation

    Authors: Zineng Tang, Ziyi Yang, Mahmoud Khademi, Yang Liu, Chenguang Zhu, Mohit Bansal

    Abstract: We present CoDi-2, a versatile and interactive Multimodal Large Language Model (MLLM) that can follow complex multimodal interleaved instructions, conduct in-context learning (ICL), reason, chat, edit, etc., in an any-to-any input-output modality paradigm. By aligning modalities with language for both encoding and generation, CoDi-2 empowers Large Language Models (LLMs) to not only understand comp…

    Submitted 30 November, 2023; originally announced November 2023.

    Comments: Project Page: https://codi-2.github.io/

  41. arXiv:2311.16941  [pdf, other]

    cs.LG cs.AI cs.CL cs.CV stat.ME

    Debiasing Multimodal Models via Causal Information Minimization

    Authors: Vaidehi Patil, Adyasha Maharana, Mohit Bansal

    Abstract: Most existing debiasing methods for multimodal models, including causal intervention and inference methods, utilize approximate heuristics to represent the biases, such as shallow features from early stages of training or unimodal features for multimodal tasks like VQA, etc., which may not be accurate. In this paper, we study bias arising from confounders in a causal graph for multimodal data and…

    Submitted 28 November, 2023; originally announced November 2023.

    Comments: EMNLP 2023 Findings (16 pages)

  42. arXiv:2311.13171  [pdf, other]

    cs.LG cs.AI cs.CL

    ComPEFT: Compression for Communicating Parameter Efficient Updates via Sparsification and Quantization

    Authors: Prateek Yadav, Leshem Choshen, Colin Raffel, Mohit Bansal

    Abstract: Parameter-efficient fine-tuning (PEFT) techniques make it possible to efficiently adapt a language model to create "expert" models that specialize to new tasks or domains. Recent techniques in model merging and compositional generalization leverage these expert models by dynamically composing modules to improve zero/few-shot generalization. Despite the efficiency of PEFT methods, the size of exper…

    Submitted 22 November, 2023; originally announced November 2023.

    Comments: 25 Pages, 6 Figures, 16 Tables

  43. arXiv:2311.10707  [pdf, other]

    cs.LG cs.CV

    Multimodal Representation Learning by Alternating Unimodal Adaptation

    Authors: Xiaohui Zhang, Jaehong Yoon, Mohit Bansal, Huaxiu Yao

    Abstract: Multimodal learning, which integrates data from diverse sensory modes, plays a pivotal role in artificial intelligence. However, existing multimodal learning methods often struggle with challenges where some modalities appear more dominant than others during multimodal learning, resulting in suboptimal performance. To address this challenge, we propose MLA (Multimodal Learning with Alternating Uni…

    Submitted 1 April, 2024; v1 submitted 17 November, 2023; originally announced November 2023.

    Comments: Accepted by CVPR 2024

  44. arXiv:2311.05772  [pdf, other]

    cs.AI cs.CL cs.LG

    ADaPT: As-Needed Decomposition and Planning with Language Models

    Authors: Archiki Prasad, Alexander Koller, Mareike Hartmann, Peter Clark, Ashish Sabharwal, Mohit Bansal, Tushar Khot

    Abstract: Large Language Models (LLMs) are increasingly being used for interactive decision-making tasks requiring planning and adapting to the environment. Recent works employ LLMs-as-agents in broadly two ways: iteratively determining the next action (iterative executors) or generating plans and executing sub-tasks using LLMs (plan-and-execute). However, these methods struggle with task complexity, as the…

    Submitted 8 April, 2024; v1 submitted 8 November, 2023; originally announced November 2023.

    Comments: NAACL 2024 (findings) camera-ready. Project Page: https://allenai.github.io/adaptllm

  45. arXiv:2311.04420  [pdf, other]

    cs.CL cs.AI cs.LG

    Data Factors for Better Compositional Generalization

    Authors: Xiang Zhou, Yichen Jiang, Mohit Bansal

    Abstract: Recent diagnostic datasets on compositional generalization, such as SCAN (Lake and Baroni, 2018) and COGS (Kim and Linzen, 2020), expose severe problems in models trained from scratch on these datasets. However, in contrast to this poor performance, state-of-the-art models trained on larger and more general datasets show better generalization ability. In this work, to reconcile this inconsistency,…

    Submitted 7 November, 2023; originally announced November 2023.

    Comments: EMNLP 2023 (18 pages)

  46. arXiv:2310.18235  [pdf, other]

    cs.CV cs.AI cs.CL cs.LG

    Davidsonian Scene Graph: Improving Reliability in Fine-grained Evaluation for Text-to-Image Generation

    Authors: Jaemin Cho, Yushi Hu, Roopal Garg, Peter Anderson, Ranjay Krishna, Jason Baldridge, Mohit Bansal, Jordi Pont-Tuset, Su Wang

    Abstract: Evaluating text-to-image models is notoriously difficult. A strong recent approach for assessing text-image faithfulness is based on QG/A (question generation and answering), which uses pre-trained foundational models to automatically generate a set of questions and answers from the prompt, and output images are scored based on whether these answers extracted with a visual question answering model…

    Submitted 13 March, 2024; v1 submitted 27 October, 2023; originally announced October 2023.

    Comments: ICLR 2024; Project website: https://google.github.io/dsg

  47. arXiv:2310.15123  [pdf, other]

    cs.CL cs.AI cs.LG

    Branch-Solve-Merge Improves Large Language Model Evaluation and Generation

    Authors: Swarnadeep Saha, Omer Levy, Asli Celikyilmaz, Mohit Bansal, Jason Weston, Xian Li

    Abstract: Large Language Models (LLMs) are frequently used for multi-faceted language generation and evaluation tasks that involve satisfying intricate user constraints or taking into account multiple aspects and criteria. However, their performance can fall short, due to the model's lack of coherence and inability to plan and decompose the problem. We propose Branch-Solve-Merge (BSM), a Large Language Mode…

    Submitted 7 June, 2024; v1 submitted 23 October, 2023; originally announced October 2023.

    Comments: NAACL 2024 (19 pages, 7 figures, 11 tables)

  48. arXiv:2310.12128  [pdf, other]

    cs.CV cs.AI cs.CL cs.LG

    DiagrammerGPT: Generating Open-Domain, Open-Platform Diagrams via LLM Planning

    Authors: Abhay Zala, Han Lin, Jaemin Cho, Mohit Bansal

    Abstract: Text-to-image (T2I) generation has seen significant growth over the past few years. Despite this, there has been little work on generating diagrams with T2I models. A diagram is a symbolic/schematic representation that explains information using structurally rich and spatially complex visualizations (e.g., a dense combination of related objects, text labels, directional arrows/lines, etc.). Existi…

    Submitted 15 July, 2024; v1 submitted 18 October, 2023; originally announced October 2023.

    Comments: COLM 2024; Project page: https://diagrammerGPT.github.io/

  49. arXiv:2310.10623  [pdf, other]

    cs.CL cs.AI cs.LG

    Generating Summaries with Controllable Readability Levels

    Authors: Leonardo F. R. Ribeiro, Mohit Bansal, Markus Dreyer

    Abstract: Readability refers to how easily a reader can understand a written text. Several factors affect the readability level, such as the complexity of the text, its subject matter, and the reader's background knowledge. Generating summaries based on different readability levels is critical for enabling knowledge consumption by diverse audiences. However, current text generation approaches lack refined c…

    Submitted 16 October, 2023; originally announced October 2023.

    Comments: Accepted as an EMNLP 2023 main paper

  50. arXiv:2310.07931  [pdf, other]

    cs.LG cs.AI cs.CL cs.CV

    D2 Pruning: Message Passing for Balancing Diversity and Difficulty in Data Pruning

    Authors: Adyasha Maharana, Prateek Yadav, Mohit Bansal

    Abstract: Analytical theories suggest that higher-quality data can lead to lower test errors in models trained on a fixed data budget. Moreover, a model can be trained on a lower compute budget without compromising performance if a dataset can be stripped of its redundancies. Coreset selection (or data pruning) seeks to select a subset of the training data so as to maximize the performance of models trained…

    Submitted 11 October, 2023; originally announced October 2023.

    Comments: 17 pages (Our code is available at https://github.com/adymaharana/d2pruning)