
Why do LLaVA Vision-Language Models Reply to Images in English?

Authors: Musashi Hinck, Carolin Holtermann, Matthew Lyle Olson, Florian Schneider, Sungduk Yu, Anahita Bhiwandiwalla, Anne Lauscher, Shaoyen Tseng, Vasudev Lal

Abstract: We uncover a surprising multilingual bias occurring in a popular class of multimodal vision-language models (VLMs). Including an image in the query to a LLaVA-style VLM significantly increases the likelihood of the model returning an English response, regardless of the language of the query. This paper investigates the causes of this loss with a two-pronged approach that combines extensive ablation of the design space with a mechanistic analysis of the models' internal representations of image and text inputs. Both approaches indicate that the issue stems from the language modelling component of the LLaVA model. Statistically, we find that switching the language backbone for a bilingual language model has the strongest effect on reducing this error. Mechanistically, we provide compelling evidence that visual inputs are not mapped to a similar space as text ones, and that intervening on intermediary attention layers can reduce this bias. Our findings provide important insights to researchers and engineers seeking to understand the crossover between multimodal and multilingual spaces, and contribute to the goal of developing capable and inclusive VLMs for non-English contexts.

Submitted 2 July, 2024; originally announced July 2024.

Comments: Pre-print

arXiv:2404.03118 [pdf, other]

LVLM-Interpret: An Interpretability Tool for Large Vision-Language Models

Authors: Gabriela Ben Melech Stan, Estelle Aflalo, Raanan Yehezkel Rohekar, Anahita Bhiwandiwalla, Shao-Yen Tseng, Matthew Lyle Olson, Yaniv Gurwicz, Chenfei Wu, Nan Duan, Vasudev Lal

Abstract: In the rapidly evolving landscape of artificial intelligence, multi-modal large language models are emerging as a significant area of interest. These models, which combine various forms of data input, are becoming increasingly popular. However, understanding their internal mechanisms remains a complex task. Numerous advancements have been made in the field of explainability tools and mechanisms, yet there is still much to explore. In this work, we present a novel interactive application aimed towards understanding the internal mechanisms of large vision-language models. Our interface is designed to enhance the interpretability of the image patches, which are instrumental in generating an answer, and assess the efficacy of the language model in grounding its output in the image. With our application, a user can systematically investigate the model and uncover system limitations, paving the way for enhancements in system capabilities. Finally, we present a case study of how our application can aid in understanding failure mechanisms in a popular large multi-modal model: LLaVA.

Submitted 24 June, 2024; v1 submitted 3 April, 2024; originally announced April 2024.

arXiv:2404.01331 [pdf, other]

LLaVA-Gemma: Accelerating Multimodal Foundation Models with a Compact Language Model

Authors: Musashi Hinck, Matthew L. Olson, David Cobbley, Shao-Yen Tseng, Vasudev Lal

Abstract: We train a suite of multimodal foundation models (MMFM) using the popular LLaVA framework with the recently released Gemma family of large language models (LLMs). Of particular interest is the 2B parameter Gemma model, which provides opportunities to construct capable small-scale MMFMs. In line with findings from other papers in this space, we test the effect of ablating three design features: pretraining the connector, utilizing a more powerful image backbone, and increasing the size of the language backbone. The resulting models, which we call LLaVA-Gemma, exhibit moderate performance on an array of evaluations, but fail to improve past the current comparably sized SOTA models. Closer analysis of performance shows mixed effects; skipping pretraining tends to reduce performance, larger vision models sometimes improve performance, and increasing language model size has inconsistent effects. We publicly release training recipes, code and weights for the LLaVA-Gemma models.

Submitted 10 June, 2024; v1 submitted 29 March, 2024; originally announced April 2024.

Comments: CVPR 2024, MMFM workshop. Authors 1 and 2 contributed equally. Models available at https://huggingface.co/intel/llava-gemma-2b/ and https://huggingface.co/intel/llava-gemma-7b/; training code at https://github.com/IntelLabs/multimodal_cognitive_ai/tree/main/LLaVA-Gemma

arXiv:2312.03642 [pdf, other]

Transformer-Powered Surrogates Close the ICF Simulation-Experiment Gap with Extremely Limited Data

Authors: Matthew L. Olson, Shusen Liu, Jayaraman J. Thiagarajan, Bogdan Kustowski, Weng-Keen Wong, Rushil Anirudh

Abstract: Recent advances in machine learning, specifically transformer architecture, have led to significant advancements in commercial domains. These powerful models have demonstrated superior capability to learn complex relationships and often generalize better to new data and problems. This paper presents a novel transformer-powered approach for enhancing prediction accuracy in multi-modal output scenarios, where sparse experimental data is supplemented with simulation data. The proposed approach integrates transformer-based architecture with a novel graph-based hyper-parameter optimization technique. The resulting system not only effectively reduces simulation bias, but also achieves superior prediction accuracy compared to the prior method. We demonstrate the efficacy of our approach on inertial confinement fusion experiments, where only 10 shots of real-world data are available, as well as synthetic versions of these experiments.

Submitted 28 May, 2024; v1 submitted 6 December, 2023; originally announced December 2023.

Comments: MLST

arXiv:2303.10774 [pdf, other]

Cross-GAN Auditing: Unsupervised Identification of Attribute Level Similarities and Differences between Pretrained Generative Models

Authors: Matthew L. Olson, Shusen Liu, Rushil Anirudh, Jayaraman J. Thiagarajan, Peer-Timo Bremer, Weng-Keen Wong

Abstract: Generative Adversarial Networks (GANs) are notoriously difficult to train especially for complex distributions and with limited data. This has driven the need for tools to audit trained networks in human intelligible format, for example, to identify biases or ensure fairness. Existing GAN audit tools are restricted to coarse-grained, model-data comparisons based on summary statistics such as FID or recall. In this paper, we propose an alternative approach that compares a newly developed GAN against a prior baseline. To this end, we introduce Cross-GAN Auditing (xGA) that, given an established "reference" GAN and a newly proposed "client" GAN, jointly identifies intelligible attributes that are either common across both GANs, novel to the client GAN, or missing from the client GAN. This provides both users and model developers an intuitive assessment of similarity and differences between GANs. We introduce novel metrics to evaluate attribute-based GAN auditing approaches and use these metrics to demonstrate quantitatively that xGA outperforms baseline approaches. We also include qualitative results that illustrate the common, novel and missing attributes identified by xGA from GANs trained on a variety of image datasets.

Submitted 2 May, 2023; v1 submitted 19 March, 2023; originally announced March 2023.

Comments: CVPR 2023. Source code is available at https://github.com/mattolson93/cross_gan_auditing

arXiv:2302.12689 [pdf, other]

GANterfactual-RL: Understanding Reinforcement Learning Agents' Strategies through Visual Counterfactual Explanations

Authors: Tobias Huber, Maximilian Demmler, Silvan Mertes, Matthew L. Olson, Elisabeth André

Abstract: Counterfactual explanations are a common tool to explain artificial intelligence models. For Reinforcement Learning (RL) agents, they answer "Why not?" or "What if?" questions by illustrating what minimal change to a state is needed such that an agent chooses a different action. Generating counterfactual explanations for RL agents with visual input is especially challenging because of their large state spaces and because their decisions are part of an overarching policy, which includes long-term decision-making. However, research focusing on counterfactual explanations, specifically for RL agents with visual input, is scarce and does not go beyond identifying defective agents. It is unclear whether counterfactual explanations are still helpful for more complex tasks like analyzing the learned strategies of different agents or choosing a fitting agent for a specific task. We propose a novel but simple method to generate counterfactual explanations for RL agents by formulating the problem as a domain transfer problem which allows the use of adversarial learning techniques like StarGAN. Our method is fully model-agnostic and we demonstrate that it outperforms the only previous method in several computational metrics. Furthermore, we show in a user study that our method performs best when analyzing which strategies different agents pursue.

Submitted 24 February, 2023; originally announced February 2023.

arXiv:2209.13129 [pdf, other]

Deep Generative Multimedia Children's Literature

Authors: Matthew L. Olson

Abstract: Artistic work leveraging Machine Learning techniques is an increasingly popular endeavour for those with a creative lean. However, most work is done in a single domain: text, images, music, etc. In this work, I design a system for a machine learning created multimedia experience, specifically in the genre of children's literature. We detail the process for exclusively using publicly available pretrained deep neural network based models, I present multiple examples of the work my system creates, and I explore the problems associated in this area of creative work.

Submitted 10 January, 2023; v1 submitted 26 September, 2022; originally announced September 2022.

Comments: AAAI 2023 Workshop on Creative AI Across Modalities

arXiv:2203.02067 [pdf, ps, other]

Heat transport in a hierarchy of reduced-order convection models

Authors: Matthew L. Olson, Charles R. Doering

Abstract: Reduced-order models (ROMs) are systems of ordinary differential equations (ODEs) designed to approximate the dynamics of partial differential equations (PDEs). In this work, a distinguished hierarchy of ROMs is constructed for Rayleigh's 1916 model of natural thermal convection. These models are distinguished in the sense that they preserve energy and vorticity balances derived from the governing equations, and each is capable of modeling zonal flow. Various models from the hierarchy are analyzed to determine the maximal heat transport in a given model, measured by the dimensionless Nusselt number, for a given Rayleigh number. Lower bounds on the maximal heat transport are ascertained by computing the Nusselt number among equilibria of the chosen model using numerical continuation. A method known as sum-of-squares optimization is applied to construct upper bounds on the time-averaged Nusselt number. In this case, the sum-of-squares approach involves constructing a polynomial quantity whose global nonnegativity implies the upper bound along all solutions to a chosen ROM. The minimum such bound is determined through a type of convex optimization called semidefinite programming. For the ROMs studied in this work, the Nusselt number is maximized by equilibria whenever the Rayleigh number is sufficiently small. In this range of Rayleigh number, the equilibria maximizing heat transport are those that bifurcate first from the zero state. Analyzing this primary equilibrium branch provides a possible mechanism for the increase in heat transport near the onset of convection.

Submitted 3 March, 2022; originally announced March 2022.

Comments: 36 pages

arXiv:2108.08000 [pdf, other]

Contrastive Identification of Covariate Shift in Image Data

Authors: Matthew L. Olson, Thuy-Vy Nguyen, Gaurav Dixit, Neale Ratzlaff, Weng-Keen Wong, Minsuk Kahng

Abstract: Identifying covariate shift is crucial for making machine learning systems robust in the real world and for detecting training data biases that are not reflected in test data. However, detecting covariate shift is challenging, especially when the data consists of high-dimensional images, and when multiple types of localized covariate shift affect different subspaces of the data. Although automated techniques can be used to detect the existence of covariate shift, our goal is to help human users characterize the extent of covariate shift in large image datasets with interfaces that seamlessly integrate information obtained from the detection algorithms. In this paper, we design and evaluate a new visual interface that facilitates the comparison of the local distributions of training and test data. We conduct a quantitative user study on multi-attribute facial data to compare two different learned low-dimensional latent representations (pretrained ImageNet CNN vs. density ratio) and two user analytic workflows (nearest-neighbor vs. cluster-to-cluster). Our results indicate that the latent representation of our density ratio model, combined with a nearest-neighbor comparison, is the most effective at helping humans identify covariate shift.

Submitted 19 August, 2021; v1 submitted 18 August, 2021; originally announced August 2021.

Comments: IEEE VIS 2021

arXiv:2101.12446 [pdf, other]

doi 10.1016/j.artint.2021.103455

Counterfactual State Explanations for Reinforcement Learning Agents via Generative Deep Learning

Authors: Matthew L. Olson, Roli Khanna, Lawrence Neal, Fuxin Li, Weng-Keen Wong

Abstract: Counterfactual explanations, which deal with "why not?" scenarios, can provide insightful explanations to an AI agent's behavior. In this work, we focus on generating counterfactual explanations for deep reinforcement learning (RL) agents which operate in visual input environments like Atari. We introduce counterfactual state explanations, a novel example-based approach to counterfactual explanations based on generative deep learning. Specifically, a counterfactual state illustrates what minimal change is needed to an Atari game image such that the agent chooses a different action. We also evaluate the effectiveness of counterfactual states on human participants who are not machine learning experts. Our first user study investigates if humans can discern if the counterfactual state explanations are produced by the actual game or produced by a generative deep learning approach. Our second user study investigates if counterfactual state explanations can help non-expert participants identify a flawed agent; we compare against a baseline approach based on a nearest neighbor explanation which uses images from the actual game. Our results indicate that counterfactual state explanations have sufficient fidelity to the actual game images to enable non-experts to more effectively identify a flawed RL agent compared to the nearest neighbor baseline and to having no explanation at all.

Submitted 29 January, 2021; originally announced January 2021.

Comments: Full source code available at https://github.com/mattolson93/counterfactual-state-explanations

Journal ref: Artificial Intelligence, 2021, 103455, ISSN 0004-3702

arXiv:2004.07204 [pdf, ps, other]

doi 10.1016/j.physd.2020.132748

Heat transport bounds for a truncated model of Rayleigh-Bénard convection via polynomial optimization

Authors: Matthew L. Olson, David Goluskin, William W. Schultz, Charles R. Doering

Abstract: Upper bounds on time-averaged heat transport are obtained for an eight-mode Galerkin truncation of Rayleigh's 1916 model of natural thermal convection. Bounds for the ODE model---an extension of Lorenz's three-ODE system---are derived by constructing auxiliary functions that satisfy sufficient conditions wherein certain polynomial expressions must be nonnegative. Such conditions are enforced by requiring the polynomial expressions to admit sum-of-squares representations, allowing the resulting bounds to be minimized using semidefinite programming. Sharp or nearly sharp bounds on mean heat transport are computed numerically for numerous values of the model parameters: the Rayleigh and Prandtl numbers and the domain aspect ratio. In all cases where the Rayleigh number is small enough for the ODE model to be quantitatively close to the PDE model, mean heat transport is maximized by steady states. In some cases at larger Rayleigh number, time-periodic states maximize heat transport in the truncated model. Analytical parameter-dependent bounds are derived using quadratic auxiliary functions, and they are sharp for sufficiently small Rayleigh numbers.

Submitted 1 September, 2020; v1 submitted 15 April, 2020; originally announced April 2020.

Comments: 37 pages; v2: minor revisions

Journal ref: Physica D 415, 132748 (2020)

arXiv:1909.12969 [pdf, other]

Counterfactual States for Atari Agents via Generative Deep Learning

Authors: Matthew L. Olson, Lawrence Neal, Fuxin Li, Weng-Keen Wong

Abstract: Although deep reinforcement learning agents have produced impressive results in many domains, their decision making is difficult to explain to humans. To address this problem, past work has mainly focused on explaining why an action was chosen in a given state. A different type of explanation that is useful is a counterfactual, which deals with "what if?" scenarios. In this work, we introduce the concept of a counterfactual state to help humans gain a better understanding of what would need to change (minimally) in an Atari game image for the agent to choose a different action. We introduce a novel method to create counterfactual states from a generative deep learning architecture. In addition, we evaluate the effectiveness of counterfactual states on human participants who are not machine learning experts. Our user study results suggest that our generated counterfactual states are useful in helping non-expert participants gain a better understanding of an agent's decision making process.

Submitted 27 September, 2019; originally announced September 2019.

Comments: IJCAI XAI Workshop 2019

Showing 1–12 of 12 results for author: Olson, M L