Awesome-Multimodal-Large-Language-Models

Our MLLM works

🔥🔥🔥 A Survey on Multimodal Large Language Models
Project Page [This Page] | Paper

The first comprehensive survey for Multimodal Large Language Models (MLLMs). ✨

You are welcome to add our WeChat ID (wmd_ustc) to join the MLLM discussion group! 🌟


🔥🔥🔥 VITA: Towards Open-Source Interactive Omni Multimodal LLM

🔥🔥🔥 [2024.09.06] The training code, deployment code, and model weights have been released. The long wait is over!

We are announcing VITA, the first-ever open-source Multimodal LLM that can process Video, Image, Text, and Audio, while also delivering an advanced multimodal interactive experience.

Omni Multimodal Understanding. VITA demonstrates robust foundational capabilities in multilingual, vision, and audio understanding, as evidenced by its strong performance across a range of unimodal and multimodal benchmarks. ✨

Non-awakening Interaction. VITA can be activated by and respond to user audio questions in its environment without requiring a wake-up word or button. ✨

Audio Interrupt Interaction. VITA is able to simultaneously track and filter external queries in real time. This allows users to interrupt the model's generation at any time with a new question, and VITA will respond to the new query accordingly. ✨


🔥🔥🔥 Video-MME: The First-Ever Comprehensive Evaluation Benchmark of Multi-modal LLMs in Video Analysis

[2024.06.03] We are very proud to launch Video-MME, the first-ever comprehensive evaluation benchmark of MLLMs in Video Analysis! 🌟

It applies to both image MLLMs, i.e., those generalizing to multiple images, and video MLLMs. Our leaderboard involves SOTA models like Gemini 1.5 Pro, GPT-4o, GPT-4V, LLaVA-NeXT-Video, InternVL-Chat-V1.5, and Qwen-VL-Max. 🌟

It includes short- (< 2 min), medium- (4–15 min), and long-term (30–60 min) videos, ranging from 11 seconds to 1 hour. ✨

All data are newly collected and annotated by humans, not from any existing video dataset. ✨


🔥🔥🔥 MME: A Comprehensive Evaluation Benchmark for Multimodal Large Language Models
Project Page [Leaderboards] | Paper | ✒️ Citation

A comprehensive evaluation benchmark for MLLMs. Now the leaderboards include 50+ advanced models, such as Qwen-VL-Max, Gemini Pro, and GPT-4V. ✨

If you want to add your model to our leaderboards, please feel free to email bradyfu24@gmail.com. We will update the leaderboards promptly. ✨

Download MME 🌟🌟

The benchmark dataset is collected by Xiamen University for academic research only. You can email yongdongluo@stu.xmu.edu.cn to obtain the dataset, subject to the requirement below.

Requirement: A real-name system is encouraged for better academic communication. Your email suffix needs to match your affiliation (e.g., xx@stu.xmu.edu.cn for Xiamen University); otherwise, please explain why. Include the information below in your application email.

Name: (tell us who you are.)
Affiliation: (the name/url of your university or company)
Job Title: (e.g., professor, PhD, and researcher)
Email: (your email address)
How to use: (only for non-commercial use)


📑 If you find our projects helpful to your research, please consider citing:

@article{fu2023mme,
  title={MME: A Comprehensive Evaluation Benchmark for Multimodal Large Language Models},
  author={Fu, Chaoyou and Chen, Peixian and Shen, Yunhang and Qin, Yulei and Zhang, Mengdan and Lin, Xu and Yang, Jinrui and Zheng, Xiawu and Li, Ke and Sun, Xing and others},
  journal={arXiv preprint arXiv:2306.13394},
  year={2023}
}

@article{fu2024vita,
  title={VITA: Towards Open-Source Interactive Omni Multimodal LLM},
  author={Fu, Chaoyou and Lin, Haojia and Long, Zuwei and Shen, Yunhang and Zhao, Meng and Zhang, Yifan and Wang, Xiong and Yin, Di and Ma, Long and Zheng, Xiawu and He, Ran and Ji, Rongrong and Wu, Yunsheng and Shan, Caifeng and Sun, Xing},
  journal={arXiv preprint arXiv:2408.05211},
  year={2024}
}

@article{fu2024video,
  title={Video-MME: The First-Ever Comprehensive Evaluation Benchmark of Multi-modal LLMs in Video Analysis},
  author={Fu, Chaoyou and Dai, Yuhan and Luo, Yongdong and Li, Lei and Ren, Shuhuai and Zhang, Renrui and Wang, Zihan and Zhou, Chenyu and Shen, Yunhang and Zhang, Mengdan and others},
  journal={arXiv preprint arXiv:2405.21075},
  year={2024}
}

@article{yin2023survey,
  title={A survey on multimodal large language models},
  author={Yin, Shukang and Fu, Chaoyou and Zhao, Sirui and Li, Ke and Sun, Xing and Xu, Tong and Chen, Enhong},
  journal={arXiv preprint arXiv:2306.13549},
  year={2023}
}


Table of Contents

- Awesome Papers
  - Multimodal Instruction Tuning
  - Multimodal Hallucination
  - Multimodal In-Context Learning
  - Multimodal Chain-of-Thought

Awesome Papers

Multimodal Instruction Tuning

| Title | Venue | Date | Code | Demo |
|:------|:-----:|:----:|:----:|:----:|
| Qwen2-VL: Enhancing Vision-Language Model's Perception of the World at Any Resolution | arXiv | 2024-09-18 | Github | Demo |
| LongLLaVA: Scaling Multi-modal LLMs to 1000 Images Efficiently via Hybrid Architecture | arXiv | 2024-09-04 | Github | - |
| EAGLE: Exploring The Design Space for Multimodal LLMs with Mixture of Encoders | arXiv | 2024-08-28 | Github | Demo |
| mPLUG-Owl3: Towards Long Image-Sequence Understanding in Multi-Modal Large Language Models | arXiv | 2024-08-09 | Github | - |
| VITA: Towards Open-Source Interactive Omni Multimodal LLM | arXiv | 2024-08-09 | Github | - |
| LLaVA-OneVision: Easy Visual Task Transfer | arXiv | 2024-08-06 | Github | Demo |
| MiniCPM-V: A GPT-4V Level MLLM on Your Phone | arXiv | 2024-08-03 | Github | Demo |
| VILA^2: VILA Augmented VILA | arXiv | 2024-07-24 | - | - |
| SlowFast-LLaVA: A Strong Training-Free Baseline for Video Large Language Models | arXiv | 2024-07-22 | - | - |
| EVLM: An Efficient Vision-Language Model for Visual Understanding | arXiv | 2024-07-19 | - | - |
| InternLM-XComposer-2.5: A Versatile Large Vision Language Model Supporting Long-Contextual Input and Output | arXiv | 2024-07-03 | Github | Demo |
| OMG-LLaVA: Bridging Image-level, Object-level, Pixel-level Reasoning and Understanding | arXiv | 2024-06-27 | Github | Local Demo |
| Cambrian-1: A Fully Open, Vision-Centric Exploration of Multimodal LLMs | arXiv | 2024-06-24 | Github | Local Demo |
| Long Context Transfer from Language to Vision | arXiv | 2024-06-24 | Github | Local Demo |
| Unveiling Encoder-Free Vision-Language Models | arXiv | 2024-06-17 | Github | Local Demo |
| Beyond LLaVA-HD: Diving into High-Resolution Large Multimodal Models | arXiv | 2024-06-12 | Github | - |
| VideoLLaMA 2: Advancing Spatial-Temporal Modeling and Audio Understanding in Video-LLMs | arXiv | 2024-06-11 | Github | Local Demo |
| Parrot: Multilingual Visual Instruction Tuning | arXiv | 2024-06-04 | Github | - |
| Ovis: Structural Embedding Alignment for Multimodal Large Language Model | arXiv | 2024-05-31 | Github | - |
| Matryoshka Query Transformer for Large Vision-Language Models | arXiv | 2024-05-29 | Github | Demo |
| ConvLLaVA: Hierarchical Backbones as Visual Encoder for Large Multimodal Models | arXiv | 2024-05-24 | Github | - |
| Meteor: Mamba-based Traversal of Rationale for Large Language and Vision Models | arXiv | 2024-05-24 | Github | Demo |
| Libra: Building Decoupled Vision System on Large Language Models | ICML | 2024-05-16 | Github | Local Demo |
| CuMo: Scaling Multimodal LLM with Co-Upcycled Mixture-of-Experts | arXiv | 2024-05-09 | Github | Local Demo |
| How Far Are We to GPT-4V? Closing the Gap to Commercial Multimodal Models with Open-Source Suites | arXiv | 2024-04-25 | Github | Demo |
| Graphic Design with Large Multimodal Model | arXiv | 2024-04-22 | Github | - |
| InternLM-XComposer2-4KHD: A Pioneering Large Vision-Language Model Handling Resolutions from 336 Pixels to 4K HD | arXiv | 2024-04-09 | Github | Demo |
| Ferret-UI: Grounded Mobile UI Understanding with Multimodal LLMs | arXiv | 2024-04-08 | - | - |
| MA-LMM: Memory-Augmented Large Multimodal Model for Long-Term Video Understanding | CVPR | 2024-04-08 | Github | - |
| TOMGPT: Reliable Text-Only Training Approach for Cost-Effective Multi-modal Large Language Model | ACM TKDD | 2024-03-28 | - | - |
| Mini-Gemini: Mining the Potential of Multi-modality Vision Language Models | arXiv | 2024-03-27 | Github | Demo |
| MM1: Methods, Analysis & Insights from Multimodal LLM Pre-training | arXiv | 2024-03-14 | - | - |
| MoAI: Mixture of All Intelligence for Large Language and Vision Models | arXiv | 2024-03-12 | Github | Local Demo |
| TextMonkey: An OCR-Free Large Multimodal Model for Understanding Document | arXiv | 2024-03-07 | Github | Demo |
| The All-Seeing Project V2: Towards General Relation Comprehension of the Open World | arXiv | 2024-02-29 | Github | - |
| GROUNDHOG: Grounding Large Language Models to Holistic Segmentation | CVPR | 2024-02-26 | Coming soon | Coming soon |
| AnyGPT: Unified Multimodal LLM with Discrete Sequence Modeling | arXiv | 2024-02-19 | Github | - |
| Momentor: Advancing Video Large Language Model with Fine-Grained Temporal Reasoning | arXiv | 2024-02-18 | Github | - |
| ALLaVA: Harnessing GPT4V-synthesized Data for A Lite Vision-Language Model | arXiv | 2024-02-18 | Github | Demo |
| CoLLaVO: Crayon Large Language and Vision mOdel | arXiv | 2024-02-17 | Github | - |
| CogCoM: Train Large Vision-Language Models Diving into Details through Chain of Manipulations | arXiv | 2024-02-06 | Github | - |
| MobileVLM V2: Faster and Stronger Baseline for Vision Language Model | arXiv | 2024-02-06 | Github | - |
| Enhancing Multimodal Large Language Models with Vision Detection Models: An Empirical Study | arXiv | 2024-01-31 | Coming soon | - |
| LLaVA-NeXT: Improved reasoning, OCR, and world knowledge | Blog | 2024-01-30 | Github | Demo |
| MoE-LLaVA: Mixture of Experts for Large Vision-Language Models | arXiv | 2024-01-29 | Github | Demo |
| InternLM-XComposer2: Mastering Free-form Text-Image Composition and Comprehension in Vision-Language Large Model | arXiv | 2024-01-29 | Github | Demo |
| Yi-VL | - | 2024-01-23 | Github | Local Demo |
| SpatialVLM: Endowing Vision-Language Models with Spatial Reasoning Capabilities | arXiv | 2024-01-22 | - | - |
| MobileVLM: A Fast, Reproducible and Strong Vision Language Assistant for Mobile Devices | arXiv | 2023-12-28 | Github | - |
| InternVL: Scaling up Vision Foundation Models and Aligning for Generic Visual-Linguistic Tasks | CVPR | 2023-12-21 | Github | Demo |
| Osprey: Pixel Understanding with Visual Instruction Tuning | CVPR | 2023-12-15 | Github | Demo |
| CogAgent: A Visual Language Model for GUI Agents | arXiv | 2023-12-14 | Github | Coming soon |
| Pixel Aligned Language Models | arXiv | 2023-12-14 | Coming soon | - |
| See, Say, and Segment: Teaching LMMs to Overcome False Premises | arXiv | 2023-12-13 | Coming soon | - |
| Vary: Scaling up the Vision Vocabulary for Large Vision-Language Models | ECCV | 2023-12-11 | Github | Demo |
| Honeybee: Locality-enhanced Projector for Multimodal LLM | CVPR | 2023-12-11 | Github | - |
| Gemini: A Family of Highly Capable Multimodal Models | Google | 2023-12-06 | - | - |
| OneLLM: One Framework to Align All Modalities with Language | arXiv | 2023-12-06 | Github | Demo |
| Lenna: Language Enhanced Reasoning Detection Assistant | arXiv | 2023-12-05 | Github | - |
| VaQuitA: Enhancing Alignment in LLM-Assisted Video Understanding | arXiv | 2023-12-04 | - | - |
| TimeChat: A Time-sensitive Multimodal Large Language Model for Long Video Understanding | arXiv | 2023-12-04 | Github | Local Demo |
| Making Large Multimodal Models Understand Arbitrary Visual Prompts | CVPR | 2023-12-01 | Github | Demo |
| Dolphins: Multimodal Language Model for Driving | arXiv | 2023-12-01 | Github | - |
| LL3DA: Visual Interactive Instruction Tuning for Omni-3D Understanding, Reasoning, and Planning | arXiv | 2023-11-30 | Github | Coming soon |
| VTimeLLM: Empower LLM to Grasp Video Moments | arXiv | 2023-11-30 | Github | Local Demo |
| mPLUG-PaperOwl: Scientific Diagram Analysis with the Multimodal Large Language Model | arXiv | 2023-11-30 | Github | - |
| LLaMA-VID: An Image is Worth 2 Tokens in Large Language Models | arXiv | 2023-11-28 | Github | Coming soon |
| LLMGA: Multimodal Large Language Model based Generation Assistant | arXiv | 2023-11-27 | Github | Demo |
| ChartLlama: A Multimodal LLM for Chart Understanding and Generation | arXiv | 2023-11-27 | Github | - |
| ShareGPT4V: Improving Large Multi-Modal Models with Better Captions | arXiv | 2023-11-21 | Github | Demo |
| LION: Empowering Multimodal Large Language Model with Dual-Level Visual Knowledge | arXiv | 2023-11-20 | Github | - |
| An Embodied Generalist Agent in 3D World | arXiv | 2023-11-18 | Github | Demo |
| Video-LLaVA: Learning United Visual Representation by Alignment Before Projection | arXiv | 2023-11-16 | Github | Demo |
| Chat-UniVi: Unified Visual Representation Empowers Large Language Models with Image and Video Understanding | CVPR | 2023-11-14 | Github | - |
| To See is to Believe: Prompting GPT-4V for Better Visual Instruction Tuning | arXiv | 2023-11-13 | Github | - |
| SPHINX: The Joint Mixing of Weights, Tasks, and Visual Embeddings for Multi-modal Large Language Models | arXiv | 2023-11-13 | Github | Demo |
| Monkey: Image Resolution and Text Label Are Important Things for Large Multi-modal Models | CVPR | 2023-11-11 | Github | Demo |
| LLaVA-Plus: Learning to Use Tools for Creating Multimodal Agents | arXiv | 2023-11-09 | Github | Demo |
| NExT-Chat: An LMM for Chat, Detection and Segmentation | arXiv | 2023-11-08 | Github | Local Demo |
| mPLUG-Owl2: Revolutionizing Multi-modal Large Language Model with Modality Collaboration | arXiv | 2023-11-07 | Github | Demo |
| OtterHD: A High-Resolution Multi-modality Model | arXiv | 2023-11-07 | Github | - |
| CoVLM: Composing Visual Entities and Relationships in Large Language Models Via Communicative Decoding | arXiv | 2023-11-06 | Coming soon | - |
| GLaMM: Pixel Grounding Large Multimodal Model | CVPR | 2023-11-06 | Github | Demo |
| What Makes for Good Visual Instructions? Synthesizing Complex Visual Reasoning Instructions for Visual Instruction Tuning | arXiv | 2023-11-02 | Github | - |
| MiniGPT-v2: large language model as a unified interface for vision-language multi-task learning | arXiv | 2023-10-14 | Github | Local Demo |
| Ferret: Refer and Ground Anything Anywhere at Any Granularity | arXiv | 2023-10-11 | Github | - |
| CogVLM: Visual Expert For Large Language Models | arXiv | 2023-10-09 | Github | Demo |
| Improved Baselines with Visual Instruction Tuning | arXiv | 2023-10-05 | Github | Demo |
| LanguageBind: Extending Video-Language Pretraining to N-modality by Language-based Semantic Alignment | ICLR | 2023-10-03 | Github | Demo |
| Pink: Unveiling the Power of Referential Comprehension for Multi-modal LLMs | arXiv | 2023-10-01 | Github | - |
| Reformulating Vision-Language Foundation Models and Datasets Towards Universal Multimodal Assistants | arXiv | 2023-10-01 | Github | Local Demo |
| AnyMAL: An Efficient and Scalable Any-Modality Augmented Language Model | arXiv | 2023-09-27 | - | - |
| InternLM-XComposer: A Vision-Language Large Model for Advanced Text-image Comprehension and Composition | arXiv | 2023-09-26 | Github | Local Demo |
| DreamLLM: Synergistic Multimodal Comprehension and Creation | ICLR | 2023-09-20 | Github | Coming soon |
| An Empirical Study of Scaling Instruction-Tuned Large Multimodal Models | arXiv | 2023-09-18 | Coming soon | - |
| TextBind: Multi-turn Interleaved Multimodal Instruction-following | arXiv | 2023-09-14 | Github | Demo |
| Sight Beyond Text: Multi-Modal Training Enhances LLMs in Truthfulness and Ethics | arXiv | 2023-09-13 | Github | - |
| NExT-GPT: Any-to-Any Multimodal LLM | arXiv | 2023-09-11 | Github | Demo |
| ImageBind-LLM: Multi-modality Instruction Tuning | arXiv | 2023-09-07 | Github | Demo |
| Scaling Autoregressive Multi-Modal Models: Pretraining and Instruction Tuning | arXiv | 2023-09-05 | - | - |
| PointLLM: Empowering Large Language Models to Understand Point Clouds | arXiv | 2023-08-31 | Github | Demo |
| ✨Sparkles: Unlocking Chats Across Multiple Images for Multimodal Instruction-Following Models | arXiv | 2023-08-31 | Github | Local Demo |
| MLLM-DataEngine: An Iterative Refinement Approach for MLLM | arXiv | 2023-08-25 | Github | - |
| Position-Enhanced Visual Instruction Tuning for Multimodal Large Language Models | arXiv | 2023-08-25 | Github | Demo |
| Qwen-VL: A Frontier Large Vision-Language Model with Versatile Abilities | arXiv | 2023-08-24 | Github | Demo |
| Large Multilingual Models Pivot Zero-Shot Multimodal Learning across Languages | ICLR | 2023-08-23 | Github | Demo |
| StableLLaVA: Enhanced Visual Instruction Tuning with Synthesized Image-Dialogue Data | arXiv | 2023-08-20 | Github | - |
| BLIVA: A Simple Multimodal LLM for Better Handling of Text-rich Visual Questions | arXiv | 2023-08-19 | Github | Demo |
| Fine-tuning Multimodal LLMs to Follow Zero-shot Demonstrative Instructions | arXiv | 2023-08-08 | Github | - |
| The All-Seeing Project: Towards Panoptic Visual Recognition and Understanding of the Open World | ICLR | 2023-08-03 | Github | Demo |
| LISA: Reasoning Segmentation via Large Language Model | arXiv | 2023-08-01 | Github | Demo |
| MovieChat: From Dense Token to Sparse Memory for Long Video Understanding | arXiv | 2023-07-31 | Github | Local Demo |
| 3D-LLM: Injecting the 3D World into Large Language Models | arXiv | 2023-07-24 | Github | - |
| ChatSpot: Bootstrapping Multimodal LLMs via Precise Referring Instruction Tuning | arXiv | 2023-07-18 | - | Demo |
| BuboGPT: Enabling Visual Grounding in Multi-Modal LLMs | arXiv | 2023-07-17 | Github | Demo |
| SVIT: Scaling up Visual Instruction Tuning | arXiv | 2023-07-09 | Github | - |
| GPT4RoI: Instruction Tuning Large Language Model on Region-of-Interest | arXiv | 2023-07-07 | Github | Demo |
| What Matters in Training a GPT4-Style Language Model with Multimodal Inputs? | arXiv | 2023-07-05 | Github | - |
| mPLUG-DocOwl: Modularized Multimodal Large Language Model for Document Understanding | arXiv | 2023-07-04 | Github | Demo |
| Visual Instruction Tuning with Polite Flamingo | arXiv | 2023-07-03 | Github | Demo |
| LLaVAR: Enhanced Visual Instruction Tuning for Text-Rich Image Understanding | arXiv | 2023-06-29 | Github | Demo |
| Shikra: Unleashing Multimodal LLM's Referential Dialogue Magic | arXiv | 2023-06-27 | Github | Demo |
| MotionGPT: Human Motion as a Foreign Language | arXiv | 2023-06-26 | Github | - |
| Macaw-LLM: Multi-Modal Language Modeling with Image, Audio, Video, and Text Integration | arXiv | 2023-06-15 | Github | Coming soon |
| LAMM: Language-Assisted Multi-Modal Instruction-Tuning Dataset, Framework, and Benchmark | arXiv | 2023-06-11 | Github | Demo |
| Video-ChatGPT: Towards Detailed Video Understanding via Large Vision and Language Models | arXiv | 2023-06-08 | Github | Demo |
| MIMIC-IT: Multi-Modal In-Context Instruction Tuning | arXiv | 2023-06-08 | Github | Demo |
| M3IT: A Large-Scale Dataset towards Multi-Modal Multilingual Instruction Tuning | arXiv | 2023-06-07 | - | - |
| Video-LLaMA: An Instruction-tuned Audio-Visual Language Model for Video Understanding | arXiv | 2023-06-05 | Github | Demo |
| LLaVA-Med: Training a Large Language-and-Vision Assistant for Biomedicine in One Day | arXiv | 2023-06-01 | Github | - |
| GPT4Tools: Teaching Large Language Model to Use Tools via Self-instruction | arXiv | 2023-05-30 | Github | Demo |
| PandaGPT: One Model To Instruction-Follow Them All | arXiv | 2023-05-25 | Github | Demo |
| ChatBridge: Bridging Modalities with Large Language Model as a Language Catalyst | arXiv | 2023-05-25 | Github | - |
| Cheap and Quick: Efficient Vision-Language Instruction Tuning for Large Language Models | arXiv | 2023-05-24 | Github | Local Demo |
| DetGPT: Detect What You Need via Reasoning | arXiv | 2023-05-23 | Github | Demo |
| Pengi: An Audio Language Model for Audio Tasks | NeurIPS | 2023-05-19 | Github | - |
| VisionLLM: Large Language Model is also an Open-Ended Decoder for Vision-Centric Tasks | arXiv | 2023-05-18 | Github | - |
| Listen, Think, and Understand | arXiv | 2023-05-18 | Github | Demo |
| VisualGLM-6B | - | 2023-05-17 | Github | Local Demo |
| PMC-VQA: Visual Instruction Tuning for Medical Visual Question Answering | arXiv | 2023-05-17 | Github | - |
| InstructBLIP: Towards General-purpose Vision-Language Models with Instruction Tuning | arXiv | 2023-05-11 | Github | Local Demo |
| VideoChat: Chat-Centric Video Understanding | arXiv | 2023-05-10 | Github | Demo |
| MultiModal-GPT: A Vision and Language Model for Dialogue with Humans | arXiv | 2023-05-08 | Github | Demo |
| X-LLM: Bootstrapping Advanced Large Language Models by Treating Multi-Modalities as Foreign Languages | arXiv | 2023-05-07 | Github | - |
| LMEye: An Interactive Perception Network for Large Language Models | arXiv | 2023-05-05 | Github | Local Demo |
| LLaMA-Adapter V2: Parameter-Efficient Visual Instruction Model | arXiv | 2023-04-28 | Github | Demo |
| mPLUG-Owl: Modularization Empowers Large Language Models with Multimodality | arXiv | 2023-04-27 | Github | Demo |
| MiniGPT-4: Enhancing Vision-Language Understanding with Advanced Large Language Models | arXiv | 2023-04-20 | Github | - |
| Visual Instruction Tuning | NeurIPS | 2023-04-17 | GitHub | Demo |
| LLaMA-Adapter: Efficient Fine-tuning of Language Models with Zero-init Attention | ICLR | 2023-03-28 | Github | Demo |
| MultiInstruct: Improving Multi-Modal Zero-Shot Learning via Instruction Tuning | ACL | 2022-12-21 | Github | - |

Multimodal Hallucination

| Title | Venue | Date | Code | Demo |
|:------|:-----:|:----:|:----:|:----:|
| FIHA: Autonomous Hallucination Evaluation in Vision-Language Models with Davidson Scene Graphs | arXiv | 2024-09-20 | Link | - |
| Alleviating Hallucination in Large Vision-Language Models with Active Retrieval Augmentation | arXiv | 2024-08-01 | - | - |
| Paying More Attention to Image: A Training-Free Method for Alleviating Hallucination in LVLMs | ECCV | 2024-07-31 | Github | - |
| Evaluating and Analyzing Relationship Hallucinations in LVLMs | ICML | 2024-06-24 | Github | - |
| AGLA: Mitigating Object Hallucinations in Large Vision-Language Models with Assembly of Global and Local Attention | arXiv | 2024-06-18 | Github | - |
| CODE: Contrasting Self-generated Description to Combat Hallucination in Large Multi-modal Models | arXiv | 2024-06-04 | Coming soon | - |
| VDGD: Mitigating LVLM Hallucinations in Cognitive Prompts by Bridging the Visual Perception Gap | arXiv | 2024-05-24 | Coming soon | - |
| Detecting and Mitigating Hallucination in Large Vision Language Models via Fine-Grained AI Feedback | arXiv | 2024-04-22 | - | - |
| Mitigating Hallucinations in Large Vision-Language Models with Instruction Contrastive Decoding | arXiv | 2024-03-27 | - | - |
| What if...?: Counterfactual Inception to Mitigate Hallucination Effects in Large Multimodal Models | arXiv | 2024-03-20 | Github | - |
| Strengthening Multimodal Large Language Model with Bootstrapped Preference Optimization | arXiv | 2024-03-13 | - | - |
| Debiasing Multimodal Large Language Models | arXiv | 2024-03-08 | Github | - |
| HALC: Object Hallucination Reduction via Adaptive Focal-Contrast Decoding | arXiv | 2024-03-01 | Github | - |
| IBD: Alleviating Hallucinations in Large Vision-Language Models via Image-Biased Decoding | arXiv | 2024-02-28 | - | - |
| Less is More: Mitigating Multimodal Hallucination from an EOS Decision Perspective | arXiv | 2024-02-22 | Github | - |
| Logical Closed Loop: Uncovering Object Hallucinations in Large Vision-Language Models | arXiv | 2024-02-18 | Github | - |
| The Instinctive Bias: Spurious Images lead to Hallucination in MLLMs | arXiv | 2024-02-06 | Github | - |
| Unified Hallucination Detection for Multimodal Large Language Models | arXiv | 2024-02-05 | Github | - |
| A Survey on Hallucination in Large Vision-Language Models | arXiv | 2024-02-01 | - | - |
| Temporal Insight Enhancement: Mitigating Temporal Hallucination in Multimodal Large Language Models | arXiv | 2024-01-18 | - | - |
| Hallucination Augmented Contrastive Learning for Multimodal Large Language Model | arXiv | 2023-12-12 | Github | - |
| MOCHa: Multi-Objective Reinforcement Mitigating Caption Hallucinations | arXiv | 2023-12-06 | Github | - |
| Mitigating Fine-Grained Hallucination by Fine-Tuning Large Vision-Language Models with Caption Rewrites | arXiv | 2023-12-04 | Github | - |
| RLHF-V: Towards Trustworthy MLLMs via Behavior Alignment from Fine-grained Correctional Human Feedback | arXiv | 2023-12-01 | Github | Demo |
| OPERA: Alleviating Hallucination in Multi-Modal Large Language Models via Over-Trust Penalty and Retrospection-Allocation | CVPR | 2023-11-29 | Github | - |
| Mitigating Object Hallucinations in Large Vision-Language Models through Visual Contrastive Decoding | CVPR | 2023-11-28 | Github | - |
| Beyond Hallucinations: Enhancing LVLMs through Hallucination-Aware Direct Preference Optimization | arXiv | 2023-11-28 | Github | Coming soon |
| Mitigating Hallucination in Visual Language Models with Visual Supervision | arXiv | 2023-11-27 | - | - |
| HalluciDoctor: Mitigating Hallucinatory Toxicity in Visual Instruction Data | arXiv | 2023-11-22 | Github | - |
| An LLM-free Multi-dimensional Benchmark for MLLMs Hallucination Evaluation | arXiv | 2023-11-13 | Github | - |
| FAITHSCORE: Evaluating Hallucinations in Large Vision-Language Models | arXiv | 2023-11-02 | Github | - |
| Woodpecker: Hallucination Correction for Multimodal Large Language Models | arXiv | 2023-10-24 | Github | Demo |
| Negative Object Presence Evaluation (NOPE) to Measure Object Hallucination in Vision-Language Models | arXiv | 2023-10-09 | - | - |
| HallE-Switch: Rethinking and Controlling Object Existence Hallucinations in Large Vision Language Models for Detailed Caption | arXiv | 2023-10-03 | Github | - |
| Analyzing and Mitigating Object Hallucination in Large Vision-Language Models | ICLR | 2023-10-01 | Github | - |
| Aligning Large Multimodal Models with Factually Augmented RLHF | arXiv | 2023-09-25 | Github | Demo |
| Evaluation and Mitigation of Agnosia in Multimodal Large Language Models | arXiv | 2023-09-07 | - | - |
| CIEM: Contrastive Instruction Evaluation Method for Better Instruction Tuning | arXiv | 2023-09-05 | - | - |
| Evaluation and Analysis of Hallucination in Large Vision-Language Models | arXiv | 2023-08-29 | Github | - |
| VIGC: Visual Instruction Generation and Correction | arXiv | 2023-08-24 | Github | Demo |
| Detecting and Preventing Hallucinations in Large Vision Language Models | arXiv | 2023-08-11 | - | - |
| Mitigating Hallucination in Large Multi-Modal Models via Robust Instruction Tuning | ICLR | 2023-06-26 | Github | Demo |
| Evaluating Object Hallucination in Large Vision-Language Models | EMNLP | 2023-05-17 | Github | - |

Multimodal In-Context Learning

| Title | Venue | Date | Code | Demo |
|:------|:-----:|:----:|:----:|:----:|
| Visual In-Context Learning for Large Vision-Language Models | arXiv | 2024-02-18 | - | - |
| Can MLLMs Perform Text-to-Image In-Context Learning? | arXiv | 2024-02-02 | Github | - |
| Generative Multimodal Models are In-Context Learners | CVPR | 2023-12-20 | Github | Demo |
| Hijacking Context in Large Multi-modal Models | arXiv | 2023-12-07 | - | - |
| Towards More Unified In-context Visual Understanding | arXiv | 2023-12-05 | - | - |
| MMICL: Empowering Vision-language Model with Multi-Modal In-Context Learning | arXiv | 2023-09-14 | Github | Demo |
| Link-Context Learning for Multimodal LLMs | arXiv | 2023-08-15 | Github | Demo |
| OpenFlamingo: An Open-Source Framework for Training Large Autoregressive Vision-Language Models | arXiv | 2023-08-02 | Github | Demo |
| Med-Flamingo: a Multimodal Medical Few-shot Learner | arXiv | 2023-07-27 | Github | Local Demo |
| Generative Pretraining in Multimodality | ICLR | 2023-07-11 | Github | Demo |
| AVIS: Autonomous Visual Information Seeking with Large Language Models | arXiv | 2023-06-13 | - | - |
| MIMIC-IT: Multi-Modal In-Context Instruction Tuning | arXiv | 2023-06-08 | Github | Demo |
| Exploring Diverse In-Context Configurations for Image Captioning | NeurIPS | 2023-05-24 | Github | - |
| Chameleon: Plug-and-Play Compositional Reasoning with Large Language Models | arXiv | 2023-04-19 | Github | Demo |
| HuggingGPT: Solving AI Tasks with ChatGPT and its Friends in HuggingFace | arXiv | 2023-03-30 | Github | Demo |
| MM-REACT: Prompting ChatGPT for Multimodal Reasoning and Action | arXiv | 2023-03-20 | Github | Demo |
| ICL-D3IE: In-Context Learning with Diverse Demonstrations Updating for Document Information Extraction | ICCV | 2023-03-09 | Github | - |
| Prompting Large Language Models with Answer Heuristics for Knowledge-based Visual Question Answering | CVPR | 2023-03-03 | Github | - |
| Visual Programming: Compositional visual reasoning without training | CVPR | 2022-11-18 | Github | Local Demo |
| An Empirical Study of GPT-3 for Few-Shot Knowledge-Based VQA | AAAI | 2022-06-28 | Github | - |
| Flamingo: a Visual Language Model for Few-Shot Learning | NeurIPS | 2022-04-29 | Github | Demo |
| Multimodal Few-Shot Learning with Frozen Language Models | NeurIPS | 2021-06-25 | - | - |

Multimodal Chain-of-Thought

| Title | Venue | Date | Code | Demo |
|:------|:-----:|:----:|:----:|:----:|
| Cantor: Inspiring Multimodal Chain-of-Thought of MLLM | arXiv | 2024-04-24 | Github | Local Demo |
| Visual CoT: Unleashing Chain-of-Thought Reasoning in Multi-Modal Language Models | arXiv | 2024-03-25 | Github | Local Demo |
| DDCoT: Duty-Distinct Chain-of-Thought Prompting for Multimodal Reasoning in Language Models | NeurIPS | 2023-10-25 | Github | - |
| Shikra: Unleashing Multimodal LLM's Referential Dialogue Magic | arXiv | 2023-06-27 | Github | Demo |
| Explainable Multimodal Emotion Reasoning | arXiv | 2023-06-27 | Github | - |
| EmbodiedGPT: Vision-Language Pre-Training via Embodied Chain of Thought | arXiv | 2023-05-24 | Github | - |
| Let's Think Frame by Frame: Evaluating Video Chain of Thought with Video Infilling and Prediction | arXiv | 2023-05-23 | - | - |
| T-SciQ: Teaching Multimodal Chain-of-Thought Reasoning via Large Language Model Signals for Science Question Answering | arXiv | 2023-05-05 | - | - |
| Caption Anything: Interactive Image Description with Diverse Multimodal Controls | arXiv | 2023-05-04 | Github | Demo |
| Visual Chain of Thought: Bridging Logical Gaps with Multimodal Infillings | arXiv | 2023-05-03 | Coming soon | - |