Awesome-Multimodal-Large-Language-Models

Our MLLM works

🔥🔥🔥 A Survey on Multimodal Large Language Models
Project Page [This Page] | Paper

The first comprehensive survey for Multimodal Large Language Models (MLLMs). ✨

You are welcome to add our WeChat ID (wmd_ustc) to join the MLLM discussion group! 🌟


🔥🔥🔥 VITA: Towards Open-Source Interactive Omni Multimodal LLM

🔥🔥🔥 [2024.09.06] The training code, deployment code, and model weights have been released. The long wait is over!

We are announcing VITA, the first-ever open-source Multimodal LLM that can process Video, Image, Text, and Audio, while also delivering an advanced multimodal interactive experience.

Omni Multimodal Understanding. VITA demonstrates robust foundational capabilities in multilingual, vision, and audio understanding, as evidenced by its strong performance across a range of unimodal and multimodal benchmarks. ✨

Non-awakening Interaction. VITA can be activated by and respond to user audio questions in its environment without requiring a wake-up word or button. ✨

Audio Interrupt Interaction. VITA is able to simultaneously track and filter external queries in real time. This allows users to interrupt the model's generation at any time with a new question, and VITA will respond to the new query accordingly. ✨


🔥🔥🔥 Video-MME: The First-Ever Comprehensive Evaluation Benchmark of Multi-modal LLMs in Video Analysis

[2024.06.03] We are very proud to launch Video-MME, the first-ever comprehensive evaluation benchmark of MLLMs in Video Analysis! 🌟

It applies to both image MLLMs, i.e., those generalizing to multiple images, and video MLLMs. Our leaderboard involves SOTA models like Gemini 1.5 Pro, GPT-4o, GPT-4V, LLaVA-NeXT-Video, InternVL-Chat-V1.5, and Qwen-VL-Max. 🌟

It includes short- (< 2 min), medium- (4–15 min), and long-term (30–60 min) videos, ranging from 11 seconds to 1 hour. ✨

All data are newly collected and annotated by humans, not from any existing video dataset. ✨


🔥🔥🔥 MME: A Comprehensive Evaluation Benchmark for Multimodal Large Language Models
Project Page [Leaderboards] | Paper | ✒️ Citation

A comprehensive evaluation benchmark for MLLMs. Now the leaderboards include 50+ advanced models, such as Qwen-VL-Max, Gemini Pro, and GPT-4V. ✨

If you want to add your model to our leaderboards, please feel free to email bradyfu24@gmail.com. We will update the leaderboards promptly. ✨

Download MME 🌟🌟

The benchmark dataset is collected by Xiamen University for academic research only. You can email yongdongluo@stu.xmu.edu.cn to obtain the dataset, subject to the requirement below.

Requirement: A real-name system is encouraged for better academic communication. Your email suffix needs to match your affiliation (e.g., xx@stu.xmu.edu.cn for Xiamen University); otherwise, please explain why. Include the information below in your application email.

Name: (tell us who you are.)
Affiliation: (the name/url of your university or company)
Job Title: (e.g., professor, PhD, and researcher)
Email: (your email address)
How to use: (only for non-commercial use)


📑 If you find our projects helpful to your research, please consider citing:

@article{fu2023mme,
  title={MME: A Comprehensive Evaluation Benchmark for Multimodal Large Language Models},
  author={Fu, Chaoyou and Chen, Peixian and Shen, Yunhang and Qin, Yulei and Zhang, Mengdan and Lin, Xu and Yang, Jinrui and Zheng, Xiawu and Li, Ke and Sun, Xing and others},
  journal={arXiv preprint arXiv:2306.13394},
  year={2023}
}

@article{fu2024vita,
  title={VITA: Towards Open-Source Interactive Omni Multimodal LLM},
  author={Fu, Chaoyou and Lin, Haojia and Long, Zuwei and Shen, Yunhang and Zhao, Meng and Zhang, Yifan and Wang, Xiong and Yin, Di and Ma, Long and Zheng, Xiawu and He, Ran and Ji, Rongrong and Wu, Yunsheng and Shan, Caifeng and Sun, Xing},
  journal={arXiv preprint arXiv:2408.05211},
  year={2024}
}

@article{fu2024video,
  title={Video-MME: The First-Ever Comprehensive Evaluation Benchmark of Multi-modal LLMs in Video Analysis},
  author={Fu, Chaoyou and Dai, Yuhan and Luo, Yongdong and Li, Lei and Ren, Shuhuai and Zhang, Renrui and Wang, Zihan and Zhou, Chenyu and Shen, Yunhang and Zhang, Mengdan and others},
  journal={arXiv preprint arXiv:2405.21075},
  year={2024}
}

@article{yin2023survey,
  title={A survey on multimodal large language models},
  author={Yin, Shukang and Fu, Chaoyou and Zhao, Sirui and Li, Ke and Sun, Xing and Xu, Tong and Chen, Enhong},
  journal={arXiv preprint arXiv:2306.13549},
  year={2023}
}


Table of Contents

- Awesome Papers
  - Multimodal Instruction Tuning
  - Multimodal Hallucination
  - Multimodal In-Context Learning
  - Multimodal Chain-of-Thought

Awesome Papers

Multimodal Instruction Tuning

| Title | Venue | Date | Code | Demo |
|:------|:-----:|:----:|:----:|:----:|
| Qwen2-VL: Enhancing Vision-Language Model's Perception of the World at Any Resolution | arXiv | 2024-09-18 | Github | Demo |
| LongLLaVA: Scaling Multi-modal LLMs to 1000 Images Efficiently via Hybrid Architecture | arXiv | 2024-09-04 | Github | - |
| EAGLE: Exploring The Design Space for Multimodal LLMs with Mixture of Encoders | arXiv | 2024-08-28 | Github | Demo |
| mPLUG-Owl3: Towards Long Image-Sequence Understanding in Multi-Modal Large Language Models | arXiv | 2024-08-09 | Github | - |
| VITA: Towards Open-Source Interactive Omni Multimodal LLM | arXiv | 2024-08-09 | Github | - |
| LLaVA-OneVision: Easy Visual Task Transfer | arXiv | 2024-08-06 | Github | Demo |
| MiniCPM-V: A GPT-4V Level MLLM on Your Phone | arXiv | 2024-08-03 | Github | Demo |
| VILA^2: VILA Augmented VILA | arXiv | 2024-07-24 | - | - |
| SlowFast-LLaVA: A Strong Training-Free Baseline for Video Large Language Models | arXiv | 2024-07-22 | - | - |
| EVLM: An Efficient Vision-Language Model for Visual Understanding | arXiv | 2024-07-19 | - | - |
| InternLM-XComposer-2.5: A Versatile Large Vision Language Model Supporting Long-Contextual Input and Output | arXiv | 2024-07-03 | Github | Demo |
| OMG-LLaVA: Bridging Image-level, Object-level, Pixel-level Reasoning and Understanding | arXiv | 2024-06-27 | Github | Local Demo |
| Cambrian-1: A Fully Open, Vision-Centric Exploration of Multimodal LLMs | arXiv | 2024-06-24 | Github | Local Demo |
| Long Context Transfer from Language to Vision | arXiv | 2024-06-24 | Github | Local Demo |
| Unveiling Encoder-Free Vision-Language Models | arXiv | 2024-06-17 | Github | Local Demo |
| Beyond LLaVA-HD: Diving into High-Resolution Large Multimodal Models | arXiv | 2024-06-12 | Github | - |
| VideoLLaMA 2: Advancing Spatial-Temporal Modeling and Audio Understanding in Video-LLMs | arXiv | 2024-06-11 | Github | Local Demo |
| Parrot: Multilingual Visual Instruction Tuning | arXiv | 2024-06-04 | Github | - |
| Ovis: Structural Embedding Alignment for Multimodal Large Language Model | arXiv | 2024-05-31 | Github | - |
| Matryoshka Query Transformer for Large Vision-Language Models | arXiv | 2024-05-29 | Github | Demo |
| ConvLLaVA: Hierarchical Backbones as Visual Encoder for Large Multimodal Models | arXiv | 2024-05-24 | Github | - |
| Meteor: Mamba-based Traversal of Rationale for Large Language and Vision Models | arXiv | 2024-05-24 | Github | Demo |
| Libra: Building Decoupled Vision System on Large Language Models | ICML | 2024-05-16 | Github | Local Demo |
| CuMo: Scaling Multimodal LLM with Co-Upcycled Mixture-of-Experts | arXiv | 2024-05-09 | Github | Local Demo |
| How Far Are We to GPT-4V? Closing the Gap to Commercial Multimodal Models with Open-Source Suites | arXiv | 2024-04-25 | Github | Demo |
| Graphic Design with Large Multimodal Model | arXiv | 2024-04-22 | Github | - |
| InternLM-XComposer2-4KHD: A Pioneering Large Vision-Language Model Handling Resolutions from 336 Pixels to 4K HD | arXiv | 2024-04-09 | Github | Demo |
| Ferret-UI: Grounded Mobile UI Understanding with Multimodal LLMs | arXiv | 2024-04-08 | - | - |
| MA-LMM: Memory-Augmented Large Multimodal Model for Long-Term Video Understanding | CVPR | 2024-04-08 | Github | - |
| TOMGPT: Reliable Text-Only Training Approach for Cost-Effective Multi-modal Large Language Model | ACM TKDD | 2024-03-28 | - | - |
| Mini-Gemini: Mining the Potential of Multi-modality Vision Language Models | arXiv | 2024-03-27 | Github | Demo |
| MM1: Methods, Analysis & Insights from Multimodal LLM Pre-training | arXiv | 2024-03-14 | - | - |
| MoAI: Mixture of All Intelligence for Large Language and Vision Models | arXiv | 2024-03-12 | Github | Local Demo |
| TextMonkey: An OCR-Free Large Multimodal Model for Understanding Document | arXiv | 2024-03-07 | Github | Demo |
| The All-Seeing Project V2: Towards General Relation Comprehension of the Open World | arXiv | 2024-02-29 | Github | - |
| GROUNDHOG: Grounding Large Language Models to Holistic Segmentation | CVPR | 2024-02-26 | Coming soon | Coming soon |
| AnyGPT: Unified Multimodal LLM with Discrete Sequence Modeling | arXiv | 2024-02-19 | Github | - |
| Momentor: Advancing Video Large Language Model with Fine-Grained Temporal Reasoning | arXiv | 2024-02-18 | Github | - |
| ALLaVA: Harnessing GPT4V-synthesized Data for A Lite Vision-Language Model | arXiv | 2024-02-18 | Github | Demo |
| CoLLaVO: Crayon Large Language and Vision mOdel | arXiv | 2024-02-17 | Github | - |
| CogCoM: Train Large Vision-Language Models Diving into Details through Chain of Manipulations | arXiv | 2024-02-06 | Github | - |
| MobileVLM V2: Faster and Stronger Baseline for Vision Language Model | arXiv | 2024-02-06 | Github | - |
| Enhancing Multimodal Large Language Models with Vision Detection Models: An Empirical Study | arXiv | 2024-01-31 | Coming soon | - |
| LLaVA-NeXT: Improved reasoning, OCR, and world knowledge | Blog | 2024-01-30 | Github | Demo |
| MoE-LLaVA: Mixture of Experts for Large Vision-Language Models | arXiv | 2024-01-29 | Github | Demo |
| InternLM-XComposer2: Mastering Free-form Text-Image Composition and Comprehension in Vision-Language Large Model | arXiv | 2024-01-29 | Github | Demo |
| Yi-VL | - | 2024-01-23 | Github | Local Demo |
| SpatialVLM: Endowing Vision-Language Models with Spatial Reasoning Capabilities | arXiv | 2024-01-22 | - | - |
| MobileVLM: A Fast, Reproducible and Strong Vision Language Assistant for Mobile Devices | arXiv | 2023-12-28 | Github | - |
| InternVL: Scaling up Vision Foundation Models and Aligning for Generic Visual-Linguistic Tasks | CVPR | 2023-12-21 | Github | Demo |
| Osprey: Pixel Understanding with Visual Instruction Tuning | CVPR | 2023-12-15 | Github | Demo |
| CogAgent: A Visual Language Model for GUI Agents | arXiv | 2023-12-14 | Github | Coming soon |
| Pixel Aligned Language Models | arXiv | 2023-12-14 | Coming soon | - |
| See, Say, and Segment: Teaching LMMs to Overcome False Premises | arXiv | 2023-12-13 | Coming soon | - |
| Vary: Scaling up the Vision Vocabulary for Large Vision-Language Models | ECCV | 2023-12-11 | Github | Demo |
| Honeybee: Locality-enhanced Projector for Multimodal LLM | CVPR | 2023-12-11 | Github | - |
| Gemini: A Family of Highly Capable Multimodal Models | Google | 2023-12-06 | - | - |
| OneLLM: One Framework to Align All Modalities with Language | arXiv | 2023-12-06 | Github | Demo |
| Lenna: Language Enhanced Reasoning Detection Assistant | arXiv | 2023-12-05 | Github | - |
| VaQuitA: Enhancing Alignment in LLM-Assisted Video Understanding | arXiv | 2023-12-04 | - | - |
| TimeChat: A Time-sensitive Multimodal Large Language Model for Long Video Understanding | arXiv | 2023-12-04 | Github | Local Demo |
| Making Large Multimodal Models Understand Arbitrary Visual Prompts | CVPR | 2023-12-01 | Github | Demo |
| Dolphins: Multimodal Language Model for Driving | arXiv | 2023-12-01 | Github | - |
| LL3DA: Visual Interactive Instruction Tuning for Omni-3D Understanding, Reasoning, and Planning | arXiv | 2023-11-30 | Github | Coming soon |
| VTimeLLM: Empower LLM to Grasp Video Moments | arXiv | 2023-11-30 | Github | Local Demo |
| mPLUG-PaperOwl: Scientific Diagram Analysis with the Multimodal Large Language Model | arXiv | 2023-11-30 | Github | - |
| LLaMA-VID: An Image is Worth 2 Tokens in Large Language Models | arXiv | 2023-11-28 | Github | Coming soon |
| LLMGA: Multimodal Large Language Model based Generation Assistant | arXiv | 2023-11-27 | Github | Demo |
| ChartLlama: A Multimodal LLM for Chart Understanding and Generation | arXiv | 2023-11-27 | Github | - |
| ShareGPT4V: Improving Large Multi-Modal Models with Better Captions | arXiv | 2023-11-21 | Github | Demo |
| LION: Empowering Multimodal Large Language Model with Dual-Level Visual Knowledge | arXiv | 2023-11-20 | Github | - |
| An Embodied Generalist Agent in 3D World | arXiv | 2023-11-18 | Github | Demo |
| Video-LLaVA: Learning United Visual Representation by Alignment Before Projection | arXiv | 2023-11-16 | Github | Demo |
| Chat-UniVi: Unified Visual Representation Empowers Large Language Models with Image and Video Understanding | CVPR | 2023-11-14 | Github | - |
| To See is to Believe: Prompting GPT-4V for Better Visual Instruction Tuning | arXiv | 2023-11-13 | Github | - |
| SPHINX: The Joint Mixing of Weights, Tasks, and Visual Embeddings for Multi-modal Large Language Models | arXiv | 2023-11-13 | Github | Demo |
| Monkey: Image Resolution and Text Label Are Important Things for Large Multi-modal Models | CVPR | 2023-11-11 | Github | Demo |
| LLaVA-Plus: Learning to Use Tools for Creating Multimodal Agents | arXiv | 2023-11-09 | Github | Demo |
| NExT-Chat: An LMM for Chat, Detection and Segmentation | arXiv | 2023-11-08 | Github | Local Demo |
| mPLUG-Owl2: Revolutionizing Multi-modal Large Language Model with Modality Collaboration | arXiv | 2023-11-07 | Github | Demo |
| OtterHD: A High-Resolution Multi-modality Model | arXiv | 2023-11-07 | Github | - |
| CoVLM: Composing Visual Entities and Relationships in Large Language Models Via Communicative Decoding | arXiv | 2023-11-06 | Coming soon | - |
| GLaMM: Pixel Grounding Large Multimodal Model | CVPR | 2023-11-06 | Github | Demo |
| What Makes for Good Visual Instructions? Synthesizing Complex Visual Reasoning Instructions for Visual Instruction Tuning | arXiv | 2023-11-02 | Github | - |
| MiniGPT-v2: large language model as a unified interface for vision-language multi-task learning | arXiv | 2023-10-14 | Github | Local Demo |
| Ferret: Refer and Ground Anything Anywhere at Any Granularity | arXiv | 2023-10-11 | Github | - |
| CogVLM: Visual Expert For Large Language Models | arXiv | 2023-10-09 | Github | Demo |
| Improved Baselines with Visual Instruction Tuning | arXiv | 2023-10-05 | Github | Demo |
| LanguageBind: Extending Video-Language Pretraining to N-modality by Language-based Semantic Alignment | ICLR | 2023-10-03 | Github | Demo |
| Pink: Unveiling the Power of Referential Comprehension for Multi-modal LLMs | arXiv | 2023-10-01 | Github | - |
| Reformulating Vision-Language Foundation Models and Datasets Towards Universal Multimodal Assistants | arXiv | 2023-10-01 | Github | Local Demo |
| AnyMAL: An Efficient and Scalable Any-Modality Augmented Language Model | arXiv | 2023-09-27 | - | - |
| InternLM-XComposer: A Vision-Language Large Model for Advanced Text-image Comprehension and Composition | arXiv | 2023-09-26 | Github | Local Demo |
| DreamLLM: Synergistic Multimodal Comprehension and Creation | ICLR | 2023-09-20 | Github | Coming soon |
| An Empirical Study of Scaling Instruction-Tuned Large Multimodal Models | arXiv | 2023-09-18 | Coming soon | - |
| TextBind: Multi-turn Interleaved Multimodal Instruction-following | arXiv | 2023-09-14 | Github | Demo |
| Sight Beyond Text: Multi-Modal Training Enhances LLMs in Truthfulness and Ethics | arXiv | 2023-09-13 | Github | - |
| NExT-GPT: Any-to-Any Multimodal LLM | arXiv | 2023-09-11 | Github | Demo |
| ImageBind-LLM: Multi-modality Instruction Tuning | arXiv | 2023-09-07 | Github | Demo |
| Scaling Autoregressive Multi-Modal Models: Pretraining and Instruction Tuning | arXiv | 2023-09-05 | - | - |
| PointLLM: Empowering Large Language Models to Understand Point Clouds | arXiv | 2023-08-31 | Github | Demo |
| ✨Sparkles: Unlocking Chats Across Multiple Images for Multimodal Instruction-Following Models | arXiv | 2023-08-31 | Github | Local Demo |
| MLLM-DataEngine: An Iterative Refinement Approach for MLLM | arXiv | 2023-08-25 | Github | - |
| Position-Enhanced Visual Instruction Tuning for Multimodal Large Language Models | arXiv | 2023-08-25 | Github | Demo |
| Qwen-VL: A Frontier Large Vision-Language Model with Versatile Abilities | arXiv | 2023-08-24 | Github | Demo |
| Large Multilingual Models Pivot Zero-Shot Multimodal Learning across Languages | ICLR | 2023-08-23 | Github | Demo |
| StableLLaVA: Enhanced Visual Instruction Tuning with Synthesized Image-Dialogue Data | arXiv | 2023-08-20 | Github | - |
| BLIVA: A Simple Multimodal LLM for Better Handling of Text-rich Visual Questions | arXiv | 2023-08-19 | Github | Demo |
| Fine-tuning Multimodal LLMs to Follow Zero-shot Demonstrative Instructions | arXiv | 2023-08-08 | Github | - |
| The All-Seeing Project: Towards Panoptic Visual Recognition and Understanding of the Open World | ICLR | 2023-08-03 | Github | Demo |
| LISA: Reasoning Segmentation via Large Language Model | arXiv | 2023-08-01 | Github | Demo |
| MovieChat: From Dense Token to Sparse Memory for Long Video Understanding | arXiv | 2023-07-31 | Github | Local Demo |
| 3D-LLM: Injecting the 3D World into Large Language Models | arXiv | 2023-07-24 | Github | - |
| ChatSpot: Bootstrapping Multimodal LLMs via Precise Referring Instruction Tuning | arXiv | 2023-07-18 | - | Demo |
| BuboGPT: Enabling Visual Grounding in Multi-Modal LLMs | arXiv | 2023-07-17 | Github | Demo |
| SVIT: Scaling up Visual Instruction Tuning | arXiv | 2023-07-09 | Github | - |
| GPT4RoI: Instruction Tuning Large Language Model on Region-of-Interest | arXiv | 2023-07-07 | Github | Demo |
| What Matters in Training a GPT4-Style Language Model with Multimodal Inputs? | arXiv | 2023-07-05 | Github | - |
| mPLUG-DocOwl: Modularized Multimodal Large Language Model for Document Understanding | arXiv | 2023-07-04 | Github | Demo |
| Visual Instruction Tuning with Polite Flamingo | arXiv | 2023-07-03 | Github | Demo |
| LLaVAR: Enhanced Visual Instruction Tuning for Text-Rich Image Understanding | arXiv | 2023-06-29 | Github | Demo |
| Shikra: Unleashing Multimodal LLM's Referential Dialogue Magic | arXiv | 2023-06-27 | Github | Demo |
| MotionGPT: Human Motion as a Foreign Language | arXiv | 2023-06-26 | Github | - |
| Macaw-LLM: Multi-Modal Language Modeling with Image, Audio, Video, and Text Integration | arXiv | 2023-06-15 | Github | Coming soon |
| LAMM: Language-Assisted Multi-Modal Instruction-Tuning Dataset, Framework, and Benchmark | arXiv | 2023-06-11 | Github | Demo |
| Video-ChatGPT: Towards Detailed Video Understanding via Large Vision and Language Models | arXiv | 2023-06-08 | Github | Demo |
| MIMIC-IT: Multi-Modal In-Context Instruction Tuning | arXiv | 2023-06-08 | Github | Demo |
| M3IT: A Large-Scale Dataset towards Multi-Modal Multilingual Instruction Tuning | arXiv | 2023-06-07 | - | - |
| Video-LLaMA: An Instruction-tuned Audio-Visual Language Model for Video Understanding | arXiv | 2023-06-05 | Github | Demo |
| LLaVA-Med: Training a Large Language-and-Vision Assistant for Biomedicine in One Day | arXiv | 2023-06-01 | Github | - |
| GPT4Tools: Teaching Large Language Model to Use Tools via Self-instruction | arXiv | 2023-05-30 | Github | Demo |
| PandaGPT: One Model To Instruction-Follow Them All | arXiv | 2023-05-25 | Github | Demo |
| ChatBridge: Bridging Modalities with Large Language Model as a Language Catalyst | arXiv | 2023-05-25 | Github | - |
| Cheap and Quick: Efficient Vision-Language Instruction Tuning for Large Language Models | arXiv | 2023-05-24 | Github | Local Demo |
| DetGPT: Detect What You Need via Reasoning | arXiv | 2023-05-23 | Github | Demo |
| Pengi: An Audio Language Model for Audio Tasks | NeurIPS | 2023-05-19 | Github | - |
| VisionLLM: Large Language Model is also an Open-Ended Decoder for Vision-Centric Tasks | arXiv | 2023-05-18 | Github | - |
| Listen, Think, and Understand | arXiv | 2023-05-18 | Github | Demo |
| VisualGLM-6B | - | 2023-05-17 | Github | Local Demo |
| PMC-VQA: Visual Instruction Tuning for Medical Visual Question Answering | arXiv | 2023-05-17 | Github | - |
| InstructBLIP: Towards General-purpose Vision-Language Models with Instruction Tuning | arXiv | 2023-05-11 | Github | Local Demo |
| VideoChat: Chat-Centric Video Understanding | arXiv | 2023-05-10 | Github | Demo |
| MultiModal-GPT: A Vision and Language Model for Dialogue with Humans | arXiv | 2023-05-08 | Github | Demo |
| X-LLM: Bootstrapping Advanced Large Language Models by Treating Multi-Modalities as Foreign Languages | arXiv | 2023-05-07 | Github | - |
| LMEye: An Interactive Perception Network for Large Language Models | arXiv | 2023-05-05 | Github | Local Demo |
| LLaMA-Adapter V2: Parameter-Efficient Visual Instruction Model | arXiv | 2023-04-28 | Github | Demo |
| mPLUG-Owl: Modularization Empowers Large Language Models with Multimodality | arXiv | 2023-04-27 | Github | Demo |
| MiniGPT-4: Enhancing Vision-Language Understanding with Advanced Large Language Models | arXiv | 2023-04-20 | Github | - |
| Visual Instruction Tuning | NeurIPS | 2023-04-17 | GitHub | Demo |
| LLaMA-Adapter: Efficient Fine-tuning of Language Models with Zero-init Attention | ICLR | 2023-03-28 | Github | Demo |
| MultiInstruct: Improving Multi-Modal Zero-Shot Learning via Instruction Tuning | ACL | 2022-12-21 | Github | - |

Multimodal Hallucination

| Title | Venue | Date | Code | Demo |
|:------|:-----:|:----:|:----:|:----:|
| FIHA: Autonomous Hallucination Evaluation in Vision-Language Models with Davidson Scene Graphs | arXiv | 2024-09-20 | Link | - |
| Alleviating Hallucination in Large Vision-Language Models with Active Retrieval Augmentation | arXiv | 2024-08-01 | - | - |
| Paying More Attention to Image: A Training-Free Method for Alleviating Hallucination in LVLMs | ECCV | 2024-07-31 | Github | - |
| Evaluating and Analyzing Relationship Hallucinations in LVLMs | ICML | 2024-06-24 | Github | - |
| AGLA: Mitigating Object Hallucinations in Large Vision-Language Models with Assembly of Global and Local Attention | arXiv | 2024-06-18 | Github | - |
| CODE: Contrasting Self-generated Description to Combat Hallucination in Large Multi-modal Models | arXiv | 2024-06-04 | Coming soon | - |
| VDGD: Mitigating LVLM Hallucinations in Cognitive Prompts by Bridging the Visual Perception Gap | arXiv | 2024-05-24 | Coming soon | - |
| Detecting and Mitigating Hallucination in Large Vision Language Models via Fine-Grained AI Feedback | arXiv | 2024-04-22 | - | - |
| Mitigating Hallucinations in Large Vision-Language Models with Instruction Contrastive Decoding | arXiv | 2024-03-27 | - | - |
| What if...?: Counterfactual Inception to Mitigate Hallucination Effects in Large Multimodal Models | arXiv | 2024-03-20 | Github | - |
| Strengthening Multimodal Large Language Model with Bootstrapped Preference Optimization | arXiv | 2024-03-13 | - | - |
| Debiasing Multimodal Large Language Models | arXiv | 2024-03-08 | Github | - |
| HALC: Object Hallucination Reduction via Adaptive Focal-Contrast Decoding | arXiv | 2024-03-01 | Github | - |
| IBD: Alleviating Hallucinations in Large Vision-Language Models via Image-Biased Decoding | arXiv | 2024-02-28 | - | - |
| Less is More: Mitigating Multimodal Hallucination from an EOS Decision Perspective | arXiv | 2024-02-22 | Github | - |
| Logical Closed Loop: Uncovering Object Hallucinations in Large Vision-Language Models | arXiv | 2024-02-18 | Github | - |
| The Instinctive Bias: Spurious Images lead to Hallucination in MLLMs | arXiv | 2024-02-06 | Github | - |
| Unified Hallucination Detection for Multimodal Large Language Models | arXiv | 2024-02-05 | Github | - |
| A Survey on Hallucination in Large Vision-Language Models | arXiv | 2024-02-01 | - | - |
| Temporal Insight Enhancement: Mitigating Temporal Hallucination in Multimodal Large Language Models | arXiv | 2024-01-18 | - | - |
| Hallucination Augmented Contrastive Learning for Multimodal Large Language Model | arXiv | 2023-12-12 | Github | - |
| MOCHa: Multi-Objective Reinforcement Mitigating Caption Hallucinations | arXiv | 2023-12-06 | Github | - |
| Mitigating Fine-Grained Hallucination by Fine-Tuning Large Vision-Language Models with Caption Rewrites | arXiv | 2023-12-04 | Github | - |
| RLHF-V: Towards Trustworthy MLLMs via Behavior Alignment from Fine-grained Correctional Human Feedback | arXiv | 2023-12-01 | Github | Demo |
| OPERA: Alleviating Hallucination in Multi-Modal Large Language Models via Over-Trust Penalty and Retrospection-Allocation | CVPR | 2023-11-29 | Github | - |
| Mitigating Object Hallucinations in Large Vision-Language Models through Visual Contrastive Decoding | CVPR | 2023-11-28 | Github | - |
| Beyond Hallucinations: Enhancing LVLMs through Hallucination-Aware Direct Preference Optimization | arXiv | 2023-11-28 | Github | Coming soon |
| Mitigating Hallucination in Visual Language Models with Visual Supervision | arXiv | 2023-11-27 | - | - |
| HalluciDoctor: Mitigating Hallucinatory Toxicity in Visual Instruction Data | arXiv | 2023-11-22 | Github | - |
| An LLM-free Multi-dimensional Benchmark for MLLMs Hallucination Evaluation | arXiv | 2023-11-13 | Github | - |
| FAITHSCORE: Evaluating Hallucinations in Large Vision-Language Models | arXiv | 2023-11-02 | Github | - |
| Woodpecker: Hallucination Correction for Multimodal Large Language Models | arXiv | 2023-10-24 | Github | Demo |
| Negative Object Presence Evaluation (NOPE) to Measure Object Hallucination in Vision-Language Models | arXiv | 2023-10-09 | - | - |
| HallE-Switch: Rethinking and Controlling Object Existence Hallucinations in Large Vision Language Models for Detailed Caption | arXiv | 2023-10-03 | Github | - |
| Analyzing and Mitigating Object Hallucination in Large Vision-Language Models | ICLR | 2023-10-01 | Github | - |
| Aligning Large Multimodal Models with Factually Augmented RLHF | arXiv | 2023-09-25 | Github | Demo |
| Evaluation and Mitigation of Agnosia in Multimodal Large Language Models | arXiv | 2023-09-07 | - | - |
| CIEM: Contrastive Instruction Evaluation Method for Better Instruction Tuning | arXiv | 2023-09-05 | - | - |
| Evaluation and Analysis of Hallucination in Large Vision-Language Models | arXiv | 2023-08-29 | Github | - |
| VIGC: Visual Instruction Generation and Correction | arXiv | 2023-08-24 | Github | Demo |
| Detecting and Preventing Hallucinations in Large Vision Language Models | arXiv | 2023-08-11 | - | - |
| Mitigating Hallucination in Large Multi-Modal Models via Robust Instruction Tuning | ICLR | 2023-06-26 | Github | Demo |
| Evaluating Object Hallucination in Large Vision-Language Models | EMNLP | 2023-05-17 | Github | - |

Multimodal In-Context Learning

| Title | Venue | Date | Code | Demo |
|:------|:-----:|:----:|:----:|:----:|
| Visual In-Context Learning for Large Vision-Language Models | arXiv | 2024-02-18 | - | - |
| Can MLLMs Perform Text-to-Image In-Context Learning? | arXiv | 2024-02-02 | Github | - |
| Generative Multimodal Models are In-Context Learners | CVPR | 2023-12-20 | Github | Demo |
| Hijacking Context in Large Multi-modal Models | arXiv | 2023-12-07 | - | - |
| Towards More Unified In-context Visual Understanding | arXiv | 2023-12-05 | - | - |
| MMICL: Empowering Vision-language Model with Multi-Modal In-Context Learning | arXiv | 2023-09-14 | Github | Demo |
| Link-Context Learning for Multimodal LLMs | arXiv | 2023-08-15 | Github | Demo |
| OpenFlamingo: An Open-Source Framework for Training Large Autoregressive Vision-Language Models | arXiv | 2023-08-02 | Github | Demo |
| Med-Flamingo: a Multimodal Medical Few-shot Learner | arXiv | 2023-07-27 | Github | Local Demo |
| Generative Pretraining in Multimodality | ICLR | 2023-07-11 | Github | Demo |
| AVIS: Autonomous Visual Information Seeking with Large Language Models | arXiv | 2023-06-13 | - | - |
| MIMIC-IT: Multi-Modal In-Context Instruction Tuning | arXiv | 2023-06-08 | Github | Demo |
| Exploring Diverse In-Context Configurations for Image Captioning | NeurIPS | 2023-05-24 | Github | - |
| Chameleon: Plug-and-Play Compositional Reasoning with Large Language Models | arXiv | 2023-04-19 | Github | Demo |
| HuggingGPT: Solving AI Tasks with ChatGPT and its Friends in HuggingFace | arXiv | 2023-03-30 | Github | Demo |
| MM-REACT: Prompting ChatGPT for Multimodal Reasoning and Action | arXiv | 2023-03-20 | Github | Demo |
| ICL-D3IE: In-Context Learning with Diverse Demonstrations Updating for Document Information Extraction | ICCV | 2023-03-09 | Github | - |
| Prompting Large Language Models with Answer Heuristics for Knowledge-based Visual Question Answering | CVPR | 2023-03-03 | Github | - |
| Visual Programming: Compositional visual reasoning without training | CVPR | 2022-11-18 | Github | Local Demo |
| An Empirical Study of GPT-3 for Few-Shot Knowledge-Based VQA | AAAI | 2022-06-28 | Github | - |
| Flamingo: a Visual Language Model for Few-Shot Learning | NeurIPS | 2022-04-29 | Github | Demo |
| Multimodal Few-Shot Learning with Frozen Language Models | NeurIPS | 2021-06-25 | - | - |

Multimodal Chain-of-Thought

| Title | Venue | Date | Code | Demo |
|:------|:-----:|:----:|:----:|:----:|
| Cantor: Inspiring Multimodal Chain-of-Thought of MLLM | arXiv | 2024-04-24 | Github | Local Demo |
| Visual CoT: Unleashing Chain-of-Thought Reasoning in Multi-Modal Language Models | arXiv | 2024-03-25 | Github | Local Demo |
| DDCoT: Duty-Distinct Chain-of-Thought Prompting for Multimodal Reasoning in Language Models | NeurIPS | 2023-10-25 | Github | - |
| Shikra: Unleashing Multimodal LLM's Referential Dialogue Magic | arXiv | 2023-06-27 | Github | Demo |
| Explainable Multimodal Emotion Reasoning | arXiv | 2023-06-27 | Github | - |
| EmbodiedGPT: Vision-Language Pre-Training via Embodied Chain of Thought | arXiv | 2023-05-24 | Github | - |
| Let's Think Frame by Frame: Evaluating Video Chain of Thought with Video Infilling and Prediction | arXiv | 2023-05-23 | - | - |
| T-SciQ: Teaching Multimodal Chain-of-Thought Reasoning via Large Language Model Signals for Science Question Answering | arXiv | 2023-05-05 | - | - |
| Caption Anything: Interactive Image Description with Diverse Multimodal Controls | arXiv | 2023-05-04 | Github | Demo |
| Visual Chain of Thought: Bridging Logical Gaps with Multimodal Infillings | arXiv | 2023-05-03 | Coming soon | - |