MM-LLMs: Recent Advances in MultiModal Large Language Models (2401.13601v5)
Abstract: Over the past year, MultiModal LLMs (MM-LLMs) have advanced substantially, augmenting off-the-shelf LLMs to support multimodal inputs or outputs via cost-effective training strategies. The resulting models not only preserve the inherent reasoning and decision-making capabilities of LLMs but also support a diverse range of MM tasks. In this paper, we provide a comprehensive survey aimed at facilitating further research on MM-LLMs. First, we outline general design formulations for model architecture and the training pipeline. We then introduce a taxonomy encompassing 126 MM-LLMs, each characterized by its specific formulations. Furthermore, we review the performance of selected MM-LLMs on mainstream benchmarks and summarize key training recipes for strengthening MM-LLMs. Finally, we explore promising directions for MM-LLMs while maintaining a real-time tracking website for the latest developments in the field. We hope this survey contributes to the ongoing advancement of the MM-LLMs domain.
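To make the "general design formulation" mentioned above concrete, the sketch below illustrates the architecture most surveyed MM-LLMs share: a (typically frozen) modality encoder, a small trainable input projector that maps encoder features into the LLM's token-embedding space, and a (typically frozen) LLM backbone. This is a minimal, illustrative PyTorch sketch under those assumptions; the class names, dimensions, and the toy encoder/embedding stand-ins are hypothetical and not taken from the paper.

```python
# Minimal sketch of the canonical MM-LLM design: frozen modality encoder,
# trainable input projector, frozen LLM backbone. All names, dimensions,
# and the toy encoder/embedding stand-ins are illustrative assumptions.
import torch
import torch.nn as nn


class InputProjector(nn.Module):
    """Maps modality-encoder features into the LLM's token-embedding space."""

    def __init__(self, enc_dim: int, llm_dim: int):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(enc_dim, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        return self.proj(feats)


class ToyMMLLM(nn.Module):
    """Stand-in for encoder + projector + LLM; only the projector is trained."""

    def __init__(self, enc_dim: int = 1024, llm_dim: int = 4096):
        super().__init__()
        self.encoder = nn.Linear(3 * 224 * 224, enc_dim)    # placeholder vision encoder
        self.projector = InputProjector(enc_dim, llm_dim)   # trainable connector
        self.llm_embed = nn.Embedding(32000, llm_dim)       # placeholder LLM embedding table
        # Freeze the encoder and LLM parts; only the projector stays trainable,
        # reflecting the cost-effective training recipes the survey discusses.
        for p in self.encoder.parameters():
            p.requires_grad = False
        for p in self.llm_embed.parameters():
            p.requires_grad = False

    def forward(self, image: torch.Tensor, text_ids: torch.Tensor) -> torch.Tensor:
        feats = self.encoder(image.flatten(1)).unsqueeze(1)   # (B, 1, enc_dim)
        visual_tokens = self.projector(feats)                 # (B, 1, llm_dim)
        text_tokens = self.llm_embed(text_ids)                # (B, T, llm_dim)
        # Prepend the projected visual tokens to the text sequence fed to the LLM.
        return torch.cat([visual_tokens, text_tokens], dim=1)


if __name__ == "__main__":
    model = ToyMMLLM()
    fused = model(torch.randn(2, 3, 224, 224), torch.randint(0, 32000, (2, 8)))
    print(fused.shape)  # torch.Size([2, 9, 4096])
```

In real systems the placeholder encoder would be a pretrained model such as a CLIP ViT, and the fused sequence would be consumed by a pretrained LLM; output-side projectors and modality generators extend the same pattern to MM outputs.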