Emergent Mind

MM-LLMs: Recent Advances in MultiModal Large Language Models

(2401.13601)
Published Jan 24, 2024 in cs.CL

Abstract

In the past year, MultiModal LLMs (MM-LLMs) have undergone substantial advancements, augmenting off-the-shelf LLMs to support MM inputs or outputs via cost-effective training strategies. The resulting models not only preserve the inherent reasoning and decision-making capabilities of LLMs but also empower a diverse range of MM tasks. In this paper, we provide a comprehensive survey aimed at facilitating further research of MM-LLMs. Initially, we outline general design formulations for model architecture and training pipeline. Subsequently, we introduce a taxonomy encompassing 122 MM-LLMs, each characterized by its specific formulations. Furthermore, we review the performance of selected MM-LLMs on mainstream benchmarks and summarize key training recipes to enhance the potency of MM-LLMs. Finally, we explore promising directions for MM-LLMs while concurrently maintaining a real-time tracking website for the latest developments in the field. We hope that this survey contributes to the ongoing advancement of the MM-LLMs domain.

[Figure: General architecture of MM-LLMs, with implementation choices detailed for each component.]

Overview

  • MM-LLMs build on pre-trained unimodal models, preserving natural language understanding while adding the ability to process and generate multimodal content at modest training cost.

  • MM-LLMs are structured around five architectural components and trained via multimodal pre-training and instruction tuning, with an emphasis on training efficiency.

  • Diverse MM-LLMs such as Flamingo, BLIP-2, MiniGPT-4, NExT-GPT, and CoDi-2 target varying MM tasks, demonstrating advanced capabilities in both understanding and generation.

  • Performance is measured across standardized benchmarks, indicating model effectiveness and outlining avenues for further enhancements.

  • Future research in MM-LLMs is heading toward more challenging benchmarks, extending models with additional modalities and stronger LLM backbones, and narrowing the gap between current systems and human-like intelligence.

Introduction

The field of MultiModal LLMs (MM-LLMs) has seen significant expansion, leveraging pre-trained unimodal models to mitigate the computational costs associated with training from scratch. These models not only excel in natural language understanding and generation but also in processing and generating MultiModal (MM) content, thus advancing closer to artificial general intelligence.

Architectural Composition and Training Pipeline

MM-LLMs are composed of five architectural elements: Modality Encoder, Input Projector, LLM Backbone, Output Projector, and Modality Generator. The range of modalities these components handle underscores the complexity and capability of MM-LLMs. The training pipeline is split into MM Pre-Training (PT), which aligns non-text modalities with the LLM's textual space, and MM Instruction-Tuning (IT), which adapts the model to follow multimodal instructions. A notable shift in the field is a renewed focus on training strategies that optimize efficiency, given the exorbitant cost of training MM-LLMs from scratch: typically only the projectors (and sometimes lightweight adapters) are updated, while the encoders, generator, and LLM backbone remain frozen.
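The five-component data flow can be sketched as follows. This is a minimal toy illustration, not the paper's code: every function, class, and dimension here is an invented stand-in (e.g. the "encoder" just emits random features), and real systems use neural modules with dimensions in the hundreds or thousands. Only the two projector matrices would be trainable; the encoder, LLM backbone, and generator stay frozen.

```python
import random

random.seed(0)
D_IMG, D_LLM = 8, 12  # toy feature dimensions (illustrative only)

def rand_matrix(rows, cols):
    return [[random.gauss(0, 0.02) for _ in range(cols)] for _ in range(rows)]

def matmul(A, B):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

# 1. Modality Encoder (frozen): stand-in for e.g. a vision encoder; image -> patch features
def modality_encoder(image_path):
    return [[random.gauss(0, 1) for _ in range(D_IMG)] for _ in range(4)]

# 2. Input Projector (trainable): aligns encoder features with the LLM embedding space
W_in = rand_matrix(D_IMG, D_LLM)

# 3. LLM Backbone (frozen): stand-in that pools aligned embeddings into one hidden state
def llm_backbone(embeds):
    n = len(embeds)
    return [[sum(row[j] for row in embeds) / n for j in range(D_LLM)]]

# 4. Output Projector (trainable): maps LLM hidden states to generator conditioning signals
W_out = rand_matrix(D_LLM, D_IMG)

# 5. Modality Generator (frozen): stand-in for e.g. a diffusion decoder
def modality_generator(cond):
    return f"<generated image conditioned on {len(cond[0])}-dim signal>"

feats = modality_encoder("cat.png")   # 4 patches x D_IMG
aligned = matmul(feats, W_in)         # 4 x D_LLM, now in the LLM's embedding space
hidden = llm_backbone(aligned)        # 1 x D_LLM
cond = matmul(hidden, W_out)          # 1 x D_IMG conditioning signal
print(modality_generator(cond))
```

Training only `W_in` and `W_out` while the heavyweight modules stay frozen is what makes the "cost-effective training strategies" in the abstract possible.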

State-of-the-Art Models

A broad spectrum of MM-LLMs, each with unique features, has been introduced to address various MM tasks. Models like Flamingo and BLIP-2 emphasize MM understanding, taking multimodal input and generating text. Others extend generation beyond text: MiniGPT-5, for instance, builds on MiniGPT-4's image understanding and adds image generation. More recent designs such as NExT-GPT and CoDi-2 aim at end-to-end any-to-any MM systems that avoid chaining separate models into a cascade.
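The cascade versus end-to-end distinction can be caricatured as follows. All functions below are hypothetical stand-ins (none are real APIs of these models); the point is only the structural difference, namely that a cascade squeezes all cross-module information through a lossy text bottleneck, whereas an end-to-end system passes continuous representations between modules and can be tuned jointly.

```python
# --- Cascaded MM system: independent models glued together by text ---
def caption_model(image):
    return "a cat on a mat"            # stand-in image-to-text model

def llm(prompt):
    return f"edited: {prompt}"         # stand-in text-only LLM

def text_to_image(prompt):
    return f"<image of '{prompt}'>"    # stand-in text-to-image model

def cascaded(image, instruction):
    caption = caption_model(image)     # everything the LLM sees is this lossy caption
    plan = llm(f"{instruction}. Scene: {caption}")
    return text_to_image(plan)

# --- End-to-end MM-LLM (NExT-GPT-style, heavily simplified) ---
def end_to_end(image_feats, instruction):
    # a projector + LLM stand-in: continuous features flow through unchanged in
    # dimensionality, so no information is forced through intermediate text
    hidden = [f + 0.1 for f in image_feats]
    return f"<image decoded from {len(hidden)}-dim hidden state>"

print(cascaded("img.png", "make it blue"))
print(end_to_end([0.5, 1.5], "make it blue"))
```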

Benchmarks and Emerging Directions

The performance of MM-LLMs has been assessed across numerous mainstream benchmarks, providing insight into model effectiveness and guiding future enhancements. Promising trajectories include adding modalities and stronger LLM backbones, improving datasets, and progressing toward any-to-any modality conversion. The survey also calls for more comprehensive, practical, and challenging benchmarks to evaluate MM-LLMs thoroughly. Further directions, such as deploying lightweight models, integrating embodied intelligence, and advancing continual IT, sketch a roadmap for future research.

By modeling the interplay between modalities and harnessing the power of pre-existing LLMs, MM-LLMs continue to expand the capabilities of AI systems, drawing them closer to human-like intelligence within practical computational limits. This survey serves as a compass for researchers navigating the MM-LLMs landscape, marking pathways to terrain that awaits exploration.
