
MM1: Methods, Analysis & Insights from Multimodal LLM Pre-training

(2403.09611)
Published Mar 14, 2024 in cs.CV, cs.CL, and cs.LG

Abstract

In this work, we discuss building performant Multimodal LLMs (MLLMs). In particular, we study the importance of various architecture components and data choices. Through careful and comprehensive ablations of the image encoder, the vision-language connector, and various pre-training data choices, we identify several crucial design lessons. For example, we demonstrate that, for large-scale multimodal pre-training, using a careful mix of image-caption, interleaved image-text, and text-only data is crucial for achieving state-of-the-art (SOTA) few-shot results across multiple benchmarks, compared to other published pre-training results. Further, we show that the image encoder, together with the image resolution and the image token count, has substantial impact, while the vision-language connector design is of comparatively negligible importance. By scaling up the presented recipe, we build MM1, a family of multimodal models up to 30B parameters, including both dense models and mixture-of-experts (MoE) variants, that are SOTA in pre-training metrics and achieve competitive performance after supervised fine-tuning on a range of established multimodal benchmarks. Thanks to large-scale pre-training, MM1 enjoys appealing properties such as enhanced in-context learning and multi-image reasoning, enabling few-shot chain-of-thought prompting.

MM1 demonstrates instruction following and reasoning across images, showcased with examples from VILA.

Overview

  • The paper discusses the development and evaluation of MM1, a series of scaled Multimodal LLMs (MLLMs), and explores architectural, data, and procedural approaches to enhance model performance.

  • It finds that image resolution, model size, and the richness of pre-training data matter most for the image encoder, whereas the architecture of the vision-language connector has comparatively little impact.

  • The study highlights the value of a diverse data mixture: captioning data drives zero-shot performance, interleaved image-text and text-only data are key for few-shot performance, and a small amount of synthetic caption data gives a further few-shot boost.

  • Experiments with gradient clipping, learning rate scheduling, and a mixture-of-experts (MoE) variant highlight effective training procedures and scaling mechanisms (a minimal MoE layer sketch follows this list).
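
The MoE variant mentioned in the last bullet replaces dense feed-forward blocks in the LLM with sparsely activated expert layers. The following is a minimal sketch of a generic top-k token-routing MoE feed-forward layer in PyTorch; the hidden sizes, expert count, and routing details are illustrative assumptions, not MM1's actual configuration.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class MoEFeedForward(nn.Module):
        """Sparsely activated feed-forward layer: each token is routed to its top-k experts."""
        def __init__(self, d_model=1024, d_hidden=4096, num_experts=8, top_k=2):
            super().__init__()
            self.top_k = top_k
            self.router = nn.Linear(d_model, num_experts)            # token -> expert logits
            self.experts = nn.ModuleList(
                nn.Sequential(nn.Linear(d_model, d_hidden), nn.GELU(), nn.Linear(d_hidden, d_model))
                for _ in range(num_experts)
            )

        def forward(self, x):                                        # x: (batch, seq, d_model)
            scores = F.softmax(self.router(x), dim=-1)               # routing probabilities
            weights, idx = scores.topk(self.top_k, dim=-1)           # keep only top-k experts per token
            weights = weights / weights.sum(dim=-1, keepdim=True)    # renormalize the kept weights
            out = torch.zeros_like(x)
            for k in range(self.top_k):
                for e, expert in enumerate(self.experts):
                    mask = idx[..., k] == e                          # tokens routed to expert e at rank k
                    if mask.any():
                        out[mask] += weights[..., k][mask].unsqueeze(-1) * expert(x[mask])
            return out

Because only k experts run per token, total parameter count can grow with the number of experts while per-token compute stays roughly fixed, which is the scaling appeal of MoE models.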

MM1: Insights from Multimodal Large Language Model Pre-training

Introduction

The fusion of natural language processing and computer vision within the framework of LLMs marks a significant step towards more generalized artificial intelligence systems. These Multimodal LLMs (MLLMs) are capable of understanding and generating content that spans both text and visual inputs, aiming to achieve a comprehensive understanding similar to human cognitive abilities. This paper presents the construction and evaluation of MM1, a series of scaled multimodal models, shedding light on various architectural, data-related, and procedural nuances critical to developing state-of-the-art MLLMs.

Model Architecture

Through meticulous experimentation on the architectural components of MLLMs, particularly the image encoder and the vision-language connector, the study finds that image resolution markedly influences model performance. It suggests a priority order for design considerations: image resolution has the largest impact, followed by image encoder size and the richness of its pre-training data, with the number of image tokens also carrying substantial weight. The architecture of the vision-language connector, by contrast, has surprisingly little influence on the model's efficacy, leaving broad latitude in connector design without significantly compromising performance.
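
To make this layout concrete, the sketch below wires together an image encoder, a simple average-pooling connector that controls the image token count, and a decoder-only LLM in PyTorch. The dimensions, the pooling connector, and the inputs_embeds interface are illustrative assumptions; the paper ablates several encoder and connector designs rather than prescribing this one.

    import torch
    import torch.nn as nn

    class VisionLanguageModel(nn.Module):
        """Minimal MLLM layout: image encoder -> vision-language connector -> decoder-only LLM.
        The connector here is plain average pooling that reduces the visual sequence to a fixed
        number of image tokens before projecting it into the LLM embedding space."""
        def __init__(self, image_encoder, llm, d_vision=1024, d_model=4096, num_image_tokens=64):
            super().__init__()
            self.image_encoder = image_encoder             # e.g. a ViT returning (B, patches, d_vision)
            self.llm = llm                                 # assumed to accept precomputed input embeddings
            self.pool = nn.AdaptiveAvgPool1d(num_image_tokens)
            self.projector = nn.Linear(d_vision, d_model)  # maps visual features into LLM token space

        def forward(self, images, text_embeddings):
            feats = self.image_encoder(images)                         # (B, patches, d_vision)
            feats = self.pool(feats.transpose(1, 2)).transpose(1, 2)   # (B, num_image_tokens, d_vision)
            image_tokens = self.projector(feats)                       # (B, num_image_tokens, d_model)
            # Prepend image tokens to the text embedding sequence and run the LLM over the result.
            inputs = torch.cat([image_tokens, text_embeddings], dim=1)
            return self.llm(inputs_embeds=inputs)

Note how the two high-impact factors interact here: a higher input resolution yields more encoder patches, while the connector decides how many image tokens the LLM ultimately sees.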

Data Considerations

A critical discovery of this research is the differential impact of data types on model performance. While captioning data most strongly improves zero-shot capabilities, a mixed dataset incorporating interleaved image-text documents and text-only data is pivotal for strong few-shot learning. This finding underscores the importance of a diverse pre-training mixture for cultivating a well-rounded model capable of handling a spectrum of multimodal tasks. Additionally, synthetic caption data, although it constitutes only a small fraction of the overall dataset, was found to significantly boost few-shot performance.
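
In practice, a mixture like this is typically realized by sampling from the component corpora with fixed ratios. Below is a minimal sketch of such a weighted sampler; the source names and weights are placeholders for illustration, not the ratios the paper settles on.

    import random

    # Placeholder mixing weights over the data types discussed above; MM1's actual
    # ratios were chosen via ablation and are not reproduced here.
    MIXTURE = {
        "image_caption": 0.5,
        "interleaved_image_text": 0.4,
        "text_only": 0.1,
    }

    def mixture_iterator(sources, weights=MIXTURE, seed=0):
        """Yield (source_name, example) pairs, picking a source each step according to
        its weight and drawing the next example from that source's cycling iterator."""
        rng = random.Random(seed)
        names = list(weights)
        probs = [weights[n] for n in names]
        while True:
            name = rng.choices(names, weights=probs, k=1)[0]
            yield name, next(sources[name])

Synthetic caption data would enter the same scheme as an additional source with a small weight.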

Training Procedure

MM1 training uses gradient clipping and a learning rate schedule, with experiments conducted across model scales to identify suitable hyperparameters. The inclusion of a mixture-of-experts (MoE) variant further underscores the exploration of efficient scaling mechanisms, demonstrating the efficacy of MoE models in enhancing performance.
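
To make those two ingredients concrete, the sketch below applies a linear-warmup/cosine-decay learning rate schedule and gradient-norm clipping inside a standard PyTorch training step. The peak learning rate, warmup length, total steps, and clip norm are placeholder values, not MM1's hyperparameters.

    import math
    import torch

    def lr_at(step, peak_lr=1e-4, warmup_steps=2000, total_steps=200_000):
        """Linear warmup to peak_lr, then cosine decay to zero (placeholder hyperparameters)."""
        if step < warmup_steps:
            return peak_lr * step / max(1, warmup_steps)
        progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
        return peak_lr * 0.5 * (1.0 + math.cos(math.pi * progress))

    def train_step(model, batch, optimizer, step, clip_norm=1.0):
        for group in optimizer.param_groups:
            group["lr"] = lr_at(step)                 # set the scheduled learning rate for this step
        loss = model(**batch).loss                    # assumes the model returns an object with .loss
        loss.backward()
        torch.nn.utils.clip_grad_norm_(model.parameters(), clip_norm)  # gradient clipping
        optimizer.step()
        optimizer.zero_grad()
        return loss.item()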

Insights and Implications

The MM1 family, through extensive pre-training, showcases remarkable few-shot learning capabilities, enabling it to reason across multiple images and achieve competitive performance on established multimodal benchmarks. The detailed ablations and evaluations provide a roadmap for navigating the multitude of design and data choices integral to the development of performant MLLMs. This comprehensive study not only advances the current understanding of multimodal model training but also lays the groundwork for future explorations in scaling and optimizing these complex systems.
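
Few-shot and chain-of-thought prompting with an MLLM amounts to interleaving demonstration images, questions, and answers with the query image in a single sequence. The helper below sketches one way to assemble such a prompt; the <image> placeholder token and the question/answer format are assumptions made for illustration, not MM1's actual input interface.

    def build_few_shot_prompt(examples, query_image, query_question):
        """examples: list of (image, question, answer) tuples, where answers may include
        chain-of-thought reasoning. Returns (ordered images, prompt text with <image> slots)."""
        images, parts = [], []
        for image, question, answer in examples:
            images.append(image)
            parts.append(f"<image>\nQuestion: {question}\nAnswer: {answer}")
        images.append(query_image)
        parts.append(f"<image>\nQuestion: {query_question}\nAnswer:")
        return images, "\n\n".join(parts)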

Future Directions

The insights garnered from the MM1 models illuminate potential pathways for advancing MLLMs, including further exploration of MoE architectures and refining data mixture strategies to enhance specific capabilities. The nuanced understanding of the relative impacts of architectural decisions and data choices invites continued experimentation to unlock new levels of performance and efficiency in multimodal models.

In conclusion, MM1's exploration into the intricacies of building performant MLLMs offers valuable lessons for the AI research community, paving the way for more sophisticated and capable multimodal intelligent systems.

