A Single Transformer for Scalable Vision-Language Modeling

(2407.06438)
Published Jul 8, 2024 in cs.CV, cs.CL, and cs.LG

Abstract

We present SOLO, a single transformer for Scalable visiOn-Language mOdeling. Current large vision-language models (LVLMs) such as LLaVA mostly employ heterogeneous architectures that connect pre-trained visual encoders with LLMs to facilitate visual recognition and complex reasoning. Although such models achieve remarkable performance with relatively lightweight training, we identify four primary scalability limitations: (1) The visual capacity is constrained by pre-trained visual encoders, which are typically an order of magnitude smaller than LLMs. (2) The heterogeneous architecture complicates the use of established hardware and software infrastructure. (3) The study of scaling laws on such architectures must consider three separate components - visual encoder, connector, and LLM - which complicates the analysis. (4) The use of existing visual encoders typically requires following a pre-defined specification for pre-processing image inputs, for example, by reshaping inputs to fixed-resolution square images, which presents difficulties in processing and training on high-resolution images or those with unusual aspect ratios. A unified single-transformer architecture, like SOLO, effectively addresses these scalability concerns in LVLMs; however, its limited adoption in the modern context likely stems from the absence of reliable training recipes that balance both modalities and ensure stable training for billion-scale models. In this paper, we introduce the first open-source training recipe for developing SOLO, an open-source 7B LVLM, using moderate academic resources. The recipe involves initializing from LLMs, sequential pre-training on ImageNet and web-scale data, and instruction fine-tuning on our curated high-quality datasets. On extensive evaluation, SOLO demonstrates performance comparable to LLaVA-v1.5-7B, particularly excelling in visual mathematical reasoning.

A unified transformer processes images and text with special tokens for visual modality encoding.

Overview

  • The paper introduces SOLO, a unified transformer-based vision-language model designed to address scalability and deployment issues present in current models that use smaller, pre-trained visual encoders and heterogeneous architectures.

  • SOLO processes images and text with the same transformer, and the authors provide an open-source training recipe to build and refine the model: staged pre-training on ImageNet-21K and then on web-scale datasets, followed by a transition to high-quality curated datasets.

  • Evaluations demonstrate that SOLO performs on par with comparable LVLMs such as LLaVA-v1.5-7B and offers significant advantages in scalability and adaptability, laying a solid foundation for future scalable models.

Overview and Contribution

The paper introduces "SOLO," a novel large vision-language model (LVLM) employing a single transformer architecture designed to address scalability issues inherent in existing models that rely on pre-trained visual encoders connected with LLMs. The authors identify four core limitations in current LVLMs:

  1. Constrained visual capacity due to smaller pre-trained visual encoders.
  2. Complicated deployment due to heterogeneous architectures.
  3. Complex scaling analysis involving multiple components.
  4. Issues in preprocessing images with fixed resolution requirements, limiting the ability to handle high-resolution or irregularly shaped images effectively.

This work proposes the unified transformer-based SOLO to obviate these limitations by processing both image and text inputs with the same model architecture. A key contribution is the first open-source training recipe for this vision-language modeling approach, covering initialization from an LLM, sequential pre-training, and instruction fine-tuning, all on moderate computational infrastructure (8 x A100 80GB GPUs).

Model Design and Training

The architectural innovation centers on a single transformer initialized from Mistral-7B-v0.1. Images are partitioned into patches that are embedded directly into the transformer's input space, with special tokens marking the visual modality. This design facilitates scalability and ease of deployment by circumventing the constraints of pre-trained visual encoders.
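
To make the idea concrete, here is a minimal sketch of how an image of arbitrary resolution could be patchified and projected into the embedding space of a decoder-only LLM; the class name, patch size, and the <vision>/</vision> boundary markers mentioned in the comments are illustrative assumptions, not SOLO's actual implementation.

```python
# Illustrative sketch (not SOLO's actual code): turning an image of arbitrary
# size into a sequence of patch embeddings that a decoder-only LLM can consume
# alongside text tokens. Names and dimensions are assumptions.
import torch
import torch.nn as nn

PATCH = 32          # assumed patch size
D_MODEL = 4096      # hidden size of 7B-class models such as Mistral-7B

class VisionPatchEmbedder(nn.Module):
    def __init__(self, patch: int = PATCH, d_model: int = D_MODEL):
        super().__init__()
        # One linear layer maps each flattened RGB patch to a "soft token".
        self.proj = nn.Linear(3 * patch * patch, d_model)
        self.patch = patch

    def forward(self, image: torch.Tensor) -> torch.Tensor:
        # image: (3, H, W); H and W need not be equal or fixed,
        # they only need to be divisible by the patch size here.
        c, h, w = image.shape
        patches = (
            image.unfold(1, self.patch, self.patch)   # split height
                 .unfold(2, self.patch, self.patch)   # split width
                 .permute(1, 2, 0, 3, 4)              # (nH, nW, C, p, p)
                 .reshape(-1, c * self.patch * self.patch)
        )
        return self.proj(patches)                     # (num_patches, d_model)

# Usage: embed the patches, then splice them between special boundary tokens
# (<vision> ... </vision> are hypothetical markers) inside the text sequence.
embedder = VisionPatchEmbedder()
img = torch.rand(3, 224, 448)        # non-square image is fine
vision_embeds = embedder(img)        # (7 * 14, 4096) soft tokens
```

Because the patch grid adapts to the image dimensions, this kind of tokenization sidesteps the fixed-resolution, square-crop preprocessing that pre-trained visual encoders typically impose.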

The training recipe spans three stages:

  1. Stage-1: Pre-training on ImageNet-21K to build foundational visual representations.
  2. Stage-2: Leveraging web-scale datasets for broader knowledge and data volume enhancements.
  3. Stage-3: Annealing to smoothly transition from noisy web data to high-quality curated datasets.

Validation studies confirm that without the initial stage of ImageNet pre-training, models generate meaningless captions despite achieving comparable vision-language modeling loss. This underscores the necessity of a carefully phased training approach.
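
To make the phased recipe concrete, the sketch below lays the three stages out as a simple configuration driving a sequential training loop; the dataset identifiers, mixture weights, step counts, and learning rates are placeholders chosen for illustration, not the paper's hyperparameters.

```python
# Minimal sketch of a three-stage schedule (all values are illustrative
# placeholders, not the paper's settings).
STAGES = [
    {   # Stage-1: learn basic visual representations from ImageNet-21K
        "name": "stage1_imagenet",
        "datasets": {"imagenet21k_captions": 1.0},
        "max_steps": 20_000,
        "lr": 1e-4,
    },
    {   # Stage-2: scale up on noisy web-scale image-text pairs plus text
        "name": "stage2_webscale",
        "datasets": {"web_image_text": 0.7, "text_only": 0.3},
        "max_steps": 100_000,
        "lr": 5e-5,
    },
    {   # Stage-3: anneal onto high-quality curated data before instruction tuning
        "name": "stage3_anneal",
        "datasets": {"curated_high_quality": 1.0},
        "max_steps": 10_000,
        "lr": 1e-5,   # decayed toward zero during annealing
    },
]

def run_training(model, make_loader, train_one_stage):
    # Each stage resumes from the previous stage's weights, mirroring the
    # sequential recipe described above.
    for cfg in STAGES:
        loader = make_loader(cfg["datasets"])
        train_one_stage(model, loader, steps=cfg["max_steps"], lr=cfg["lr"])
```

The point of the staged structure is simply that the model inherits its weights from the previous stage while the data mixture shifts from clean but narrow (ImageNet) to broad but noisy (web) to small but high quality (curated).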

Evaluation

Extensive evaluations compare SOLO with existing LVLMs across several benchmarks, including MMStar, MME, and SEED-Bench, as well as specialized datasets such as AI2D and MathVista. SOLO performs on par with LLaVA-v1.5-7B, particularly excelling in visual mathematical reasoning. Although it still trails state-of-the-art (SoTA) LVLMs, SOLO shows substantial advantages in scalability and adaptability, marking it as a strong foundation for future development.

Implications and Future Directions

The simplification offered by the unified transformer architecture points to a promising direction for future scalable AI models. By addressing the current limitations of pre-trained visual encoders, SOLO demonstrates that a unified transformer approach can maintain competitive performance while enabling more straightforward scaling, training, and deployment.

Conclusion

This work signifies a notable shift in vision-language modeling, presenting a scalable, unified transformer-based framework as a viable alternative to models reliant on pre-trained encoders. The extensive analysis and reproducible training recipe provided offer a strong foundation for future research and practical applications in scalable vision-language modeling. As this field advances, the approach and insights detailed in this paper are poised to play a critical role in shaping the next generation of AI systems.
