A Single Transformer for Scalable Vision-Language Modeling

(2407.06438)
Published Jul 8, 2024 in cs.CV, cs.CL, and cs.LG

Abstract

We present SOLO, a single transformer for Scalable visiOn-Language mOdeling. Current large vision-language models (LVLMs) such as LLaVA mostly employ heterogeneous architectures that connect pre-trained visual encoders with LLMs to facilitate visual recognition and complex reasoning. Although such models achieve remarkable performance with relatively lightweight training, we identify four primary scalability limitations: (1) The visual capacity is constrained by pre-trained visual encoders, which are typically an order of magnitude smaller than LLMs. (2) The heterogeneous architecture complicates the use of established hardware and software infrastructure. (3) The study of scaling laws on such architectures must consider three separate components - visual encoder, connector, and LLM - which complicates the analysis. (4) The use of existing visual encoders typically requires following a pre-defined specification for pre-processing image inputs, for example, by reshaping inputs to fixed-resolution square images, which presents difficulties in processing and training on high-resolution images or those with unusual aspect ratios. A unified single-transformer architecture, like SOLO, effectively addresses these scalability concerns in LVLMs; however, its limited adoption in the modern context likely stems from the absence of reliable training recipes that balance both modalities and ensure stable training for billion-scale models. In this paper, we introduce the first open-source training recipe for developing SOLO, an open-source 7B LVLM, using moderate academic resources. The recipe involves initializing from LLMs, sequential pre-training on ImageNet and web-scale data, and instruction fine-tuning on our curated high-quality datasets. On extensive evaluation, SOLO demonstrates performance comparable to LLaVA-v1.5-7B, particularly excelling in visual mathematical reasoning.

A unified transformer processes images and text with special tokens for visual modality encoding.

Overview

  • The paper introduces SOLO, a unified transformer-based vision-language model designed to address scalability and deployment issues present in current models that use smaller, pre-trained visual encoders and heterogeneous architectures.

  • SOLO processes images and text with the same transformer, and the authors provide an open-source training recipe to build and refine the model: staged pre-training on ImageNet-21K and then on web-scale datasets, followed by a transition to high-quality curated datasets.

  • Evaluations demonstrate that SOLO performs on par with comparable LVLMs such as LLaVA-v1.5-7B and offers significant advantages in scalability and adaptability, laying a solid foundation for future scalable models.

Overview and Contribution

The paper introduces "SOLO," a novel large vision-language model (LVLM) employing a single transformer architecture designed to address scalability issues inherent in existing models that rely on pre-trained visual encoders connected with LLMs. The authors identify four core limitations in current LVLMs:

  1. Constrained visual capacity due to smaller pre-trained visual encoders.
  2. Complicated deployment due to heterogeneous architectures.
  3. Complex scaling analysis involving multiple components.
  4. Issues in preprocessing images with fixed resolution requirements, limiting the ability to handle high-resolution or irregularly shaped images effectively.

This work proposes the unified transformer-based SOLO to obviate these limitations by processing both image and text inputs with the same model architecture. A key contribution is the first open-source training recipe for this vision-language modeling approach, covering initialization from an LLM, sequential pre-training, and instruction fine-tuning, all on moderate computational infrastructure (8 x A100 80GB GPUs).

Model Design and Training

The architectural innovation centers on a single transformer initialized from Mistral-7B-v0.1. Images are partitioned into patches that are embedded directly into the transformer's input space, with special tokens marking the visual modality. This design facilitates scalability and ease of deployment by circumventing the constraints of pre-trained visual encoders.
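
To make the idea concrete, here is a minimal sketch of how an image of arbitrary resolution could be patchified and projected into the embedding space of a decoder-only LLM; the class name, patch size, and the <vision>/</vision> boundary markers mentioned in the comments are illustrative assumptions, not SOLO's actual implementation.

```python
# Illustrative sketch (not SOLO's actual code): turning an image of arbitrary
# size into a sequence of patch embeddings that a decoder-only LLM can consume
# alongside text tokens. Names and dimensions are assumptions.
import torch
import torch.nn as nn

PATCH = 32          # assumed patch size
D_MODEL = 4096      # hidden size of 7B-class models such as Mistral-7B

class VisionPatchEmbedder(nn.Module):
    def __init__(self, patch: int = PATCH, d_model: int = D_MODEL):
        super().__init__()
        # One linear layer maps each flattened RGB patch to a "soft token".
        self.proj = nn.Linear(3 * patch * patch, d_model)
        self.patch = patch

    def forward(self, image: torch.Tensor) -> torch.Tensor:
        # image: (3, H, W); H and W need not be equal or fixed,
        # they only need to be divisible by the patch size here.
        c, h, w = image.shape
        patches = (
            image.unfold(1, self.patch, self.patch)   # split height
                 .unfold(2, self.patch, self.patch)   # split width
                 .permute(1, 2, 0, 3, 4)              # (nH, nW, C, p, p)
                 .reshape(-1, c * self.patch * self.patch)
        )
        return self.proj(patches)                     # (num_patches, d_model)

# Usage: embed the patches, then splice them between special boundary tokens
# (<vision> ... </vision> are hypothetical markers) inside the text sequence.
embedder = VisionPatchEmbedder()
img = torch.rand(3, 224, 448)        # non-square image is fine
vision_embeds = embedder(img)        # (7 * 14, 4096) soft tokens
```

Because the patch grid adapts to the image dimensions, this kind of tokenization sidesteps the fixed-resolution, square-crop preprocessing that pre-trained visual encoders typically impose.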

The training recipe spans three stages:

  1. Stage-1: Pre-training on ImageNet-21K to build foundational visual representations.
  2. Stage-2: Leveraging web-scale datasets for broader knowledge and data volume enhancements.
  3. Stage-3: Annealing to smoothly transition from noisy web data to high-quality curated datasets.

Validation studies confirm that without the initial stage of ImageNet pre-training, models generate meaningless captions despite achieving comparable vision-language modeling loss. This underscores the necessity of a carefully phased training approach.
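
To make the phased recipe concrete, the sketch below lays the three stages out as a simple configuration driving a sequential training loop; the dataset identifiers, mixture weights, step counts, and learning rates are placeholders chosen for illustration, not the paper's hyperparameters.

```python
# Minimal sketch of a three-stage schedule (all values are illustrative
# placeholders, not the paper's settings).
STAGES = [
    {   # Stage-1: learn basic visual representations from ImageNet-21K
        "name": "stage1_imagenet",
        "datasets": {"imagenet21k_captions": 1.0},
        "max_steps": 20_000,
        "lr": 1e-4,
    },
    {   # Stage-2: scale up on noisy web-scale image-text pairs plus text
        "name": "stage2_webscale",
        "datasets": {"web_image_text": 0.7, "text_only": 0.3},
        "max_steps": 100_000,
        "lr": 5e-5,
    },
    {   # Stage-3: anneal onto high-quality curated data before instruction tuning
        "name": "stage3_anneal",
        "datasets": {"curated_high_quality": 1.0},
        "max_steps": 10_000,
        "lr": 1e-5,   # decayed toward zero during annealing
    },
]

def run_training(model, make_loader, train_one_stage):
    # Each stage resumes from the previous stage's weights, mirroring the
    # sequential recipe described above.
    for cfg in STAGES:
        loader = make_loader(cfg["datasets"])
        train_one_stage(model, loader, steps=cfg["max_steps"], lr=cfg["lr"])
```

The point of the staged structure is simply that the model inherits its weights from the previous stage while the data mixture shifts from clean but narrow (ImageNet) to broad but noisy (web) to small but high quality (curated).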

Evaluation

Extensive evaluations compare SOLO with existing LVLMs across several benchmarks, including MMStar, MME, and SEED-Bench, as well as specialized datasets such as AI2D and MathVista. SOLO performs on par with LLaVA-v1.5-7B, particularly excelling in visual mathematical reasoning. Although it still trails state-of-the-art (SoTA) LVLMs, SOLO shows substantial advantages in scalability and adaptability, marking it as a strong foundation for future development.

Implications and Future Directions

The simplification offered by the unified transformer architecture points to a promising direction for future scalable AI models. By addressing the current limitations of pre-trained visual encoders, SOLO demonstrates that a unified transformer approach can maintain competitive performance while enabling more straightforward scaling, training, and deployment.

Conclusion

This work signifies a notable shift in vision-language modeling, presenting a scalable, unified transformer-based framework as a viable alternative to models reliant on pre-trained encoders. The extensive analysis and reproducible training recipe provided offer a strong foundation for future research and practical applications in scalable vision-language modeling. As this field advances, the approach and insights detailed in this paper are poised to play a critical role in shaping the next generation of AI systems.
