
VILA: On Pre-training for Visual Language Models

(2312.07533)
Published Dec 12, 2023 in cs.CV

Abstract

Visual language models (VLMs) have rapidly progressed with the recent success of LLMs. There have been growing efforts on visual instruction tuning to extend the LLM with visual inputs, but these efforts lack an in-depth study of the visual language pre-training process, where the model learns to perform joint modeling on both modalities. In this work, we examine the design options for VLM pre-training by augmenting an LLM towards a VLM through step-by-step controllable comparisons. We introduce three main findings: (1) freezing the LLM during pre-training can achieve decent zero-shot performance, but lacks in-context learning capability, which requires unfreezing the LLM; (2) interleaved pre-training data is beneficial whereas image-text pairs alone are not optimal; (3) re-blending text-only instruction data with image-text data during instruction fine-tuning not only remedies the degradation on text-only tasks, but also boosts VLM task accuracy. With an enhanced pre-training recipe we build VILA, a Visual Language model family that consistently outperforms state-of-the-art models, e.g., LLaVA-1.5, across main benchmarks without bells and whistles. Multi-modal pre-training also helps unveil appealing properties of VILA, including multi-image reasoning, enhanced in-context learning, and better world knowledge.

Overview

  • This paper explores the process of visual language pre-training, which augments LLMs with visual input capabilities to create visual language models (VLMs).

  • The authors identify key pre-training factors, such as not freezing the LLM, using interleaved pre-training data, and incorporating text-only instruction data in supervised fine-tuning.

  • The paper introduces VILA, a family of visual language models built with this enhanced pre-training recipe, which outperforms existing models such as LLaVA-1.5 on several benchmarks and showcases multi-image reasoning and robust in-context learning.

  • VILA is trained using a multi-stage approach, starting with projector initialization, then pre-training on a visual language corpus, and fine-tuning with visual instruction datasets.

  • The study concludes that the insights gained from examining the pre-training process can guide the creation of more effective VLMs, and suggests future research directions for enhancing VLM performance.

Visual Language Model Pre-training

Introduction and Context

Recent AI research has made considerable progress by extending LLMs to incorporate visual inputs, creating visual language models (VLMs). These models have shown promising results in comprehending and generating content that combines text and visual information, a capability broadly referred to as multimodal learning. A critical component in the development of VLMs is the pre-training stage, where a model is trained on a large dataset that includes both text and images. However, the specifics of augmenting an LLM with visual capabilities, known as visual language pre-training, have not been deeply explored. This work aims to fill that gap by examining various design options for visual language pre-training.

Pre-training Factors and Findings

The study identifies three key findings from the augmentation process. Firstly, while freezing the LLM during pre-training can produce acceptable results on zero-shot tasks (where the model makes predictions without seeing similar examples), it falls short on tasks that require in-context learning; here, unfreezing and updating the LLM proves crucial. Secondly, interleaved pre-training data, in which text segments are interspersed with images within the same document, offers substantial benefits: it provides more precise gradient updates and helps maintain text-only capabilities. Lastly, re-blending text-only instruction data with image-text data during supervised fine-tuning (SFT) not only remedies the degradation on text-only tasks but also improves accuracy on visual language tasks. These insights are central to designing pre-training recipes for future VLMs.
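To make findings (2) and (3) concrete, the Python sketch below shows, under stated assumptions, how an interleaved document could be flattened into a training sequence and how text-only instruction data could be re-blended into the SFT mixture. The IMAGE_TOKEN placeholder, the segment schema, and the text_ratio default are illustrative choices, not details from the paper.

```python
import random

# Hypothetical placeholder token marking where image features are spliced into
# the text stream; the real token and format are model-specific and not given here.
IMAGE_TOKEN = "<image>"

def build_interleaved_sequence(segments):
    """Flatten one interleaved document (finding 2) into a training sequence.

    `segments` is assumed to be a list of dicts like {"type": "text", "text": ...}
    or {"type": "image", ...}; text keeps its order and each image becomes an
    IMAGE_TOKEN placeholder that the projector later fills with visual features.
    """
    parts = []
    for seg in segments:
        if seg["type"] == "text":
            parts.append(seg["text"])
        else:
            parts.append(IMAGE_TOKEN)
    return " ".join(parts)

def blend_sft_data(image_text_examples, text_only_examples, text_ratio=0.2):
    """Re-blend text-only instruction data into the visual SFT mix (finding 3).

    `text_ratio` is an illustrative knob, not a value reported in the paper.
    """
    n_text = min(len(text_only_examples),
                 int(len(image_text_examples) * text_ratio))
    mixed = list(image_text_examples) + random.sample(text_only_examples, n_text)
    random.shuffle(mixed)
    return mixed
```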

Training Strategies and Outcomes

With this pre-training recipe, the resulting model family, VILA, consistently surpasses state-of-the-art models such as LLaVA-1.5 across various benchmarks. Moreover, VILA showcases additional capabilities, such as multi-image reasoning and robust in-context learning, even when presented with inputs it has not been explicitly trained on.
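As an illustration of how such in-context queries can be posed, a few-shot multi-image prompt might interleave solved examples with a final query. The "<image>" placeholder and the wording below are hypothetical, not taken from the paper; the actual token format depends on the model's tokenizer.

```python
# Hypothetical few-shot, multi-image prompt: two solved examples followed by a
# query image whose answer the model should complete in context.
few_shot_prompt = (
    "<image> Question: What is unusual about this image? "
    "Answer: The dog is riding a skateboard.\n"
    "<image> Question: What is unusual about this image? "
    "Answer: The cat is wearing sunglasses.\n"
    "<image> Question: What is unusual about this image? Answer:"
)
```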

Model Training and Evaluation

VILA is trained in multiple stages, starting with projector initialization and followed by pre-training on visual language corpora. It is then fine-tuned on visual instruction datasets with dataset-specific prompts. The evaluations cover a variety of visual language tasks in both zero-shot and few-shot settings, the latter reflecting the model's in-context learning capabilities.
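A minimal sketch of that staged schedule is shown below, capturing only which components are frozen or trained at each stage. It assumes a PyTorch-style model with vision_encoder, projector, and llm submodules and a hypothetical run_stage helper that owns the data loader, optimizer, and loss; none of these names come from the paper's code.

```python
def train_vila_stages(model, projector_data, interleaved_corpus, sft_mixture, run_stage):
    """Staged training schedule; only the freeze/unfreeze pattern is sketched."""
    def set_requires_grad(module, flag):
        for p in module.parameters():
            p.requires_grad = flag

    # Stage 0: projector initialization. Vision encoder and LLM stay frozen;
    # only the projector learns to map visual features into the LLM space.
    set_requires_grad(model.vision_encoder, False)
    set_requires_grad(model.llm, False)
    set_requires_grad(model.projector, True)
    run_stage(model, projector_data)

    # Stage 1: visual language pre-training on interleaved corpora with the
    # LLM unfrozen, which the study finds is needed for in-context learning.
    set_requires_grad(model.llm, True)
    run_stage(model, interleaved_corpus)

    # Stage 2: supervised fine-tuning on visual instruction data blended with
    # text-only instruction data.
    run_stage(model, sft_mixture)
```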

Conclusion and Future Considerations

The findings from this study offer a clear pathway toward more effective VLMs by identifying the crucial aspects of the visual language pre-training process. The resulting VILA model shows improved performance across numerous visual language tasks without compromising its text-only abilities. Future work could build on these findings by exploring additional pre-training datasets, optimizing training throughput, and scaling up the pre-training corpus.
