Large multimodal models (LMM) have recently shown encouraging progress with visual instruction tuning. In this note, we show that the fully-connected vision-language cross-modal connector in LLaVA is surprisingly powerful and data-efficient. With simple modifications to LLaVA, namely, using CLIP-ViT-L-336px with an MLP projection and adding academic-task-oriented VQA data with simple response formatting prompts, we establish stronger baselines that achieve state-of-the-art across 11 benchmarks. Our final 13B checkpoint uses merely 1.2M publicly available data, and finishes full training in ~1 day on a single 8-A100 node. We hope this can make state-of-the-art LMM research more accessible. Code and model will be publicly available.
The paper strengthens Large Multimodal Models (LMMs) for visual instruction tuning by applying simple, targeted modifications to the LLaVA framework.
Key contributions include the adoption of CLIP-ViT-L-336px with an MLP projection and the incorporation of academic-task-oriented VQA data with specific response formatting prompts, achieving state-of-the-art results with a modest dataset and limited computational resources.
The study suggests that simpler architectures and minimalistic training approaches can outperform complex models, proposing future research into visual resamplers, multimodal data integration, and computational efficiency optimizations.
The paper entitled "Improved Baselines with Visual Instruction Tuning," authored by Haotian Liu et al., presents innovative approaches to enhancing the efficiency and effectiveness of Large Multimodal Models (LMMs) in visual instruction tuning. The study focuses on the LLaVA framework, providing substantial evidence that simple yet strategic modifications can significantly improve these models' performance across a variety of benchmarks using publicly available data.
The primary contributions of the paper are two-fold: showcasing the efficacy of the fully-connected vision-language connector in LLaVA and proposing enhancements that streamline the architecture's training and improve its performance. The authors' modifications include:

- Replacing the original linear projection with a two-layer MLP vision-language connector.
- Upgrading the vision encoder to CLIP-ViT-L-336px for higher input resolution.
- Adding academic-task-oriented VQA data with simple response formatting prompts that specify the desired answer format.
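The MLP connector mentioned above can be sketched as a plain two-layer MLP mapping vision-token features into the LLM's embedding space. The snippet below is a toy illustration, not the authors' implementation: the dimensions, random weights, and GELU activation are assumptions (LLaVA-1.5's actual projector maps roughly 1024-dimensional CLIP features to the LLM's hidden size, e.g. 5120 for a 13B model).

```python
import numpy as np

# Toy dimensions for illustration; the real projector is far larger
# (assumed ~1024 -> ~5120 for the 13B model).
VISION_DIM, LLM_DIM = 8, 16

rng = np.random.default_rng(0)
W1 = rng.standard_normal((VISION_DIM, LLM_DIM)) * 0.02  # first linear layer
b1 = np.zeros(LLM_DIM)
W2 = rng.standard_normal((LLM_DIM, LLM_DIM)) * 0.02     # second linear layer
b2 = np.zeros(LLM_DIM)

def gelu(x):
    # tanh approximation of the GELU activation
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x**3)))

def project(vision_features):
    """Two-layer MLP: vision token features -> LLM embedding space."""
    return gelu(vision_features @ W1 + b1) @ W2 + b2

tokens = rng.standard_normal((4, VISION_DIM))  # 4 toy vision tokens
print(project(tokens).shape)  # (4, 16)
```

The design point the paper makes is that this kind of simple fully-connected mapping, rather than a learned resampler, already suffices for strong cross-modal alignment.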
The experimental setup used a relatively modest training corpus of 1.2 million publicly available samples, achieving state-of-the-art results across 11 benchmarks. Notably, the training process required approximately one day on a single 8-A100 node, highlighting the method's computational efficiency. A results table in the original paper presents a comprehensive comparison, showing the model's superior performance relative to other contemporaneous methods such as InstructBLIP and Qwen-VL.
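The response formatting prompts mentioned earlier can be illustrated with a small helper that appends a format hint to a VQA question, nudging the model toward benchmark-style short answers. The exact instruction string below is an assumption based on the paper's description of "single word or phrase" formatting prompts; `format_vqa_prompt` is a hypothetical helper, not part of the released code.

```python
# Assumed format hint; the released prompts may differ in wording.
SHORT_ANSWER_HINT = "Answer the question using a single word or phrase."

def format_vqa_prompt(question: str, short_answer: bool = True) -> str:
    """Append a response-format hint so the model emits short, gradable answers."""
    return f"{question}\n{SHORT_ANSWER_HINT}" if short_answer else question

print(format_vqa_prompt("What color is the bus?"))
# What color is the bus?
# Answer the question using a single word or phrase.
```

Keeping the hint out of the question itself lets the same training data serve both free-form dialogue and short-answer benchmarks by toggling a single flag.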
The results underscore the utility of visual instruction tuning over extensive pretraining. Despite its minimalistic architecture, LLaVA-1.5, enhanced with simple modifications, significantly outperformed models that employ intricate resamplers and extensive pretraining. The findings raise questions about the necessity and efficiency of large-scale datasets and sophisticated visual resamplers, suggesting that simpler architectures might suffice for state-of-the-art LMM performance.
The study opens several avenues for future research, including:

- Exploring visual resamplers as alternatives to the fully-connected connector.
- Integrating additional sources of multimodal data.
- Further optimizing the computational efficiency of training and inference.
The paper "Improved Baselines with Visual Instruction Tuning" demonstrates that substantial gains in LMM capabilities can be achieved through simple, strategic modifications to existing frameworks. The enhanced LLaVA-1.5 model's performance across multiple benchmarks suggests a promising direction for multimodal research that emphasizes efficiency and accessibility without compromising performance. The study prompts a re-evaluation of current LMM training paradigms and paves the way for further innovations in visual instruction tuning.