Vision-Flan: Scaling Human-Labeled Tasks in Visual Instruction Tuning

(2402.11690)
Published Feb 18, 2024 in cs.CL and cs.CV

Abstract

Despite vision-language models' (VLMs) remarkable capabilities as versatile visual assistants, two substantial challenges persist within existing VLM frameworks: (1) a lack of task diversity in pretraining and visual instruction tuning, and (2) annotation errors and bias in GPT-4 synthesized instruction tuning data. Both challenges lead to issues such as poor generalizability, hallucination, and catastrophic forgetting. To address these challenges, we construct Vision-Flan, the most diverse publicly available visual instruction tuning dataset to date, comprising 187 diverse tasks and 1,664,261 instances sourced from academic datasets, with each task accompanied by an expert-written instruction. In addition, we propose a two-stage instruction tuning framework, in which VLMs are first finetuned on Vision-Flan and further tuned on GPT-4 synthesized data. We find this two-stage tuning framework significantly outperforms the traditional single-stage visual instruction tuning framework and achieves state-of-the-art performance across a wide range of multi-modal evaluation benchmarks. Finally, we conduct in-depth analyses to understand visual instruction tuning, and our findings reveal that: (1) GPT-4 synthesized data does not substantially enhance VLMs' capabilities but rather modulates the model's responses to human-preferred formats; (2) a minimal quantity (e.g., 1,000) of GPT-4 synthesized data can effectively align VLM responses with human preferences; (3) visual instruction tuning mainly helps large language models (LLMs) to understand visual features.

Figure: Example from MM-Vet showing how Vision-Flan improves entity recognition in vision-language models (VLMs).

Overview

  • The paper introduces VISION-FLAN, a diverse visual instruction tuning dataset built to address two challenges in vision-language models (VLMs): limited task diversity and annotation errors in synthesized instruction tuning data.

  • The paper also proposes a two-stage instruction tuning framework: VLMs are first fine-tuned on the VISION-FLAN dataset and then further tuned on a small amount of GPT-4 synthesized data to align responses more closely with human preferences.

  • Experimental evaluations show that fine-tuning on VISION-FLAN improves VLM performance across a wide range of benchmarks, underscoring the pivotal role of visual instruction tuning in building model capabilities.

  • The research underscores the potential for future work in fine-tuning techniques and the development of generalized models, positioning visual instruction tuning as key to advancing visual understanding in language models.

Vision-Flan: Advancements in Visual Instruction Tuning through Human-Labeled Datasets

Introduction to VISION-FLAN

Recent advancements in vision-language models (VLMs) have demonstrated impressive capabilities, acting as potent visual assistants for a myriad of tasks. Yet, these systems have historically grappled with two main challenges: a scarcity of task diversity in their pretraining and instruction tuning phases, and the prevalence of annotation errors and biases in datasets synthesized by models like GPT-4. Addressing these, the paper introduces VISION-FLAN, a novel dataset that emerges as the most diverse publicly available visual instruction tuning dataset to date. Encompassing 187 tasks and over 1.6 million instances sourced from a wide array of academic datasets, with each task accompanied by an expert-written instruction, VISION-FLAN marks a significant stride toward enriching the training landscape of VLMs.
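
This summary does not spell out a record format; as a rough illustration, assuming each instance carries the task name, its expert-written instruction, an image reference, and a target answer, a Vision-Flan-style record and a simple per-task grouping might look like the sketch below (all field and function names are hypothetical):

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, List

@dataclass
class VisionFlanInstance:
    """Illustrative record layout; field names are assumptions, not the official schema."""
    task_name: str     # one of the 187 human-labeled tasks
    instruction: str   # expert-written instruction shared by all instances of the task
    image_path: str    # reference to the image from the source academic dataset
    target: str        # ground-truth answer or label for this instance

def group_by_task(instances: List[VisionFlanInstance]) -> Dict[str, List[VisionFlanInstance]]:
    """Bucket instances by task, e.g., to inspect per-task counts or cap examples per task."""
    buckets: Dict[str, List[VisionFlanInstance]] = defaultdict(list)
    for example in instances:
        buckets[example.task_name].append(example)
    return buckets
```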

Two-Stage Instruction Tuning Framework

The paper pairs VISION-FLAN with a two-stage instruction tuning framework. Initially, VLMs undergo fine-tuning on the VISION-FLAN dataset to acquire a broad spectrum of capabilities; this phase yields the VISION-FLAN BASE model. Because academic dataset outputs tend to be terse and misaligned with user preferences, a subsequent fine-tuning phase on a minimal set of GPT-4 synthesized data is conducted. This sequential method mitigates poor generalizability, hallucination, and catastrophic forgetting, producing a refined model, VISION-FLAN CHAT, that aligns closely with human preferences while requiring considerably less GPT-4 synthesized data than single-stage approaches.
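
A minimal sketch of this two-stage recipe, assuming a generic `finetune(model, dataset)` helper that stands in for an ordinary supervised fine-tuning loop (the helper and argument names are placeholders, not the authors' code; the 1,000-example default echoes the paper's analysis):

```python
from typing import Callable, Sequence, Tuple

def two_stage_instruction_tuning(
    base_vlm,
    vision_flan_data: Sequence,
    gpt4_synth_data: Sequence,
    finetune: Callable,   # placeholder: (model, dataset) -> fine-tuned model
    n_synth: int = 1000,  # the paper reports ~1,000 synthesized examples suffice for alignment
) -> Tuple[object, object]:
    """Sketch of the two-stage recipe: broad capabilities first, then style alignment."""
    # Stage 1: fine-tune on the 187 human-labeled Vision-Flan tasks to acquire
    # diverse capabilities (yields VISION-FLAN BASE).
    vision_flan_base = finetune(base_vlm, vision_flan_data)

    # Stage 2: briefly tune on a small GPT-4 synthesized subset, mainly to shift
    # response format and style toward human preferences (yields VISION-FLAN CHAT).
    synth_subset = list(gpt4_synth_data)[:n_synth]
    vision_flan_chat = finetune(vision_flan_base, synth_subset)

    return vision_flan_base, vision_flan_chat
```

The split keeps the expensive, capability-building pass entirely on human-labeled data, so the synthesized set only needs to be large enough to steer response style.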

Empirical Findings and Analysis

The extensive experimental evaluation demonstrates that models fine-tuned on the VISION-FLAN dataset exhibit superior performance across various multimodal evaluation benchmarks, with the rich array of human-labeled tasks substantially boosting the models' capabilities. Intriguingly, the research reveals that GPT-4 synthesized data does not significantly enhance VLMs' capabilities; instead, it chiefly modulates model responses toward the formats and styles favored by humans, and a small quantity (around 1,000 examples) suffices for this alignment. The analysis further indicates that visual instruction tuning mainly helps the underlying LLM understand visual features, which are themselves largely learned and aligned during the pretraining phase.

Theoretical and Practical Implications

This research holds both theoretical significance in understanding visual instruction tuning's impact on LLMs and practical implications for developing more capable and human-aligned VLMs. The introduction of the VISION-FLAN dataset, coupled with the novel two-stage fine-tuning framework, provides a fertile ground for future inquiries into fine-tuning techniques and the development of generalized models that excel in a broader range of tasks. It positions visual instruction tuning as a critical pivot for advancing the integration of visual understanding within language models, promising enhancements in the utility and applicability of VLMs in real-world scenarios.

Future Directions

The establishment of VISION-FLAN as a diverse visual instruction tuning resource opens avenues for exploring multifaceted instruction tuning strategies, extending beyond the visual domain to incorporate multi-modal and multi-lingual contexts. Future research could delve into refining the synthesis of visual instruction tuning datasets, leveraging advancements in generative models to produce highly diverse, realistic, and human-aligned datasets. As VLMs continue to evolve, the exploration of scalable, efficient fine-tuning mechanisms remains paramount, promising to unveil models with unprecedented versatility and robustness across a spectrum of tasks and domains.
