Enhancing Large Vision Language Models with Self-Training on Image Comprehension

(2405.19716)
Published May 30, 2024 in cs.CV and cs.CL

Abstract

Large vision language models (LVLMs) integrate LLMs with pre-trained vision encoders, thereby activating the perception capability of the model to understand image inputs for different queries and conduct subsequent reasoning. Improving this capability requires high-quality vision-language data, which is costly and labor-intensive to acquire. Self-training approaches have been effective in single-modal settings to alleviate the need for labeled data by leveraging the model's own generation. However, effective self-training remains a challenge regarding the unique visual perception and reasoning capability of LVLMs. To address this, we introduce Self-Training on Image Comprehension (STIC), which emphasizes a self-training approach specifically for image comprehension. First, the model self-constructs a preference dataset for image descriptions using unlabeled images. Preferred responses are generated through a step-by-step prompt, while dis-preferred responses are generated from either corrupted images or misleading prompts. To further self-improve reasoning on the extracted visual information, we let the model reuse a small portion of existing instruction-tuning data and append its self-generated image descriptions to the prompts. We validate the effectiveness of STIC across seven different benchmarks, demonstrating substantial performance gains of 4.0% on average while using 70% less supervised fine-tuning data than the current method. Further studies investigate various components of STIC and highlight its potential to leverage vast quantities of unlabeled images for self-training. Code and data are made publicly available.

Framework overview of STIC, a two-stage algorithm enhancing image comprehension in LVLMs through self-training.

Overview

  • The paper proposes Self-Training on Image Comprehension (STIC), a novel approach to enhancing large vision language models (LVLMs) by leveraging self-generated data instead of extensive supervised data.

  • It employs a two-stage self-training process: the first stage focuses on constructing a preference dataset from unlabeled images, and the second stage integrates these self-generated descriptions into existing instruction-tuning data.

  • Experimental results demonstrate significant performance gains, notably a 4.0% average improvement, while reducing the need for supervised fine-tuning data by 70%.

Self-Training on Image Comprehension for Large Vision Language Models

The paper introduces a novel approach named Self-Training on Image Comprehension (STIC) aimed at improving the image comprehension capabilities of large vision language models (LVLMs). The primary objective is to address the high cost and labor-intensive effort required to procure the high-quality vision-language data needed to train these models. The method leverages self-generated data, bypassing the need for the extensive supervised data used in conventional approaches.

Overview of the Approach

The paper details a two-stage self-training algorithm specifically designed for LVLMs.

  1. Stage 1: Image Comprehension Self-Training:

    • The model autonomously constructs a preference dataset for image descriptions from unlabeled images: preferred responses are generated through a step-by-step (explicit reasoning) prompt, while dispreferred responses are generated from either corrupted images or misleading ("bad") prompts (see the first sketch after this list).
    • The LVLM is subsequently fine-tuned on this self-constructed preference dataset using a modified Direct Preference Optimization (DPO) objective with an added regularization term that emphasizes the preferred responses.
  2. Stage 2: Description-Infused Fine-Tuning:

    • The second stage focuses on reinforcing the model's reasoning abilities by integrating its self-generated image descriptions into existing instruction-tuning data.
    • This stage reuses a small portion of the data from the supervised fine-tuning (SFT) stage, infuses the prompts with model-generated image descriptions, and further fine-tunes the LVLM (a data-construction sketch follows the list below).
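To make the first stage concrete, below is a minimal Python sketch of how the self-constructed preference pairs and a regularized DPO objective could look. The helper `lvlm_generate(image, prompt)`, the prompt texts, the corruption choices, and the weighting `alpha` are illustrative assumptions, not the paper's exact implementation.

```python
# Minimal sketch of STIC Stage 1 (illustrative; not the authors' code).
# `lvlm_generate(image, prompt)` is an assumed helper that queries the LVLM.
import random
from PIL import Image, ImageFilter
import torch.nn.functional as F

STEP_BY_STEP_PROMPT = (
    "Describe the image step by step: the overall scene first, "
    "then the salient objects, their attributes, and their relations."
)
MISLEADING_PROMPTS = [
    "Describe the image, including objects that are not actually present.",
    "Give a vague one-sentence description that ignores most of the details.",
]

def corrupt(image: Image.Image) -> Image.Image:
    """Degrade the visual input, e.g. heavy blur or aggressive downsampling."""
    if random.random() < 0.5:
        return image.filter(ImageFilter.GaussianBlur(radius=8))
    small = image.resize((max(1, image.width // 8), max(1, image.height // 8)))
    return small.resize(image.size)

def build_preference_pair(image, lvlm_generate):
    """Return (preferred, dispreferred) self-generated image descriptions."""
    preferred = lvlm_generate(image, STEP_BY_STEP_PROMPT)
    if random.random() < 0.5:
        # Dispreferred response from a corrupted image.
        dispreferred = lvlm_generate(corrupt(image), STEP_BY_STEP_PROMPT)
    else:
        # Dispreferred response from a misleading ("bad") prompt.
        dispreferred = lvlm_generate(image, random.choice(MISLEADING_PROMPTS))
    return preferred, dispreferred

def regularized_dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l,
                         beta=0.1, alpha=1.0):
    """Standard DPO loss plus an extra term emphasizing the preferred response.

    Inputs are per-sequence log-probabilities (tensors) of the preferred (w)
    and dispreferred (l) descriptions under the policy and frozen reference
    model. The exact regularizer and weights in STIC may differ; this is the
    common DPO form with an illustrative NLL-style term on the preferred side.
    """
    margin = (logp_w - logp_l) - (ref_logp_w - ref_logp_l)
    dpo = -F.logsigmoid(beta * margin)
    reg = -alpha * logp_w  # push up the likelihood of preferred responses
    return (dpo + reg).mean()
```

In such a pipeline, the summed sequence log-probabilities of each preference pair under the fine-tuned model and the frozen reference model would be plugged into `regularized_dpo_loss` during Stage 1 training.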
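The second stage's data construction can be sketched similarly. The snippet below reuses `STEP_BY_STEP_PROMPT` and `lvlm_generate` from the previous sketch and assumes each SFT example is a dict with `image`, `prompt`, and `response` fields; the field names and the prompt template are assumptions.

```python
def description_infuse(sft_example, lvlm_generate):
    """Prepend the model's self-generated description to an instruction prompt."""
    description = lvlm_generate(sft_example["image"], STEP_BY_STEP_PROMPT)
    infused_prompt = (
        "Image description: " + description + "\n\n" + sft_example["prompt"]
    )
    return {**sft_example, "prompt": infused_prompt}

def build_stage2_dataset(sft_subset, lvlm_generate):
    """Reuse a small portion of the SFT data with description-infused prompts."""
    return [description_infuse(ex, lvlm_generate) for ex in sft_subset]
```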

Experimental Validation

The experimental results validate the efficacy of the proposed STIC approach across seven benchmarks.

STIC results in substantial performance gains, achieving an average improvement of 4.0% over the baseline methods while utilizing 70% less supervised fine-tuning data. Notably, on the ScienceQA dataset, a substantial gain of 6.4% was observed. These robust numerical results underscore the effectiveness of leveraging vast quantities of unlabeled images for self-training, highlighting STIC's potential to reduce the dependency on costly, annotated datasets.

Discussion and Implications

Comparison to Existing Methods

Extensive comparisons were drawn with existing vision-language fine-tuning methodologies such as POVID. While POVID relies on manually injected hallucinations using labeled object information, STIC diverges by exclusively using unlabeled images and self-generated descriptions. This automatic generation process not only simplifies the data preparation stage but also demonstrates superior performance gains.

Contribution of Key Components

Through ablation studies, the paper underscores the significance of various components within STIC. One such study removes the dispreferred responses and observes a performance degradation, establishing the crucial role of negative samples in preference fine-tuning. Additionally, the "describe-and-respond" (DaR) prompting method further enhances STIC's performance, reaffirming the benefit of integrating self-generated descriptions.
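As a rough illustration of describe-and-respond style prompting at inference time, the model can first be asked for its own description of the image, which is then prepended to the actual question. The two-turn structure and prompt wording below are assumptions, not the paper's exact template.

```python
def describe_and_respond(image, question, lvlm_generate):
    """Two-step inference: self-describe the image, then answer with that context."""
    # Step 1: ask the model to describe the image in its own words.
    description = lvlm_generate(image, "Describe this image in detail.")
    # Step 2: answer the question, conditioning on the self-generated description.
    final_prompt = (
        "Image description: " + description + "\n\n"
        "Using the image and the description above, answer: " + question
    )
    return lvlm_generate(image, final_prompt)
```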

Scalability and Generalization

Another salient point is the scaling behavior observed with STIC. Increasing the amount of generated preference data from 6k to 12k yielded further performance improvements, suggesting that STIC can effectively harness larger datasets. Furthermore, a t-SNE visualization analysis substantiates the correlation between image distribution overlap and performance gains, providing deeper insight into how STIC benefits tasks whose image distributions resemble that of the self-generated dataset.
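One way to reproduce this kind of distribution analysis is to embed the self-training images and each benchmark's images with a shared vision encoder and project them jointly with t-SNE. The use of CLIP features and scikit-learn below is an assumption about tooling, not the paper's exact setup.

```python
# Hypothetical distribution-overlap check with CLIP features and t-SNE.
import numpy as np
import torch
from sklearn.manifold import TSNE
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def embed(images):
    """Return L2-normalized CLIP image embeddings for a list of PIL images."""
    inputs = processor(images=images, return_tensors="pt")
    with torch.no_grad():
        feats = model.get_image_features(**inputs)
    return torch.nn.functional.normalize(feats, dim=-1).numpy()

def tsne_projection(self_train_images, benchmark_images):
    """Project both image sets into 2D for a qualitative overlap comparison."""
    feats = np.concatenate([embed(self_train_images), embed(benchmark_images)])
    coords = TSNE(n_components=2, init="pca", random_state=0).fit_transform(feats)
    n = len(self_train_images)
    return coords[:n], coords[n:]  # plot the two sets in different colors
```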

Future Direction

Given the promising results, future work could explore more diverse and larger-scale datasets to expand the generalizability of STIC. The paper's implications extend to cost-effective training paradigms for LVLMs, potentially reshaping approaches to acquiring and utilizing vision-language data. Moreover, further refinement of the self-training steps and the incorporation of more sophisticated negative sampling techniques could yield even more significant advancements in LVLM capabilities.

Conclusion

In summary, the proposed STIC framework represents a significant advancement in self-training methodologies for LVLMs. By efficiently leveraging self-generated data and reducing reliance on extensive supervised fine-tuning datasets, STIC achieves substantial performance improvements across various benchmarks. This work exemplifies a substantial stride toward cost-effective and scalable training paradigms in the realm of vision language models, with promising avenues for future exploration and enhancement.
