SVIT: Scaling up Visual Instruction Tuning

Published 9 Jul 2023 in cs.CV | (2307.04087v3)

Abstract: Thanks to the emergence of foundation models, large language and vision models are integrated to acquire multimodal abilities such as visual captioning and question answering. Although existing multimodal models present impressive performance in visual understanding and reasoning, their limits are still largely under-explored due to the scarcity of high-quality instruction tuning data. To push the limits of multimodal capability, we Scale up Visual Instruction Tuning (SVIT) by constructing a dataset of 4.2 million visual instruction tuning data including 1.6M conversation question-answer (QA) pairs, 1.6M complex reasoning QA pairs, 1.0M referring QA pairs and 106K detailed image descriptions. Besides the volume, the proposed dataset also features high quality and rich diversity, as it is generated by prompting GPT-4 with the abundant manual annotations of images. We also propose a new data recipe to select a subset with better diversity and balance, which evokes the model's superior capabilities. Extensive experiments verify that SVIT-v1.5, trained on the proposed dataset, outperforms state-of-the-art Multimodal LLMs on popular benchmarks. The data and code are publicly available at https://github.com/BAAI-DCAI/Visual-Instruction-Tuning.

Authors (4)

Citations (106)

Summary

The paper introduces the novel SVIT dataset with 4.2M visual instruction examples to boost multimodal LLM training.
It leverages GPT-4 to generate diverse, detailed visual queries from captions, object names, and region data for deeper reasoning.
Experimental results indicate that SVIT-v1.5, trained on the dataset, outperforms state-of-the-art multimodal LLMs on popular benchmarks.

Scaling Up Visual Instruction Tuning with SVIT

The paper "SVIT: Scaling up Visual Instruction Tuning" aims to improve Multimodal LLMs (MLLMs) by constructing a large visual instruction dataset called SVIT. The dataset is intended for training vision-language models on tasks such as visual captioning, question answering (QA), and complex visual reasoning.

Dataset Composition and Methodology

SVIT comprises 4.2 million visual instruction tuning instances: 1.6 million conversation QA pairs, 1.6 million complex reasoning QA pairs, 1.0 million referring QA pairs, and 106 thousand detailed image descriptions. The authors generate the data by prompting GPT-4 with the rich manual annotations provided by the Visual Genome and COCO datasets. The work targets the scarcity of informative instruction data: existing multimodal instruction datasets are often limited in scale and complexity.
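The composition above can be checked with a few lines of arithmetic. The sketch below uses only the rounded figures quoted in this summary, so the shares are approximate; the names `components` and `composition_shares` are illustrative and do not come from the paper or its repository.

```python
# Component sizes as quoted in the summary (rounded figures, not exact counts).
components = {
    "conversation QA pairs": 1_600_000,
    "complex reasoning QA pairs": 1_600_000,
    "referring QA pairs": 1_000_000,
    "detailed image descriptions": 106_000,
}


def composition_shares(sizes):
    """Return each component's fraction of the summed total."""
    total = sum(sizes.values())
    return {name: count / total for name, count in sizes.items()}


total = sum(components.values())
shares = composition_shares(components)
for name, share in shares.items():
    print(f"{name}: {share:.1%}")
print(f"sum of rounded components: {total:,}")
```

The rounded components sum to roughly 4.3 million, slightly above the 4.2 million headline figure. The gap presumably reflects rounding of the per-component numbers; the exact counts are not stated in this summary.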

The data generation process prompts GPT-4 with three types of information about each image: image captions, object names with bounding boxes, and detailed region descriptions. The prompts ask for questions that require a deep understanding of the scene, so that GPT-4 produces answers involving complex reasoning and detailed perception. This differs from simpler pipelines that derive instruction data from short, descriptive captions alone, and it yields more challenging and diverse training data.
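As a rough illustration of how the three annotation types might be assembled into a single text prompt, here is a minimal sketch. Everything in it is an assumption made for illustration: the `ObjectAnnotation` class, the `build_prompt` function, the (x, y, width, height) box convention, and the instruction wording are hypothetical and are not the prompts or code used by the authors. The call to the language model itself is omitted.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class ObjectAnnotation:
    """Hypothetical container for one annotated object."""
    name: str
    box: Tuple[int, int, int, int]  # assumed (x, y, width, height) in pixels


def build_prompt(
    captions: List[str],
    objects: List[ObjectAnnotation],
    regions: List[str],
    task_instruction: str,
) -> str:
    """Concatenate captions, objects with boxes, and region descriptions."""
    lines = ["Captions:"]
    lines += [f"- {c}" for c in captions]
    lines.append("Objects (name: x, y, w, h):")
    lines += [
        f"- {o.name}: {o.box[0]}, {o.box[1]}, {o.box[2]}, {o.box[3]}"
        for o in objects
    ]
    lines.append("Region descriptions:")
    lines += [f"- {r}" for r in regions]
    lines.append("")
    lines.append(task_instruction)  # illustrative wording, not the paper's
    return "\n".join(lines)


example_prompt = build_prompt(
    captions=["A man rides a bicycle down a city street."],
    objects=[
        ObjectAnnotation("man", (120, 40, 80, 200)),
        ObjectAnnotation("bicycle", (110, 150, 120, 110)),
    ],
    regions=["a red helmet on the rider's head"],
    task_instruction="Write question-answer pairs that require reasoning about the scene.",
)
print(example_prompt)
```

Keeping the annotations in labelled sections makes it easy to swap the final instruction to request conversations, complex reasoning questions, referring questions, or a detailed description from the same image context.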

Experimental Validation

The authors ran extensive experiments to validate SVIT. They report that their trained model, SVIT-v1.5, surpasses existing state-of-the-art MLLMs on a range of benchmarks, including established tasks such as Visual Question Answering and newer evaluations such as MME perception and cognition. The results support scaling up visual instruction tuning and motivate the paper's recipe for selecting a diverse and balanced training subset. For instance, training on SVIT is reported to improve models' ability to determine object existence and relationships, count accurately, and understand complex spatial and semantic relations within an image.
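To give a feel for what selecting a diverse subset can look like in code, the sketch below implements greedy k-center selection, a generic coreset technique. This is a stand-in chosen for illustration and is not claimed to be the selection recipe used in the paper; the function name `k_center_greedy` and the toy 2-D points are assumptions, and in practice the points would be feature embeddings of training samples.

```python
import math
from typing import List, Sequence


def k_center_greedy(points: Sequence[Sequence[float]], k: int) -> List[int]:
    """Pick k indices so that each new pick is the point farthest
    from everything already selected (greedy k-center)."""
    if not 0 < k <= len(points):
        raise ValueError("k must be between 1 and len(points)")
    selected = [0]  # start from the first point; any seed would do
    # Distance from every point to its nearest selected point so far.
    min_dist = [math.dist(p, points[0]) for p in points]
    while len(selected) < k:
        next_idx = max(range(len(points)), key=min_dist.__getitem__)
        selected.append(next_idx)
        for i, p in enumerate(points):
            d = math.dist(p, points[next_idx])
            if d < min_dist[i]:
                min_dist[i] = d
    return selected


# Toy data: two tight pairs plus one isolated point.
points = [(0.0, 0.0), (0.1, 0.0), (5.0, 5.0), (5.1, 5.0), (10.0, 0.0)]
subset = k_center_greedy(points, 3)
print(subset)
```

On the toy data the selection takes one point from each cluster and skips near-duplicates, which is the behaviour a diversity-oriented subset selection is after. Balance across task types would need an additional constraint that this sketch does not include.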

Implications and Future Work

A large-scale, high-quality instruction dataset such as SVIT has clear implications for the development of MLLMs. Training on complex questions and detailed reasoning tasks could strengthen models' ability to integrate and process multimodal information. Training protocols based on SVIT may also improve zero-shot and few-shot performance, broadening the applicability of MLLMs across domains.

The authors highlight potential future advancements, such as refining coreset selection algorithms to further optimize data efficiency and expanding the dataset to include more complex and nuanced visual scenes. They also suggest opportunities for extending SVIT's applications beyond standard benchmarks, perhaps integrating it into real-world multimodal tasks that require superior visual reasoning capabilities.

In conclusion, SVIT represents a pivotal contribution to visual instruction tuning, providing a new avenue for advancing the field of multimodal learning through comprehensive and challenging datasets. This work paves the way for future research that can leverage larger and more informative datasets, thereby continuously pushing the limits of what multimodal models can achieve.