InstructBLIP: Towards General-purpose Vision-Language Models with Instruction Tuning

Abstract

Large-scale pre-training and instruction tuning have been successful at creating general-purpose language models with broad competence. However, building general-purpose vision-language models is challenging due to the rich input distributions and task diversity resulting from the additional visual input. Although vision-language pretraining has been widely studied, vision-language instruction tuning remains under-explored. In this paper, we conduct a systematic and comprehensive study on vision-language instruction tuning based on the pretrained BLIP-2 models. We gather 26 publicly available datasets, covering a wide variety of tasks and capabilities, and transform them into instruction tuning format. Additionally, we introduce an instruction-aware Query Transformer, which extracts informative features tailored to the given instruction. Trained on 13 held-in datasets, InstructBLIP attains state-of-the-art zero-shot performance across all 13 held-out datasets, substantially outperforming BLIP-2 and larger Flamingo models. Our models also lead to state-of-the-art performance when finetuned on individual downstream tasks (e.g., 90.7% accuracy on ScienceQA questions with image contexts). Furthermore, we qualitatively demonstrate the advantages of InstructBLIP over concurrent multimodal models. All InstructBLIP models are open-sourced at https://github.com/salesforce/LAVIS/tree/main/projects/instructblip.

Overview

  • InstructBLIP targets general-purpose vision-language models that can follow natural-language instructions, built around a novel instruction-aware feature extraction method.

  • The framework draws on 26 public datasets spanning 11 task categories, improving zero-shot generalization and providing a stronger initialization for downstream finetuning.

  • In systematic experiments, InstructBLIP delivers substantial improvements over BLIP-2 and the much larger Flamingo models and demonstrates strong generalization to held-out tasks.

  • InstructBLIP is positioned to enable future research in general-purpose multimodal AI and invites community collaboration with its open-sourced models.

Introduction to InstructBLIP

The dream of creating AI models that can handle a diverse set of visual and linguistic tasks through unified instructions is advancing with the InstructBLIP framework. Traditional vision-language pretraining (VLP) has shown promise but falls short of broad generalization across varied vision-language tasks. InstructBLIP takes a step forward by transforming a diverse collection of public datasets into instruction-tuning format and introducing an "instruction-aware Query Transformer" for better feature extraction. The results are compelling: InstructBLIP achieves state-of-the-art zero-shot performance on 13 held-out datasets it was never trained on, and it also excels when finetuned on individual downstream tasks, exemplified by 90.7% accuracy on ScienceQA questions with image contexts.

Vision-Language Instruction Tuning

InstructBLIP's ability to generalize stems from its approach to vision-language instruction tuning. It uses 26 datasets, mapped to 11 task categories, to capture a diverse mix of vision-language tasks. The datasets are split into "held-in" sets used for instruction tuning and "held-out" sets reserved for zero-shot evaluation. Unique to InstructBLIP is how it introduces instruction awareness into feature extraction: by feeding the instruction text to the Query Transformer alongside its learnable queries, the model adapts visual feature extraction to the task at hand, so the features passed to the language model are tailored to the instruction (see the sketch below). Moreover, a balanced sampling strategy keeps datasets of very different sizes from either dominating training or being neglected.
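To make the mechanism concrete, here is a minimal, self-contained PyTorch sketch of instruction-aware, query-based feature extraction in the spirit of the Q-Former. It is a simplified illustration under stated assumptions, not the LAVIS implementation: the class name, module layout, and single cross-attention step are inventions for clarity, and the real Q-Former is BERT-based with self- and cross-attention interleaved inside each block.

```python
import torch
import torch.nn as nn

class InstructionAwareQFormer(nn.Module):
    def __init__(self, dim=768, num_queries=32, num_heads=12, num_layers=6):
        super().__init__()
        # Learnable query tokens that will carry the extracted visual features.
        self.queries = nn.Parameter(torch.randn(1, num_queries, dim) * 0.02)
        self.num_queries = num_queries
        # Self-attention stack shared by the queries and the instruction tokens,
        # so the queries become conditioned on the instruction.
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=num_heads, batch_first=True)
        self.self_attn = nn.TransformerEncoder(layer, num_layers=num_layers)
        # Cross-attention from the queries to the frozen image encoder's features.
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, image_feats, instruction_embeds):
        # image_feats:        (B, N_img, dim) from a frozen image encoder
        # instruction_embeds: (B, N_txt, dim) embeddings of the instruction text
        b = image_feats.size(0)
        queries = self.queries.expand(b, -1, -1)
        # Instruction tokens and queries attend to each other, making the
        # queries task-aware before they ever look at the image.
        joint = self.self_attn(torch.cat([queries, instruction_embeds], dim=1))
        queries = joint[:, : self.num_queries]
        # Task-aware queries pull instruction-relevant information from the image.
        out, _ = self.cross_attn(queries, image_feats, image_feats)
        return out  # (B, num_queries, dim), later projected and fed to the frozen LLM


# Toy usage with random tensors, just to show the expected shapes.
qformer = InstructionAwareQFormer()
img = torch.randn(2, 257, 768)      # e.g. ViT patch features
instr = torch.randn(2, 16, 768)     # embedded instruction tokens
print(qformer(img, instr).shape)    # torch.Size([2, 32, 768])
```

The balanced sampling mentioned above can be sketched just as briefly, assuming the square-root weighting by dataset size described in the paper (the paper also applies some manual per-dataset adjustments, omitted here):

```python
import math
import random

def sampling_probs(dataset_sizes):
    # Probability of drawing from each dataset, proportional to sqrt(size).
    weights = {name: math.sqrt(n) for name, n in dataset_sizes.items()}
    total = sum(weights.values())
    return {name: w / total for name, w in weights.items()}

# Toy example: a large captioning corpus no longer drowns out a small QA set.
probs = sampling_probs({"caption_large": 500_000, "vqa_small": 10_000})
picked = random.choices(list(probs), weights=list(probs.values()), k=1)[0]
print(probs, picked)
```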

Quantitative and Qualitative Evidence

The empirical evidence supports InstructBLIP's advantages. It surpasses its predecessor BLIP-2 and significantly outperforms the much larger Flamingo models across all compared datasets, with average relative zero-shot improvements of roughly 15.0% over BLIP-2 and 24.8% over Flamingo-80B on their shared evaluation sets, underscoring the efficacy of instruction tuning. Ablation studies reinforce the necessity of instruction-aware features and balanced sampling: removing either component leads to marked performance declines. Qualitatively, InstructBLIP handles complex visual reasoning, grounds its responses in both visual cues and stored knowledge, and can produce either succinct or more reflective answers depending on the instruction.

Comparative Studies and Downstream Finetuning

Comparative studies highlight another critical advantage: while instruction tuning and multitask learning yield similar results on held-in datasets, instruction tuning significantly outperforms multitask learning on unseen tasks. This indicates that the instruction-tuning framework itself is what extends generalization beyond the training scope. When finetuned on downstream tasks, InstructBLIP models start from a stronger initialization than BLIP-2, improving both training efficiency and final accuracy, which translates to state-of-the-art performance on several benchmarks; a minimal sketch of the freeze-and-finetune idea follows.
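As a rough illustration of what continuing from a stronger initialization looks like in practice, the sketch below freezes most of a model and updates only a small trainable module with AdamW. The two-part "model" is a stand-in so the snippet runs; in InstructBLIP the frozen parts would correspond to the image encoder and the LLM and the trainable part to the Q-Former, and the exact set of modules unfrozen for each downstream task may differ from this assumption.

```python
import torch
import torch.nn as nn

# Stand-in modules so the example is self-contained and runnable.
model = nn.ModuleDict({
    "frozen_backbone": nn.Linear(768, 768),   # stands in for image encoder + LLM
    "qformer":         nn.Linear(768, 768),   # stands in for the Q-Former
})

# Freeze everything, then re-enable gradients only for the trainable module.
for p in model.parameters():
    p.requires_grad = False
for p in model["qformer"].parameters():
    p.requires_grad = True

trainable = [p for p in model.parameters() if p.requires_grad]
optimizer = torch.optim.AdamW(trainable, lr=1e-5, weight_decay=0.05)

# One dummy training step to show the loop shape.
x = torch.randn(4, 768)
loss = model["qformer"](model["frozen_backbone"](x)).pow(2).mean()
optimizer.zero_grad()
loss.backward()
optimizer.step()
```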

Future of Vision-Language AI

The investigation into InstructBLIP is more than a display of its capability; it opens up new possibilities for research in general-purpose multimodal AI. The framework, with its careful architecture and robust instruction-tuning recipe, paves the way for AI models that can seamlessly understand and perform tasks involving complex visual and linguistic inputs across diverse scenarios. The open-sourced InstructBLIP models go one step further, enabling a broader community of researchers and developers to contribute to this rapidly evolving field; a brief usage sketch appears below.
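For readers who want to try the released checkpoints, the snippet below shows what zero-shot inference through LAVIS typically looks like. The load_model_and_preprocess entry point is part of LAVIS, but the registry name and model-type strings used here are assumptions; consult the project README for the exact identifiers of the InstructBLIP checkpoints.

```python
import torch
from PIL import Image
from lavis.models import load_model_and_preprocess

device = "cuda" if torch.cuda.is_available() else "cpu"

# Registry name and model type are assumptions; see the repository README
# for the identifiers of the released InstructBLIP checkpoints.
model, vis_processors, _ = load_model_and_preprocess(
    name="blip2_t5_instruct",   # assumed model name in the LAVIS registry
    model_type="flant5xl",      # assumed checkpoint variant
    is_eval=True,
    device=device,
)

raw_image = Image.open("example.jpg").convert("RGB")
image = vis_processors["eval"](raw_image).unsqueeze(0).to(device)

# The instruction serves both as the Q-Former's conditioning text and the LLM prompt.
print(model.generate({"image": image, "prompt": "What is unusual about this image?"}))
```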
