InstructBLIP: Towards General-purpose Vision-Language Models with Instruction Tuning

Abstract

Large-scale pre-training and instruction tuning have been successful at creating general-purpose language models with broad competence. However, building general-purpose vision-language models is challenging due to the rich input distributions and task diversity resulting from the additional visual input. Although vision-language pretraining has been widely studied, vision-language instruction tuning remains under-explored. In this paper, we conduct a systematic and comprehensive study on vision-language instruction tuning based on the pretrained BLIP-2 models. We gather 26 publicly available datasets, covering a wide variety of tasks and capabilities, and transform them into instruction tuning format. Additionally, we introduce an instruction-aware Query Transformer, which extracts informative features tailored to the given instruction. Trained on 13 held-in datasets, InstructBLIP attains state-of-the-art zero-shot performance across all 13 held-out datasets, substantially outperforming BLIP-2 and larger Flamingo models. Our models also lead to state-of-the-art performance when finetuned on individual downstream tasks (e.g., 90.7% accuracy on ScienceQA questions with image contexts). Furthermore, we qualitatively demonstrate the advantages of InstructBLIP over concurrent multimodal models. All InstructBLIP models are open-sourced at https://github.com/salesforce/LAVIS/tree/main/projects/instructblip.

Overview

  • InstructBLIP targets general-purpose vision-language models that can follow natural-language instructions, built around a novel instruction-aware feature extraction method.

  • The framework draws on 26 public datasets spanning 11 task categories, improving zero-shot generalization and providing a stronger initialization for downstream finetuning.

  • In systematic experiments, InstructBLIP delivers substantial improvements over BLIP-2 and the much larger Flamingo models and demonstrates strong generalization to held-out tasks.

  • InstructBLIP is positioned to enable future research in general-purpose multimodal AI and invites community collaboration with its open-sourced models.

Introduction to InstructBLIP

The dream of creating AI models that can handle a diverse set of visual and linguistic tasks through unified instructions is advancing with the InstructBLIP framework. Traditional vision-language pretraining (VLP) has shown promise but falls short of broad generalization across varied vision-language tasks. InstructBLIP takes a step forward by transforming a diverse collection of public datasets into instruction-tuning format and introducing an "instruction-aware Query Transformer" for better feature extraction. The results are compelling: InstructBLIP achieves state-of-the-art zero-shot performance on 13 held-out datasets it was never trained on, and it also excels when finetuned on individual downstream tasks, exemplified by 90.7% accuracy on ScienceQA questions with image contexts.

Vision-Language Instruction Tuning

InstructBLIP's ability to generalize stems from its approach to vision-language instruction tuning. It uses 26 datasets, mapped to 11 task categories, to capture a diverse mix of vision-language tasks. The datasets are split into "held-in" sets used for instruction tuning and "held-out" sets reserved for zero-shot evaluation. Unique to InstructBLIP is how it introduces instruction awareness into feature extraction: by feeding the instruction text to the Query Transformer alongside its learnable queries, the model adapts visual feature extraction to the task at hand, so the features passed to the language model are tailored to the instruction (see the sketch below). Moreover, a balanced sampling strategy keeps datasets of very different sizes from either dominating training or being neglected.
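To make the mechanism concrete, here is a minimal, self-contained PyTorch sketch of instruction-aware, query-based feature extraction in the spirit of the Q-Former. It is a simplified illustration under stated assumptions, not the LAVIS implementation: the class name, module layout, and single cross-attention step are inventions for clarity, and the real Q-Former is BERT-based with self- and cross-attention interleaved inside each block.

```python
import torch
import torch.nn as nn

class InstructionAwareQFormer(nn.Module):
    def __init__(self, dim=768, num_queries=32, num_heads=12, num_layers=6):
        super().__init__()
        # Learnable query tokens that will carry the extracted visual features.
        self.queries = nn.Parameter(torch.randn(1, num_queries, dim) * 0.02)
        self.num_queries = num_queries
        # Self-attention stack shared by the queries and the instruction tokens,
        # so the queries become conditioned on the instruction.
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=num_heads, batch_first=True)
        self.self_attn = nn.TransformerEncoder(layer, num_layers=num_layers)
        # Cross-attention from the queries to the frozen image encoder's features.
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, image_feats, instruction_embeds):
        # image_feats:        (B, N_img, dim) from a frozen image encoder
        # instruction_embeds: (B, N_txt, dim) embeddings of the instruction text
        b = image_feats.size(0)
        queries = self.queries.expand(b, -1, -1)
        # Instruction tokens and queries attend to each other, making the
        # queries task-aware before they ever look at the image.
        joint = self.self_attn(torch.cat([queries, instruction_embeds], dim=1))
        queries = joint[:, : self.num_queries]
        # Task-aware queries pull instruction-relevant information from the image.
        out, _ = self.cross_attn(queries, image_feats, image_feats)
        return out  # (B, num_queries, dim), later projected and fed to the frozen LLM


# Toy usage with random tensors, just to show the expected shapes.
qformer = InstructionAwareQFormer()
img = torch.randn(2, 257, 768)      # e.g. ViT patch features
instr = torch.randn(2, 16, 768)     # embedded instruction tokens
print(qformer(img, instr).shape)    # torch.Size([2, 32, 768])
```

The balanced sampling mentioned above can be sketched just as briefly, assuming the square-root weighting by dataset size described in the paper (the paper also applies some manual per-dataset adjustments, omitted here):

```python
import math
import random

def sampling_probs(dataset_sizes):
    # Probability of drawing from each dataset, proportional to sqrt(size).
    weights = {name: math.sqrt(n) for name, n in dataset_sizes.items()}
    total = sum(weights.values())
    return {name: w / total for name, w in weights.items()}

# Toy example: a large captioning corpus no longer drowns out a small QA set.
probs = sampling_probs({"caption_large": 500_000, "vqa_small": 10_000})
picked = random.choices(list(probs), weights=list(probs.values()), k=1)[0]
print(probs, picked)
```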

Quantitative and Qualitative Evidence

The empirical evidence supports InstructBLIP's advantages. It surpasses its predecessor BLIP-2 and significantly outperforms the much larger Flamingo models across all compared datasets, with average relative zero-shot improvements of roughly 15.0% over BLIP-2 and 24.8% over Flamingo-80B on their shared evaluation sets, underscoring the efficacy of instruction tuning. Ablation studies reinforce the necessity of instruction-aware features and balanced sampling: removing either component leads to marked performance declines. Qualitatively, InstructBLIP handles complex visual reasoning, grounds its responses in both visual cues and stored knowledge, and can produce either succinct or more reflective answers depending on the instruction.

Comparative Studies and Downstream Finetuning

Comparative studies highlight another critical advantage: while instruction tuning and multitask learning yield similar results on held-in datasets, instruction tuning significantly outperforms multitask learning on unseen tasks. This indicates that the instruction-tuning framework itself is what extends generalization beyond the training scope. When finetuned on downstream tasks, InstructBLIP models start from a stronger initialization than BLIP-2, improving both training efficiency and final accuracy, which translates to state-of-the-art performance on several benchmarks; a minimal sketch of the freeze-and-finetune idea follows.
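As a rough illustration of what continuing from a stronger initialization looks like in practice, the sketch below freezes most of a model and updates only a small trainable module with AdamW. The two-part "model" is a stand-in so the snippet runs; in InstructBLIP the frozen parts would correspond to the image encoder and the LLM and the trainable part to the Q-Former, and the exact set of modules unfrozen for each downstream task may differ from this assumption.

```python
import torch
import torch.nn as nn

# Stand-in modules so the example is self-contained and runnable.
model = nn.ModuleDict({
    "frozen_backbone": nn.Linear(768, 768),   # stands in for image encoder + LLM
    "qformer":         nn.Linear(768, 768),   # stands in for the Q-Former
})

# Freeze everything, then re-enable gradients only for the trainable module.
for p in model.parameters():
    p.requires_grad = False
for p in model["qformer"].parameters():
    p.requires_grad = True

trainable = [p for p in model.parameters() if p.requires_grad]
optimizer = torch.optim.AdamW(trainable, lr=1e-5, weight_decay=0.05)

# One dummy training step to show the loop shape.
x = torch.randn(4, 768)
loss = model["qformer"](model["frozen_backbone"](x)).pow(2).mean()
optimizer.zero_grad()
loss.backward()
optimizer.step()
```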

Future of Vision-Language AI

The investigation into InstructBLIP is more than a display of its capability; it opens up new possibilities for research in general-purpose multimodal AI. The framework, with its careful architecture and robust instruction-tuning recipe, paves the way for AI models that can seamlessly understand and perform tasks involving complex visual and linguistic inputs across diverse scenarios. The open-sourced InstructBLIP models go one step further, enabling a broader community of researchers and developers to contribute to this rapidly evolving field; a brief usage sketch appears below.
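For readers who want to try the released checkpoints, the snippet below shows what zero-shot inference through LAVIS typically looks like. The load_model_and_preprocess entry point is part of LAVIS, but the registry name and model-type strings used here are assumptions; consult the project README for the exact identifiers of the InstructBLIP checkpoints.

```python
import torch
from PIL import Image
from lavis.models import load_model_and_preprocess

device = "cuda" if torch.cuda.is_available() else "cpu"

# Registry name and model type are assumptions; see the repository README
# for the identifiers of the released InstructBLIP checkpoints.
model, vis_processors, _ = load_model_and_preprocess(
    name="blip2_t5_instruct",   # assumed model name in the LAVIS registry
    model_type="flant5xl",      # assumed checkpoint variant
    is_eval=True,
    device=device,
)

raw_image = Image.open("example.jpg").convert("RGB")
image = vis_processors["eval"](raw_image).unsqueeze(0).to(device)

# The instruction serves both as the Q-Former's conditioning text and the LLM prompt.
print(model.generate({"image": image, "prompt": "What is unusual about this image?"}))
```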
