BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models (2301.12597v3)

Published 30 Jan 2023 in cs.CV

Abstract: The cost of vision-and-language pre-training has become increasingly prohibitive due to end-to-end training of large-scale models. This paper proposes BLIP-2, a generic and efficient pre-training strategy that bootstraps vision-language pre-training from off-the-shelf frozen pre-trained image encoders and frozen LLMs. BLIP-2 bridges the modality gap with a lightweight Querying Transformer, which is pre-trained in two stages. The first stage bootstraps vision-language representation learning from a frozen image encoder. The second stage bootstraps vision-to-language generative learning from a frozen LLM. BLIP-2 achieves state-of-the-art performance on various vision-language tasks, despite having significantly fewer trainable parameters than existing methods. For example, our model outperforms Flamingo80B by 8.7% on zero-shot VQAv2 with 54x fewer trainable parameters. We also demonstrate the model's emerging capabilities of zero-shot image-to-text generation that can follow natural language instructions.

Summary

  • The paper introduces a two-stage bootstrapping strategy that aligns frozen image encoders with LLMs using a lightweight Q-Former.
  • It outperforms Flamingo80B by 8.7% on zero-shot VQAv2 with 54x fewer trainable parameters, and also excels at image captioning and image-text retrieval.
  • The approach cuts trainable parameters and training cost while setting new state-of-the-art results across vision-language benchmarks.

BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and LLMs

Introduction

BLIP-2 introduces a vision-language pre-training framework that achieves strong performance efficiently by leveraging frozen pre-trained image encoders and LLMs. A two-stage bootstrapping strategy aligns image representations with the LLM without retraining either large unimodal model, which substantially reduces the computational cost of vision-language pre-training, a cost that has escalated as model scales have grown.

Methodology

The core innovation of BLIP-2 is a lightweight Querying Transformer (Q-Former) that bridges the modality gap between the vision and language domains. The Q-Former is pre-trained in two stages (Figure 1).

Figure 1: Overview of BLIP-2's framework. We pre-train a lightweight Querying Transformer following a two-stage strategy to bridge the modality gap.
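
To make the bridging mechanism concrete, the snippet below is a minimal PyTorch sketch of a Q-Former-style module: a small, fixed set of learnable query embeddings that attends to the frozen image encoder's output through cross-attention. The class name, the default dimensions, and the use of nn.TransformerDecoderLayer as a stand-in for the paper's BERT-based blocks are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn

class QFormerSketch(nn.Module):
    """Minimal sketch of a Querying Transformer: learnable query embeddings
    extract visual features from a frozen image encoder via cross-attention."""

    def __init__(self, num_queries=32, hidden_dim=768, image_dim=1408, num_layers=2):
        super().__init__()
        # Learnable query tokens (BLIP-2 uses 32 queries of width 768).
        self.query_tokens = nn.Parameter(torch.zeros(1, num_queries, hidden_dim))
        nn.init.trunc_normal_(self.query_tokens, std=0.02)
        # Map frozen image-encoder features (e.g. ViT patch embeddings) to the
        # Q-Former width; image_dim=1408 is just an example value.
        self.image_proj = nn.Linear(image_dim, hidden_dim)
        # Stand-in for the BERT-based blocks with cross-attention used in the paper.
        layer = nn.TransformerDecoderLayer(d_model=hidden_dim, nhead=12,
                                           batch_first=True)
        self.blocks = nn.TransformerDecoder(layer, num_layers=num_layers)

    def forward(self, image_feats):
        # image_feats: (batch, num_patches, image_dim) from the frozen encoder.
        memory = self.image_proj(image_feats)
        queries = self.query_tokens.expand(image_feats.size(0), -1, -1)
        # Queries self-attend to each other and cross-attend to the image features.
        return self.blocks(queries, memory)  # (batch, num_queries, hidden_dim)
```

Because the number of queries is fixed, the downstream LLM later sees a constant-length visual prefix regardless of the input image's resolution, which is what keeps the bridge lightweight.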

Stage 1: Vision-Language Representation Learning

The first stage bootstraps vision-language representation learning from a frozen image encoder. BLIP-2 jointly optimizes three objectives that push the Q-Former's queries to extract the visual features most relevant to the paired text: Image-Text Contrastive learning (ITC), Image-grounded Text Generation (ITG), and Image-Text Matching (ITM) (Figure 2).

Figure 2: Model architecture of Q-Former and BLIP-2's first-stage vision-language representation learning objectives.
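
The sketch below illustrates, under simplifying assumptions, how the three first-stage losses could be combined. It glosses over BLIP-2's shared self-attention masking between queries and text, and the input tensors are assumed to come from hypothetical text-encoding, text-generation, and ITM heads attached to the Q-Former.

```python
import torch
import torch.nn.functional as F

def stage1_losses(query_out, text_cls, gen_logits, text_ids, itm_logits, itm_labels,
                  temp=0.07):
    """query_out:  (B, Q, D) Q-Former query outputs for B images
       text_cls:   (B, D)    [CLS] embeddings of the paired captions
       gen_logits: (B, T, V) logits from image-grounded text generation
       text_ids:   (B, T)    caption token ids (ITG targets; pads set to -100)
       itm_logits: (B, 2)    match / no-match logits for sampled pairs
       itm_labels: (B,)      1 = matched pair, 0 = mismatched pair"""
    B = query_out.size(0)

    # ITC: image-text contrastive loss. For each image-text pair, take the
    # highest similarity over the Q query outputs, then contrast in-batch.
    q = F.normalize(query_out, dim=-1)
    t = F.normalize(text_cls, dim=-1)
    sim = torch.einsum('iqd,jd->ijq', q, t).max(dim=-1).values / temp  # (B, B)
    targets = torch.arange(B, device=sim.device)
    loss_itc = (F.cross_entropy(sim, targets) + F.cross_entropy(sim.t(), targets)) / 2

    # ITG: image-grounded text generation (causal LM loss on the caption).
    loss_itg = F.cross_entropy(gen_logits.flatten(0, 1), text_ids.flatten(),
                               ignore_index=-100)

    # ITM: binary image-text matching on matched and hard-negative pairs.
    loss_itm = F.cross_entropy(itm_logits, itm_labels)

    return loss_itc + loss_itg + loss_itm
```

Jointly optimizing the three losses is what steers the queries toward text-relevant visual content: ITC aligns the modalities, ITG forces the queries to carry enough information to regenerate the caption, and ITM adds fine-grained pair discrimination.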

Stage 2: Vision-to-Language Generative Pre-training

In the second stage, the Q-Former's output is connected to a frozen LLM so that its queries act as a soft visual prompt, bootstrapping vision-to-language generative learning. Both decoder-based LLMs (OPT) and encoder-decoder LLMs (FlanT5) are evaluated, and in both cases the Q-Former learns to produce visual representations that the frozen LLM can directly consume (Figure 3).

Figure 3: BLIP-2's second-stage vision-to-language generative pre-training, leveraging frozen LLMs.

Because both unimodal models (the vision encoder and the LLM) remain frozen, they retain what they learned during unimodal pre-training and avoid catastrophic forgetting, so vision-language alignment is learned with only a small number of new trainable parameters.
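
A rough sketch of the second-stage wiring, assuming a HuggingFace-style causal LM interface (the class and attribute names here are illustrative, not the released implementation): the Q-Former outputs are projected by a single linear layer into the LLM's embedding space and prepended to the caption embeddings as a soft visual prompt.

```python
import torch
import torch.nn as nn

class VisualPrefixForLLM(nn.Module):
    """Sketch of stage 2: feed Q-Former outputs to a frozen causal LLM
    as a soft prompt and train with a language-modeling loss."""

    def __init__(self, qformer, llm, qformer_dim=768):
        super().__init__()
        self.qformer = qformer
        self.llm = llm
        for p in self.llm.parameters():  # the LLM stays frozen throughout
            p.requires_grad = False
        # The new stage-2 parameters: one linear projection into the LLM width.
        self.proj = nn.Linear(qformer_dim, llm.config.hidden_size)

    def forward(self, image_feats, input_ids, attention_mask):
        # (B, Q, D) query outputs -> (B, Q, llm_dim) soft prompt tokens.
        prefix = self.proj(self.qformer(image_feats))
        tok_emb = self.llm.get_input_embeddings()(input_ids)
        inputs_embeds = torch.cat([prefix, tok_emb], dim=1)

        prefix_mask = torch.ones(prefix.shape[:2], dtype=attention_mask.dtype,
                                 device=attention_mask.device)
        mask = torch.cat([prefix_mask, attention_mask], dim=1)

        # LM loss only on the text tokens; prefix positions get the -100
        # ignore label (padding in input_ids is left unmasked here for brevity).
        labels = torch.cat([torch.full(prefix.shape[:2], -100,
                                       dtype=input_ids.dtype,
                                       device=input_ids.device),
                            input_ids], dim=1)
        out = self.llm(inputs_embeds=inputs_embeds, attention_mask=mask,
                       labels=labels)
        return out.loss
```

For the encoder-decoder variant, the projected queries would instead be concatenated with a text prefix fed to the encoder, with the remainder of the caption serving as the decoder target (a prefix language-modeling loss).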

Experimental Results

BLIP-2 achieves strong results across multiple vision-language tasks, including visual question answering (VQA), image captioning, and image-text retrieval.

Visual Question Answering

For VQA, BLIP-2 shows strong accuracy, especially in the zero-shot setting: it surpasses the previous state of the art, Flamingo80B, by 8.7% on zero-shot VQAv2 while using 54x fewer trainable parameters. Its use of both frozen unimodal models contributes to its flexibility and generalization to unseen data (Figure 4).

Figure 4: Effect of vision-language representation learning on vision-to-language generative learning.
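
Zero-shot VQA in this setup amounts to running the frozen LLM on the visual prefix followed by a textual question prompt and reading off the generated answer. The templates below reflect the style of prompt used for decoder-only versus encoder-decoder LLMs, but treat the exact wording as an illustrative assumption rather than a verbatim quotation from the paper.

```python
def format_vqa_prompt(question: str, decoder_only: bool = True) -> str:
    """Build the question prompt that follows the visual prefix.
    Templates are assumed for illustration: a plain "Answer:" cue for a
    decoder-only LLM and a "Short answer:" cue for an encoder-decoder LLM."""
    if decoder_only:
        return f"Question: {question} Answer:"
    return f"Question: {question} Short answer:"
```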

Image Captioning

BLIP-2 performs strongly in both in-domain and zero-shot transfer settings for image captioning. Its results on the COCO Caption and NoCaps datasets demonstrate reliable caption generation, including for the novel objects that NoCaps targets, while outperforming models with far more trainable parameters.

Image-Text Retrieval

BLIP-2's finetuned retrieval model achieves high accuracy: finetuned on COCO and evaluated zero-shot on Flickr30K, it consistently outperforms competing methods in both image-to-text and text-to-image retrieval, indicating that the learned visual and textual representations are well aligned.
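
As a sketch of how retrieval scoring could work with the first-stage model (the itm_score callable and the precomputed embeddings here are hypothetical placeholders), candidates are first ranked by the coarse contrastive similarity used in ITC, and the top-k are then reranked with the finer-grained ITM head.

```python
import torch
import torch.nn.functional as F

def retrieve_texts(query_out, text_embs, itm_score, k=128):
    """query_out: (Q, D) Q-Former query outputs for one image.
       text_embs: (N, D) [CLS] embeddings of N candidate captions.
       itm_score: hypothetical callable mapping a caption index to a scalar
                  match score from the ITM head."""
    q = F.normalize(query_out, dim=-1)
    t = F.normalize(text_embs, dim=-1)
    # Coarse ranking: max query-to-text similarity, as in the ITC objective.
    sims = (q @ t.t()).max(dim=0).values            # (N,)
    topk = sims.topk(min(k, sims.numel())).indices
    # Fine reranking of the shortlisted candidates with the ITM head.
    rerank = torch.tensor([float(itm_score(i)) for i in topk.tolist()])
    return topk[rerank.argsort(descending=True)]
```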

Conclusion

BLIP-2 presents a computationally efficient framework that leverages frozen pre-trained image encoders and LLMs to achieve state-of-the-art results on vision-language tasks while training only a small number of new parameters. By reducing computational cost without sacrificing performance, it sets the stage for further research on multimodal conversational AI and on integrated systems capable of complex multimodal reasoning over vision and language.
