BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models

Abstract

The cost of vision-and-language pre-training has become increasingly prohibitive due to end-to-end training of large-scale models. This paper proposes BLIP-2, a generic and efficient pre-training strategy that bootstraps vision-language pre-training from off-the-shelf frozen pre-trained image encoders and frozen large language models (LLMs). BLIP-2 bridges the modality gap with a lightweight Querying Transformer (Q-Former), which is pre-trained in two stages. The first stage bootstraps vision-language representation learning from a frozen image encoder. The second stage bootstraps vision-to-language generative learning from a frozen language model. BLIP-2 achieves state-of-the-art performance on various vision-language tasks, despite having significantly fewer trainable parameters than existing methods. For example, our model outperforms Flamingo80B by 8.7% on zero-shot VQAv2 with 54x fewer trainable parameters. We also demonstrate the model's emerging capabilities of zero-shot image-to-text generation that can follow natural language instructions.
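
As a rough illustration of the pipeline the abstract describes, the sketch below wires a set of learnable query tokens that cross-attend to features from a frozen image encoder, then projects the query outputs into the embedding space of a frozen LLM as soft visual prompts. This is a minimal sketch under stated assumptions: a generic PyTorch `TransformerDecoder` stands in for the Q-Former, and the dimensions, placeholder encoder, and class name `QFormerBridge` are illustrative, not the paper's actual implementation.

```python
# Minimal sketch of the BLIP-2 wiring described in the abstract.
# Assumptions: a generic TransformerDecoder approximates the Q-Former's
# self-attention over queries plus cross-attention to image features;
# dimensions and the placeholder encoder are toy values for illustration.
import torch
import torch.nn as nn


class QFormerBridge(nn.Module):
    def __init__(self, num_queries=32, d_model=768, llm_dim=2560, num_layers=6):
        super().__init__()
        # Learnable query tokens: the trainable "bridge" parameters.
        self.queries = nn.Parameter(torch.randn(num_queries, d_model) * 0.02)
        # Self-attention over queries + cross-attention to the frozen
        # image features (passed as "memory").
        layer = nn.TransformerDecoderLayer(d_model=d_model, nhead=12, batch_first=True)
        self.blocks = nn.TransformerDecoder(layer, num_layers=num_layers)
        # Linear projection into the frozen LLM's input embedding space.
        self.to_llm = nn.Linear(d_model, llm_dim)

    def forward(self, image_feats):
        # image_feats: (batch, num_patches, d_model) from a frozen image encoder
        q = self.queries.unsqueeze(0).expand(image_feats.size(0), -1, -1)
        q = self.blocks(tgt=q, memory=image_feats)
        return self.to_llm(q)  # (batch, num_queries, llm_dim) soft visual prompts


# Frozen components stay untouched; only the bridge is trained.
frozen_image_encoder = nn.Identity()  # placeholder for a frozen ViT
bridge = QFormerBridge()

with torch.no_grad():
    feats = frozen_image_encoder(torch.randn(2, 257, 768))  # fake patch features

visual_prompts = bridge(feats)
print(visual_prompts.shape)  # torch.Size([2, 32, 2560])
```

In the paper's two-stage recipe, the Q-Former is first trained against the frozen image encoder with vision-language representation-learning objectives, and only then connected to the frozen LLM for generative learning; the sketch shows only the resulting forward wiring, not either training stage.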


References
  1. nocaps: Novel object captioning at scale. In ICCV, pp. 8947–8956
  2. Flamingo: a Visual Language Model for Few-Shot Learning
  3. Language models are few-shot learners. In NeurIPS
  4. Conceptual 12M: Pushing web-scale image-text pre-training to recognize long-tail visual concepts. In CVPR
  5. VisualGPT: Data-efficient adaptation of pretrained language models for image captioning. In CVPR, pp. 18009–18019
  6. PaLI: A Jointly-Scaled Multilingual Language-Image Model
  7. UNITER: Universal image-text representation learning. In ECCV, pp. 104–120
  8. Unifying Vision-and-Language Tasks via Text Generation
  9. Scaling Instruction-Finetuned Language Models
  10. Enabling multimodal generation on CLIP via vision-language knowledge distillation. In ACL Findings, pp. 2383–2395
  11. BERT: Pre-training of deep bidirectional transformers for language understanding. In NAACL, pp. 4171–4186
  12. Unified language model pre-training for natural language understanding and generation. In NeurIPS, pp. 13042–13054
  13. EVA: Exploring the Limits of Masked Visual Representation Learning at Scale
  14. Making the V in VQA matter: Elevating the role of image understanding in visual question answering. In CVPR, pp. 6325–6334
  15. From images to textual prompts: Zero-shot VQA with frozen large language models. In CVPR
  16. Training Compute-Optimal Large Language Models
  17. GQA: A new dataset for real-world visual reasoning and compositional question answering. In CVPR, pp. 6700–6709
  18. Scaling Up Visual and Vision-Language Representation Learning With Noisy Text Supervision
  19. A good prompt is worth millions of parameters: Low-resource prompt-based learning for vision-language models. In ACL, pp. 2763–2775
  20. Visual Genome: Connecting language and vision using crowdsourced dense image annotations. IJCV, 123(1):32–73
  21. Align before fuse: Vision and language representation learning with momentum distillation. In NeurIPS
  22. BLIP: Bootstrapping language-image pre-training for unified vision-language understanding and generation. In ICML, pp. 12888–12900
  23. Oscar: Object-semantics aligned pre-training for vision-language tasks. In ECCV, pp. 121–137
  24. Microsoft COCO: Common objects in context. In ECCV, pp. 740–755
  25. Decoupled Weight Decay Regularization
  26. MAPL: Parameter-efficient adaptation of unimodal pre-trained models for vision-language few-shot prompting. In EACL
  27. OK-VQA: A visual question answering benchmark requiring external knowledge. In CVPR
  28. Im2Text: Describing images using 1 million captioned photographs. In NIPS, pp. 1143–1151
  29. Flickr30k Entities: Collecting region-to-phrase correspondences for richer image-to-sentence models. In ICCV, pp. 2641–2649
  30. Learning Transferable Visual Models From Natural Language Supervision
  31. LAION-400M: Open Dataset of CLIP-Filtered 400 Million Image-Text Pairs
  32. Conceptual Captions: A cleaned, hypernymed, image alt-text dataset for automatic image captioning. In ACL, pp. 2556–2565
  33. LXMERT: Learning cross-modality encoder representations from transformers. In EMNLP, pp. 5099–5110
  34. Plug-and-Play VQA: Zero-shot VQA by conjoining large pretrained models with zero training. In EMNLP Findings
  35. Multimodal few-shot learning with frozen language models. In NeurIPS, pp. 200–212
  36. OFA: Unifying architectures, tasks, and modalities through a simple sequence-to-sequence learning framework. In ICML, pp. 23318–23340
  37. VLMo: Unified Vision-Language Pre-Training with Mixture-of-Modality-Experts
  38. Image as a Foreign Language: BEiT Pretraining for All Vision and Vision-Language Tasks
  39. SimVLM: Simple Visual Language Model Pretraining with Weak Supervision
  40. FILIP: Fine-grained interactive language-image pre-training. In ICLR
  41. CoCa: Contrastive Captioners are Image-Text Foundation Models
  42. Florence: A New Foundation Model for Computer Vision
  43. LiT: Zero-shot transfer with locked-image text tuning. In CVPR, pp. 18102–18112
  44. VinVL: Revisiting Visual Representations in Vision-Language Models
  45. OPT: Open Pre-trained Transformer Language Models
