BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models (2301.12597v3)

Published 30 Jan 2023 in cs.CV

Abstract: The cost of vision-and-language pre-training has become increasingly prohibitive due to end-to-end training of large-scale models. This paper proposes BLIP-2, a generic and efficient pre-training strategy that bootstraps vision-language pre-training from off-the-shelf frozen pre-trained image encoders and frozen LLMs. BLIP-2 bridges the modality gap with a lightweight Querying Transformer, which is pre-trained in two stages. The first stage bootstraps vision-language representation learning from a frozen image encoder. The second stage bootstraps vision-to-language generative learning from a frozen LLM. BLIP-2 achieves state-of-the-art performance on various vision-language tasks, despite having significantly fewer trainable parameters than existing methods. For example, our model outperforms Flamingo80B by 8.7% on zero-shot VQAv2 with 54x fewer trainable parameters. We also demonstrate the model's emerging capabilities of zero-shot image-to-text generation that can follow natural language instructions.

Summary

  • The paper introduces a two-stage bootstrapping strategy that aligns frozen image encoders with LLMs using a lightweight Q-Former.
  • It outperforms Flamingo80B by 8.7% on zero-shot VQAv2 with 54x fewer trainable parameters, and also excels at image captioning and image-text retrieval.
  • The approach cuts trainable parameters and training cost while setting new state-of-the-art results across vision-language benchmarks.

BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and LLMs

Introduction

BLIP-2 introduces a vision-language pre-training framework that achieves strong performance efficiently by leveraging frozen pre-trained image encoders and LLMs. A two-stage bootstrapping strategy aligns image representations with the LLM without retraining either large unimodal model, which substantially reduces the computational cost of vision-language pre-training, a cost that has escalated as model scales have grown.

Methodology

The core innovation of BLIP-2 is a lightweight Querying Transformer (Q-Former) that bridges the modality gap between the vision and language domains. The Q-Former is pre-trained in two stages (Figure 1).

Figure 1: Overview of BLIP-2's framework. We pre-train a lightweight Querying Transformer following a two-stage strategy to bridge the modality gap.
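
To make the bridging mechanism concrete, the snippet below is a minimal PyTorch sketch of a Q-Former-style module: a small, fixed set of learnable query embeddings that attends to the frozen image encoder's output through cross-attention. The class name, the default dimensions, and the use of nn.TransformerDecoderLayer as a stand-in for the paper's BERT-based blocks are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn

class QFormerSketch(nn.Module):
    """Minimal sketch of a Querying Transformer: learnable query embeddings
    extract visual features from a frozen image encoder via cross-attention."""

    def __init__(self, num_queries=32, hidden_dim=768, image_dim=1408, num_layers=2):
        super().__init__()
        # Learnable query tokens (BLIP-2 uses 32 queries of width 768).
        self.query_tokens = nn.Parameter(torch.zeros(1, num_queries, hidden_dim))
        nn.init.trunc_normal_(self.query_tokens, std=0.02)
        # Map frozen image-encoder features (e.g. ViT patch embeddings) to the
        # Q-Former width; image_dim=1408 is just an example value.
        self.image_proj = nn.Linear(image_dim, hidden_dim)
        # Stand-in for the BERT-based blocks with cross-attention used in the paper.
        layer = nn.TransformerDecoderLayer(d_model=hidden_dim, nhead=12,
                                           batch_first=True)
        self.blocks = nn.TransformerDecoder(layer, num_layers=num_layers)

    def forward(self, image_feats):
        # image_feats: (batch, num_patches, image_dim) from the frozen encoder.
        memory = self.image_proj(image_feats)
        queries = self.query_tokens.expand(image_feats.size(0), -1, -1)
        # Queries self-attend to each other and cross-attend to the image features.
        return self.blocks(queries, memory)  # (batch, num_queries, hidden_dim)
```

Because the number of queries is fixed, the downstream LLM later sees a constant-length visual prefix regardless of the input image's resolution, which is what keeps the bridge lightweight.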

Stage 1: Vision-Language Representation Learning

The first stage bootstraps vision-language representation learning from a frozen image encoder. BLIP-2 jointly optimizes three objectives that push the Q-Former's queries to extract the visual features most relevant to the paired text: Image-Text Contrastive learning (ITC), Image-grounded Text Generation (ITG), and Image-Text Matching (ITM) (Figure 2).

Figure 2: Model architecture of Q-Former and BLIP-2's first-stage vision-language representation learning objectives.
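
The sketch below illustrates, under simplifying assumptions, how the three first-stage losses could be combined. It glosses over BLIP-2's shared self-attention masking between queries and text, and the input tensors are assumed to come from hypothetical text-encoding, text-generation, and ITM heads attached to the Q-Former.

```python
import torch
import torch.nn.functional as F

def stage1_losses(query_out, text_cls, gen_logits, text_ids, itm_logits, itm_labels,
                  temp=0.07):
    """query_out:  (B, Q, D) Q-Former query outputs for B images
       text_cls:   (B, D)    [CLS] embeddings of the paired captions
       gen_logits: (B, T, V) logits from image-grounded text generation
       text_ids:   (B, T)    caption token ids (ITG targets; pads set to -100)
       itm_logits: (B, 2)    match / no-match logits for sampled pairs
       itm_labels: (B,)      1 = matched pair, 0 = mismatched pair"""
    B = query_out.size(0)

    # ITC: image-text contrastive loss. For each image-text pair, take the
    # highest similarity over the Q query outputs, then contrast in-batch.
    q = F.normalize(query_out, dim=-1)
    t = F.normalize(text_cls, dim=-1)
    sim = torch.einsum('iqd,jd->ijq', q, t).max(dim=-1).values / temp  # (B, B)
    targets = torch.arange(B, device=sim.device)
    loss_itc = (F.cross_entropy(sim, targets) + F.cross_entropy(sim.t(), targets)) / 2

    # ITG: image-grounded text generation (causal LM loss on the caption).
    loss_itg = F.cross_entropy(gen_logits.flatten(0, 1), text_ids.flatten(),
                               ignore_index=-100)

    # ITM: binary image-text matching on matched and hard-negative pairs.
    loss_itm = F.cross_entropy(itm_logits, itm_labels)

    return loss_itc + loss_itg + loss_itm
```

Jointly optimizing the three losses is what steers the queries toward text-relevant visual content: ITC aligns the modalities, ITG forces the queries to carry enough information to regenerate the caption, and ITM adds fine-grained pair discrimination.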

Stage 2: Vision-to-Language Generative Pre-training

In the second stage, the Q-Former's output is connected to a frozen LLM so that its queries act as a soft visual prompt, bootstrapping vision-to-language generative learning. Both decoder-based LLMs (OPT) and encoder-decoder LLMs (FlanT5) are evaluated, and in both cases the Q-Former learns to produce visual representations that the frozen LLM can directly consume (Figure 3).

Figure 3: BLIP-2's second-stage vision-to-language generative pre-training, leveraging frozen LLMs.

Because both unimodal models (the vision encoder and the LLM) remain frozen, they retain what they learned during unimodal pre-training and avoid catastrophic forgetting, so vision-language alignment is learned with only a small number of new trainable parameters.
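
A rough sketch of the second-stage wiring, assuming a HuggingFace-style causal LM interface (the class and attribute names here are illustrative, not the released implementation): the Q-Former outputs are projected by a single linear layer into the LLM's embedding space and prepended to the caption embeddings as a soft visual prompt.

```python
import torch
import torch.nn as nn

class VisualPrefixForLLM(nn.Module):
    """Sketch of stage 2: feed Q-Former outputs to a frozen causal LLM
    as a soft prompt and train with a language-modeling loss."""

    def __init__(self, qformer, llm, qformer_dim=768):
        super().__init__()
        self.qformer = qformer
        self.llm = llm
        for p in self.llm.parameters():  # the LLM stays frozen throughout
            p.requires_grad = False
        # The new stage-2 parameters: one linear projection into the LLM width.
        self.proj = nn.Linear(qformer_dim, llm.config.hidden_size)

    def forward(self, image_feats, input_ids, attention_mask):
        # (B, Q, D) query outputs -> (B, Q, llm_dim) soft prompt tokens.
        prefix = self.proj(self.qformer(image_feats))
        tok_emb = self.llm.get_input_embeddings()(input_ids)
        inputs_embeds = torch.cat([prefix, tok_emb], dim=1)

        prefix_mask = torch.ones(prefix.shape[:2], dtype=attention_mask.dtype,
                                 device=attention_mask.device)
        mask = torch.cat([prefix_mask, attention_mask], dim=1)

        # LM loss only on the text tokens; prefix positions get the -100
        # ignore label (padding in input_ids is left unmasked here for brevity).
        labels = torch.cat([torch.full(prefix.shape[:2], -100,
                                       dtype=input_ids.dtype,
                                       device=input_ids.device),
                            input_ids], dim=1)
        out = self.llm(inputs_embeds=inputs_embeds, attention_mask=mask,
                       labels=labels)
        return out.loss
```

For the encoder-decoder variant, the projected queries would instead be concatenated with a text prefix fed to the encoder, with the remainder of the caption serving as the decoder target (a prefix language-modeling loss).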

Experimental Results

BLIP-2 achieves strong results across multiple vision-language tasks, including visual question answering (VQA), image captioning, and image-text retrieval.

Visual Question Answering

For VQA, BLIP-2 shows strong accuracy, especially in the zero-shot setting: it surpasses the previous state of the art, Flamingo80B, by 8.7% on zero-shot VQAv2 while using 54x fewer trainable parameters. Its use of both frozen unimodal models contributes to its flexibility and generalization to unseen data (Figure 4).

Figure 4: Effect of vision-language representation learning on vision-to-language generative learning.
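
Zero-shot VQA in this setup amounts to running the frozen LLM on the visual prefix followed by a textual question prompt and reading off the generated answer. The templates below reflect the style of prompt used for decoder-only versus encoder-decoder LLMs, but treat the exact wording as an illustrative assumption rather than a verbatim quotation from the paper.

```python
def format_vqa_prompt(question: str, decoder_only: bool = True) -> str:
    """Build the question prompt that follows the visual prefix.
    Templates are assumed for illustration: a plain "Answer:" cue for a
    decoder-only LLM and a "Short answer:" cue for an encoder-decoder LLM."""
    if decoder_only:
        return f"Question: {question} Answer:"
    return f"Question: {question} Short answer:"
```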

Image Captioning

BLIP-2 performs strongly in both in-domain and zero-shot transfer settings for image captioning. Its results on the COCO Caption and NoCaps datasets demonstrate reliable caption generation, including for the novel objects that NoCaps targets, while outperforming models with far more trainable parameters.

Image-Text Retrieval

BLIP-2's finetuned retrieval model achieves high accuracy: finetuned on COCO and evaluated zero-shot on Flickr30K, it consistently outperforms competing methods in both image-to-text and text-to-image retrieval, indicating that the learned visual and textual representations are well aligned.
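
As a sketch of how retrieval scoring could work with the first-stage model (the itm_score callable and the precomputed embeddings here are hypothetical placeholders), candidates are first ranked by the coarse contrastive similarity used in ITC, and the top-k are then reranked with the finer-grained ITM head.

```python
import torch
import torch.nn.functional as F

def retrieve_texts(query_out, text_embs, itm_score, k=128):
    """query_out: (Q, D) Q-Former query outputs for one image.
       text_embs: (N, D) [CLS] embeddings of N candidate captions.
       itm_score: hypothetical callable mapping a caption index to a scalar
                  match score from the ITM head."""
    q = F.normalize(query_out, dim=-1)
    t = F.normalize(text_embs, dim=-1)
    # Coarse ranking: max query-to-text similarity, as in the ITC objective.
    sims = (q @ t.t()).max(dim=0).values            # (N,)
    topk = sims.topk(min(k, sims.numel())).indices
    # Fine reranking of the shortlisted candidates with the ITM head.
    rerank = torch.tensor([float(itm_score(i)) for i in topk.tolist()])
    return topk[rerank.argsort(descending=True)]
```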

Conclusion

BLIP-2 presents a computationally efficient framework that leverages frozen pre-trained image encoders and LLMs to achieve state-of-the-art results on vision-language tasks while training only a small number of new parameters. By reducing computational cost without sacrificing performance, it sets the stage for further research on multimodal conversational AI and on integrated systems capable of complex multimodal reasoning over vision and language.
