- The paper introduces a novel meta-learning framework that bridges frozen vision and language models using a trainable meta-mapper.
- It employs bi-level optimization with episodic training, achieving significant gains on few-shot visual captioning and visual question answering (VQA) benchmarks.
- The method is computationally efficient with only ~2M trainable parameters, enabling rapid adaptation to multimodal tasks.
Introduction
Bridging vision and large language models (LLMs) for few-shot learning presents considerable challenges, particularly due to the significant modality gap and the difficulty of forming data-efficient, generalizable bindings between visual and linguistic concepts. The paper "Meta Learning to Bridge Vision and Language Models for Multimodal Few-Shot Learning" (arXiv:2302.14794) offers a principled meta-learning approach to connecting large-scale, frozen vision and language models. It does so via a trainable, lightweight meta-mapper that accrues shared meta-knowledge across multimodal few-shot tasks and thereby performs task induction in a purely data-driven, non-handcrafted manner. This essay summarizes the objectives, methodology, findings, and implications of this work.
The goal is to enable rapid adaptation to novel multimodal tasks from only a few labeled examples, mirroring human cognitive efficiency while addressing the largely unexamined domain of meta-learning in multimodal settings. Existing solutions typically rely on hand-engineered prompts to communicate visual concepts to frozen language models, which limits scalability and flexibility. The proposed method circumvents this by constructing an episodic meta-learning pipeline in which both the visual encoder and the language model are kept fixed, and only a small meta-mapper network is trained.
The architecture comprises three modules:
- Frozen Vision Encoder: A pre-trained CLIP ViT-B/32 transformer generates visual features, which are semantically aligned with textual representations.
- Frozen Language Model: A GPT-2 instance serves as the text generator; its parameters remain frozen during both meta-training and inference.
- Meta-Mapper: This lightweight network, implemented as a stack of set self-attention layers, maps visual embeddings into a latent 'visual prefix' that conditions the language model. Its parameters are meta-learned to accrue cross-task transferable knowledge.
Within this meta-learning configuration, each episodic task consists of a support set (used for fast adaptation) and a query set (used for evaluation), thereby following an optimization-based meta-learning paradigm akin to MAML [Finn et al., 2017].
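To make the conditioning mechanism concrete, the following is a minimal PyTorch sketch of a meta-mapper in this spirit: learnable prefix tokens attend jointly over themselves and the projected CLIP features, and only the prefix slots are kept as the visual prefix. Module names, dimensions, and the single attention block are illustrative assumptions, not the authors' exact implementation.

```python
import torch
import torch.nn as nn

class MetaMapper(nn.Module):
    """Maps frozen CLIP features to a 'visual prefix' for a frozen GPT-2."""

    def __init__(self, clip_dim=512, lm_dim=768, prefix_len=4, n_heads=8):
        super().__init__()
        # Learnable prefix tokens, meta-learned and shared across tasks.
        self.prefix = nn.Parameter(torch.randn(prefix_len, lm_dim))
        self.proj = nn.Linear(clip_dim, lm_dim)  # lift CLIP features to the LM width
        self.attn = nn.MultiheadAttention(lm_dim, n_heads, batch_first=True)

    def forward(self, clip_feats):
        # clip_feats: (B, T, clip_dim); T may be 1 if only the global CLIP embedding is used.
        v = self.proj(clip_feats)
        p = self.prefix.unsqueeze(0).expand(v.size(0), -1, -1)
        x = torch.cat([p, v], dim=1)          # treat prefix + visual tokens as one set
        out, _ = self.attn(x, x, x)           # permutation-invariant set self-attention
        return out[:, : self.prefix.size(0)]  # keep only the prefix slots -> visual prefix
```

At inference the visual prefix is prepended to the embedded text tokens of the frozen GPT-2, so generation is conditioned on vision while the small meta-mapper remains the only trainable component.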
Training Strategy
Meta-training proceeds by sampling batches of episodic tasks and performing bi-level optimization: inner-loop adaptation (task-specific fine-tuning of the meta-mapper on support data) and outer-loop meta-update (a gradient step on the meta-parameters with respect to performance on query data after adaptation). This procedure requires no updates to the large frozen vision and language models, yielding significant computational savings and promoting extensibility.
Crucially, the meta-mapper takes as input both the learnable visual-prefix parameters and the visual features, and applies permutation-invariant set attention [Lee et al., 2019] to induce a task-specific mapping.
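A first-order sketch of this bi-level loop is given below. It assumes a caption/VQA loss callable `loss_fn(prefix_fn, batch)`, a fixed number of inner steps, episodes given as (support, query) pairs, and a FOMAML-style approximation; these simplifications are assumptions made for illustration, not the paper's exact optimization recipe.

```python
import torch
from torch.func import functional_call

def inner_adapt(mapper, loss_fn, support, inner_lr=1e-2, steps=5):
    """Inner loop: adapt a copy of the meta-mapper parameters on the support set."""
    params = {k: v.clone() for k, v in mapper.named_parameters()}
    for _ in range(steps):
        loss = loss_fn(lambda x: functional_call(mapper, params, (x,)), support)
        grads = torch.autograd.grad(loss, list(params.values()))
        params = {k: p - inner_lr * g for (k, p), g in zip(params.items(), grads)}
    return params

def meta_train(mapper, loss_fn, episodes, meta_opt):
    """Outer loop: meta-update from the query loss of the adapted parameters."""
    for support, query in episodes:
        adapted = inner_adapt(mapper, loss_fn, support)
        query_loss = loss_fn(lambda x: functional_call(mapper, adapted, (x,)), query)
        # First-order approximation: gradients of the query loss w.r.t. the
        # adapted parameters are applied directly to the meta-parameters.
        grads = torch.autograd.grad(query_loss, list(adapted.values()))
        meta_opt.zero_grad()
        for p, g in zip(mapper.parameters(), grads):
            p.grad = g
        meta_opt.step()
```

Here `meta_opt` would be an optimizer over `mapper.parameters()` only (e.g., `torch.optim.AdamW(mapper.parameters(), lr=1e-4)`); the frozen CLIP and GPT-2 weights never receive gradients.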
Experimental Results
Experiments are conducted on reformulations of standard benchmarks (e.g., COCO2017, Real-Name miniImageNet, Real-Fast VQA, Fast-VQA), structuring them into N-way, k-shot episodic tasks for both in-domain and cross-domain few-shot evaluation (a sketch of this episode construction appears after the list of findings below). The key findings include:
- Numerical Performance: The proposed meta-learning model consistently outperforms the baseline Frozen model [Tsimpoukelli et al., 2021] on several metrics, with substantial improvements on both caption binding (miniImageNet 2-way/5-way, 1-shot/5-shot) and open-ended VQA tasks. For example, under cross-domain conditions, accuracy on Real-Name 2-way 1-shot increases from 33.7% (Frozen) to 48.2% (Ours).
- Ablation Analyses: Disabling the accumulation of meta-knowledge in the meta-mapper, or replacing its self-attention with an MLP, yields precipitous drops in performance; the model essentially fails without meta-knowledge, and set self-attention proves critical relative to MLP mappers.
- Data-Driven Task Induction: The system matches or slightly outperforms engineered task induction simply by fine-tuning on support sets, indicating that data-driven, meta-learned representations generalize effectively without explicit prompting.
- Efficiency: With ~2M trainable parameters, the proposed framework is orders of magnitude smaller than competing approaches (e.g., Flamingo [Alayrac et al., 2022]) and can be trained on standard GPUs in under two hours.
- Generalization: The approach generalizes across datasets, showing strong transfer capabilities, although word-overlap evaluation metrics may underestimate true performance because the model tends to generate valid paraphrases and richer descriptions than the ground-truth captions.
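As referenced above, the episodic reformulation amounts to repeatedly sampling small classification-style tasks from a labeled pool. The sketch below shows one plausible way to build an N-way, k-shot episode; the data layout (`by_class`, a dict mapping labels to lists of examples) and the number of query shots are assumptions, and the paper's exact construction may differ in detail.

```python
import random

def sample_episode(by_class, n_way=2, k_shot=1, q_query=5):
    """Build one N-way, k-shot episode: a support set and a held-out query set."""
    classes = random.sample(list(by_class), n_way)
    support, query = [], []
    for label in classes:
        examples = random.sample(by_class[label], k_shot + q_query)
        support += [(x, label) for x in examples[:k_shot]]  # used for inner-loop adaptation
        query += [(x, label) for x in examples[k_shot:]]    # used to score the adapted model
    return support, query
```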
Theoretical and Practical Implications
The study advances several theoretical insights:
- Modality Bridging: By restricting training to a small meta-mapper and leveraging fixed, pre-trained encoders, the method demonstrates that meta-learned bridges can efficiently facilitate rapid adaptation and robust conditional generation with minimal labeled data.
- Prompt-Less Task Induction: Unlike approaches relying on prompt engineering or context construction, the method shows that task induction can emerge in a purely data-driven, self-attention-mediated way.
- Computational Practicality: The low resource cost broadens the applicability of large-scale vision-language reasoning systems in practical settings, including those with severely constrained data regimes.
On the practical side, such frameworks can enable robust few-shot learning in robotics, medical imaging, or any scenario requiring rapid adaptation to new multimodal tasks with minimal labeled supervision.
Limitations and Future Directions
While the framework achieves strong empirical gains, it remains limited by the inherent biases of the underlying frozen models (e.g., CLIP, GPT-2) and the open-vocabulary nature of its generative design. Also, the current study does not extend to additional modalities such as audio or video, though the modular architecture is amenable to such extensions. Evaluation metrics relying on ground-truth word match may not adequately capture the qualitative advancements of the approach. Further research should explore reference-free evaluation schemes (e.g., CLIPScore [Hessel et al., 2021]) and extend the framework to more complex generative tasks and modalities.
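For context, CLIPScore is reference-free: it scores a candidate caption directly against the image as CLIP-S(c, v) = w · max(cos(c, v), 0) with w = 2.5. A minimal sketch using the Hugging Face CLIP implementation is shown below; the model identifier and preprocessing choices are illustrative, not prescribed by the paper.

```python
import torch
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32").eval()
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def clipscore(image, caption, w=2.5):
    """Reference-free caption score: w * max(cosine(image, caption), 0)."""
    inputs = processor(text=[caption], images=image, return_tensors="pt", padding=True)
    with torch.no_grad():
        img = model.get_image_features(pixel_values=inputs["pixel_values"])
        txt = model.get_text_features(input_ids=inputs["input_ids"],
                                      attention_mask=inputs["attention_mask"])
    cos = torch.nn.functional.cosine_similarity(img, txt).item()
    return w * max(cos, 0.0)
```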
Conclusion
The meta-learning approach introduced in this work establishes a competitive paradigm for multimodal few-shot learning by bridging large, frozen vision and language models with a learnable, attention-based meta-mapper. The resulting system induces task semantics in a prompt-free, data-driven way, achieves strong performance on recognized benchmarks, and is computationally efficient. These findings position meta-learning as a viable and extensible foundation for future research in scalable and adaptive multimodal reasoning (arXiv:2302.14794).