CLIP-Adapter: Better Vision-Language Models with Feature Adapters

(2110.04544)
Published Oct 9, 2021 in cs.CV and cs.CL

Abstract

Large-scale contrastive vision-language pre-training has shown significant progress in visual representation learning. Unlike traditional visual systems trained on a fixed set of discrete labels, a new paradigm was introduced in CLIP (Radford et al., 2021) to directly learn to align images with raw texts in an open-vocabulary setting. On downstream tasks, a carefully chosen text prompt is employed to make zero-shot predictions. To avoid non-trivial prompt engineering, context optimization (CoOp; Zhou et al., 2021) has been proposed to learn continuous vectors as task-specific prompts from few-shot training examples. In this paper, we show that there is an alternative path to better vision-language models beyond prompt tuning. While prompt tuning operates on the textual inputs, we propose CLIP-Adapter to perform fine-tuning with feature adapters on either the visual or the language branch. Specifically, CLIP-Adapter adopts an additional bottleneck layer to learn new features and performs residual-style feature blending with the original pre-trained features. As a consequence, CLIP-Adapter outperforms context optimization while maintaining a simple design. Experiments and extensive ablation studies on various visual classification tasks demonstrate the effectiveness of our approach.

Figure: Few-shot learning on 11 datasets. CLIP-Adapter outperforms previous baselines across various numbers of training shots.

Overview

  • The paper introduces CLIP-Adapter, a new method to enhance vision-language models through fine-tuning lightweight feature adapters rather than relying on prompt tuning.

  • CLIP-Adapter integrates small trainable bottleneck layers into the pre-trained CLIP model, improving performance in few-shot learning scenarios and simplifying the adaptation process compared to prompt-tuning strategies.

  • Extensive empirical validation across eleven classification datasets shows that CLIP-Adapter consistently outperforms baseline models, particularly in data-scarce settings, demonstrating significant performance gains in few-shot scenarios.

CLIP-Adapter: Enhancing Vision-Language Models with Feature Adapters

The paper "CLIP-Adapter: Better Vision-Language Models with Feature Adapters" introduces a novel approach for improving vision-language models by utilizing feature adapters instead of prompt tuning. The authors, Peng Gao et al., propose CLIP-Adapter, which fine-tunes additional light-weight bottleneck layers to the pre-trained CLIP model to enhance its performance in few-shot learning scenarios.

Background

The CLIP-Adapter leverages the success of CLIP (Contrastive Language-Image Pre-training), which aligns images with textual descriptions using a large-scale dataset of image-text pairs. While CLIP has shown remarkable zero-shot classification capabilities, its dependency on carefully hand-crafted prompts presents a significant limitation. To circumvent the need for prompt engineering, the CLIP-Adapter introduces a fine-tuning mechanism using feature adapters.
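
For concreteness, zero-shot CLIP classification compares an image embedding against text embeddings of prompted class names. Below is a minimal sketch assuming the openai/CLIP package; the class names, prompt template, and image path are illustrative placeholders, not values from the paper.

```python
import torch
import clip  # openai/CLIP package: pip install git+https://github.com/openai/CLIP.git
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

# Hand-crafted prompt template -- the exact wording can noticeably affect accuracy,
# which is the prompt-engineering burden CLIP-Adapter aims to avoid.
class_names = ["airplane", "dog", "forest"]  # placeholder label set
texts = clip.tokenize([f"a photo of a {c}." for c in class_names]).to(device)
image = preprocess(Image.open("example.jpg")).unsqueeze(0).to(device)  # placeholder image

with torch.no_grad():
    img_feat = model.encode_image(image)
    txt_feat = model.encode_text(texts)
    img_feat = img_feat / img_feat.norm(dim=-1, keepdim=True)
    txt_feat = txt_feat / txt_feat.norm(dim=-1, keepdim=True)
    # Scaled cosine similarities act as the zero-shot classification logits.
    probs = (100.0 * img_feat @ txt_feat.T).softmax(dim=-1)
print(probs)
```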

Core Contributions

  1. Residual-Style Feature Blending: The CLIP-Adapter introduces a mechanism of residual-style feature blending. This involves adding small trainable bottleneck layers that adjust either the visual or textual representations from the pre-trained CLIP model. The adapted features blend with the original features via residual connections, allowing the model to retain the knowledge from pre-training while incorporating new learning from few-shot examples. A minimal code sketch of this mechanism follows the list.
  2. Simplified Adaptation: The proposed method simplifies the design compared to prompt-tuning strategies like CoOp. CLIP-Adapter specifically avoids the intricacies of designing task-specific continuous prompts by focusing on fine-tuning additional lightweight layers. This approach leads to better few-shot classification performance with a less complex adaptation process.
  3. Empirical Validation: The authors validate their method on eleven classification datasets, demonstrating consistent performance improvements over baseline models including zero-shot CLIP, linear probe CLIP, and CoOp. The experiments reveal that CLIP-Adapter achieves superior results, particularly in data-scarce scenarios such as 1-shot and 2-shot settings.
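
The following is a minimal PyTorch sketch of the adapter described in point 1, applied to a frozen CLIP feature. The bottleneck reduction factor (4) and the residual ratio alpha (0.2) are illustrative defaults rather than values prescribed here; in the paper alpha is a hyperparameter tuned per dataset.

```python
import torch
import torch.nn as nn

class CLIPAdapter(nn.Module):
    """Bottleneck adapter with residual-style feature blending (sketch)."""

    def __init__(self, dim: int, reduction: int = 4, alpha: float = 0.2):
        super().__init__()
        self.alpha = alpha  # residual ratio: how much of the adapted feature to mix in
        self.bottleneck = nn.Sequential(
            nn.Linear(dim, dim // reduction, bias=False),
            nn.ReLU(inplace=True),
            nn.Linear(dim // reduction, dim, bias=False),
            nn.ReLU(inplace=True),
        )

    def forward(self, clip_feature: torch.Tensor) -> torch.Tensor:
        # Blend the newly learned feature with the frozen pre-trained feature,
        # so few-shot adaptation does not overwrite CLIP's prior knowledge.
        adapted = self.bottleneck(clip_feature)
        return self.alpha * adapted + (1.0 - self.alpha) * clip_feature
```

During few-shot training only the adapter weights are updated; both CLIP encoders stay frozen, and the blended feature is classified by cosine similarity against the text embeddings, exactly as in zero-shot CLIP.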

Detailed Analysis of Experimental Results

The experiments, conducted under various few-shot settings (1, 2, 4, 8, and 16 shots), demonstrate significant performance gains for CLIP-Adapter, particularly in comparison to zero-shot CLIP and CoOp. For example, the absolute improvements over zero-shot CLIP on fine-grained datasets such as EuroSAT and DTD range from roughly 20% to 50% under the 16-shot setting. These results highlight the model's robustness across different domains.

The study also explores the residual hyperparameter $\alpha$, showing that its optimal value varies with dataset characteristics. Fine-grained datasets tend to favor higher values of $\alpha$, indicating a need for more adaptation to the new examples; conversely, generic datasets like ImageNet benefit from a lower $\alpha$, suggesting substantial retention of pre-trained knowledge.
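
Concretely, with $f$ denoting the frozen CLIP feature and $A(\cdot)$ the bottleneck adapter, the blended feature can be written as

$f^{\ast} = \alpha \, A(f) + (1 - \alpha) \, f$

so that $\alpha = 0$ recovers the original zero-shot CLIP behavior, while larger values of $\alpha$ shift more weight onto the newly learned few-shot features.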

Theoretical and Practical Implications

Theoretically, CLIP-Adapter supports the idea that vision-language models can benefit from a hybrid approach that complements zero-shot learning with targeted adaptation. This addresses the limitations posed by prompt engineering and extends the model's applicability across diverse tasks without intensive manual tuning.

Practically, CLIP-Adapter's ability to efficiently handle few-shot learning makes it particularly valuable for applications where large annotated datasets are unavailable. Use cases could range from medical imaging to satellite imagery classification, where labeled data is typically scarce.

Prospective Future Work

Future directions include extending CLIP-Adapter beyond classification to other vision-language tasks such as object detection, image captioning, and visual question answering. Additionally, integrating CLIP-Adapter with other forms of prompt tuning might unleash the full potential of vision-language models by combining adaptable feature learning with dynamic prompt design.

In summary, CLIP-Adapter presents a compelling alternative to prompt tuning, offering a simplified yet effective method for advancing vision-language models. By fine-tuning feature adapters, the approach demonstrates significant improvements in few-shot learning scenarios while maintaining a straightforward implementation. This work lays a foundation for future advancements in adaptive learning frameworks in the field of AI.
