CLIP-Adapter: Better Vision-Language Models with Feature Adapters

(2110.04544)
Published Oct 9, 2021 in cs.CV and cs.CL

Abstract

Large-scale contrastive vision-language pre-training has shown significant progress in visual representation learning. Unlike traditional visual systems trained on a fixed set of discrete labels, a new paradigm was introduced in CLIP (Radford et al., 2021) to directly learn to align images with raw texts in an open-vocabulary setting. On downstream tasks, a carefully chosen text prompt is employed to make zero-shot predictions. To avoid non-trivial prompt engineering, context optimization (CoOp; Zhou et al., 2021) has been proposed to learn continuous vectors as task-specific prompts from few-shot training examples. In this paper, we show that there is an alternative path to better vision-language models beyond prompt tuning. While prompt tuning operates on the textual inputs, we propose CLIP-Adapter to perform fine-tuning with feature adapters on either the visual or the language branch. Specifically, CLIP-Adapter adopts an additional bottleneck layer to learn new features and performs residual-style feature blending with the original pre-trained features. As a consequence, CLIP-Adapter outperforms context optimization while maintaining a simple design. Experiments and extensive ablation studies on various visual classification tasks demonstrate the effectiveness of our approach.

Figure: Few-shot learning on 11 datasets. CLIP-Adapter outperforms previous baselines across various numbers of training shots.

Overview

  • The paper introduces CLIP-Adapter, a new method to enhance vision-language models through fine-tuning lightweight feature adapters rather than relying on prompt tuning.

  • CLIP-Adapter integrates small trainable bottleneck layers into the pre-trained CLIP model, improving performance in few-shot learning scenarios and simplifying the adaptation process compared to prompt-tuning strategies.

  • Extensive empirical validation across eleven classification datasets shows that CLIP-Adapter consistently outperforms baseline models, particularly in data-scarce settings, demonstrating significant performance gains in few-shot scenarios.

CLIP-Adapter: Enhancing Vision-Language Models with Feature Adapters

The paper "CLIP-Adapter: Better Vision-Language Models with Feature Adapters" introduces a novel approach for improving vision-language models by utilizing feature adapters instead of prompt tuning. The authors, Peng Gao et al., propose CLIP-Adapter, which fine-tunes additional light-weight bottleneck layers to the pre-trained CLIP model to enhance its performance in few-shot learning scenarios.

Background

The CLIP-Adapter leverages the success of CLIP (Contrastive Language-Image Pre-training), which aligns images with textual descriptions using a large-scale dataset of image-text pairs. While CLIP has shown remarkable zero-shot classification capabilities, its dependency on carefully hand-crafted prompts presents a significant limitation. To circumvent the need for prompt engineering, the CLIP-Adapter introduces a fine-tuning mechanism using feature adapters.
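
For concreteness, zero-shot CLIP classification compares an image embedding against text embeddings of prompted class names. Below is a minimal sketch assuming the openai/CLIP package; the class names, prompt template, and image path are illustrative placeholders, not values from the paper.

```python
import torch
import clip  # openai/CLIP package: pip install git+https://github.com/openai/CLIP.git
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

# Hand-crafted prompt template -- the exact wording can noticeably affect accuracy,
# which is the prompt-engineering burden CLIP-Adapter aims to avoid.
class_names = ["airplane", "dog", "forest"]  # placeholder label set
texts = clip.tokenize([f"a photo of a {c}." for c in class_names]).to(device)
image = preprocess(Image.open("example.jpg")).unsqueeze(0).to(device)  # placeholder image

with torch.no_grad():
    img_feat = model.encode_image(image)
    txt_feat = model.encode_text(texts)
    img_feat = img_feat / img_feat.norm(dim=-1, keepdim=True)
    txt_feat = txt_feat / txt_feat.norm(dim=-1, keepdim=True)
    # Scaled cosine similarities act as the zero-shot classification logits.
    probs = (100.0 * img_feat @ txt_feat.T).softmax(dim=-1)
print(probs)
```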

Core Contributions

  1. Residual-Style Feature Blending: The CLIP-Adapter introduces a mechanism of residual-style feature blending. This involves adding small trainable bottleneck layers that adjust either the visual or textual representations from the pre-trained CLIP model. The adapted features blend with the original features via residual connections, allowing the model to retain the knowledge from pre-training while incorporating new learning from few-shot examples. A minimal code sketch of this mechanism follows the list.
  2. Simplified Adaptation: The proposed method simplifies the design compared to prompt-tuning strategies like CoOp. CLIP-Adapter specifically avoids the intricacies of designing task-specific continuous prompts by focusing on fine-tuning additional lightweight layers. This approach leads to better few-shot classification performance with a less complex adaptation process.
  3. Empirical Validation: The authors validate their method on eleven classification datasets, demonstrating consistent performance improvements over baseline models including zero-shot CLIP, linear probe CLIP, and CoOp. The experiments reveal that CLIP-Adapter achieves superior results, particularly in data-scarce scenarios such as 1-shot and 2-shot settings.
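
The following is a minimal PyTorch sketch of the adapter described in point 1, applied to a frozen CLIP feature. The bottleneck reduction factor (4) and the residual ratio alpha (0.2) are illustrative defaults rather than values prescribed here; in the paper alpha is a hyperparameter tuned per dataset.

```python
import torch
import torch.nn as nn

class CLIPAdapter(nn.Module):
    """Bottleneck adapter with residual-style feature blending (sketch)."""

    def __init__(self, dim: int, reduction: int = 4, alpha: float = 0.2):
        super().__init__()
        self.alpha = alpha  # residual ratio: how much of the adapted feature to mix in
        self.bottleneck = nn.Sequential(
            nn.Linear(dim, dim // reduction, bias=False),
            nn.ReLU(inplace=True),
            nn.Linear(dim // reduction, dim, bias=False),
            nn.ReLU(inplace=True),
        )

    def forward(self, clip_feature: torch.Tensor) -> torch.Tensor:
        # Blend the newly learned feature with the frozen pre-trained feature,
        # so few-shot adaptation does not overwrite CLIP's prior knowledge.
        adapted = self.bottleneck(clip_feature)
        return self.alpha * adapted + (1.0 - self.alpha) * clip_feature
```

During few-shot training only the adapter weights are updated; both CLIP encoders stay frozen, and the blended feature is classified by cosine similarity against the text embeddings, exactly as in zero-shot CLIP.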

Detailed Analysis of Experimental Results

The experiments, conducted under various few-shot settings (1, 2, 4, 8, and 16 shots), demonstrate significant performance gains for CLIP-Adapter, particularly in comparison to zero-shot CLIP and CoOp. For example, the absolute improvements over zero-shot CLIP on fine-grained datasets such as EuroSAT and DTD range from roughly 20% to 50% under the 16-shot setting. These results highlight the model's robustness across different domains.

The study also explores the residual hyperparameter $\alpha$, showing that its optimal value varies with dataset characteristics. Fine-grained datasets tend to favor higher values of $\alpha$, indicating a need for more adaptation to the new examples; conversely, generic datasets like ImageNet benefit from a lower $\alpha$, suggesting substantial retention of pre-trained knowledge.
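
Concretely, with $f$ denoting the frozen CLIP feature and $A(\cdot)$ the bottleneck adapter, the blended feature can be written as

$f^{\ast} = \alpha \, A(f) + (1 - \alpha) \, f$

so that $\alpha = 0$ recovers the original zero-shot CLIP behavior, while larger values of $\alpha$ shift more weight onto the newly learned few-shot features.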

Theoretical and Practical Implications

Theoretically, CLIP-Adapter supports the idea that vision-language models can benefit from a hybrid approach that complements zero-shot learning with targeted adaptation. This addresses the limitations posed by prompt engineering and extends the model's applicability across diverse tasks without intensive manual tuning.

Practically, CLIP-Adapter's ability to efficiently handle few-shot learning makes it particularly valuable for applications where large annotated datasets are unavailable. Use cases could range from medical imaging to satellite imagery classification, where labeled data is typically scarce.

Prospective Future Work

Future directions include extending CLIP-Adapter beyond classification to other vision-language tasks such as object detection, image captioning, and visual question answering. Additionally, integrating CLIP-Adapter with other forms of prompt tuning might unleash the full potential of vision-language models by combining adaptable feature learning with dynamic prompt design.

In summary, CLIP-Adapter presents a compelling alternative to prompt tuning, offering a simplified yet effective method for advancing vision-language models. By fine-tuning feature adapters, the approach demonstrates significant improvements in few-shot learning scenarios while maintaining a straightforward implementation. This work lays a foundation for future advancements in adaptive learning frameworks in the field of AI.
