A Hard-to-Beat Baseline for Training-free CLIP-based Adaptation

Published 6 Feb 2024 in cs.CV, cs.AI, and cs.LG | (2402.04087v1)

Abstract: Contrastive Language-Image Pretraining (CLIP) has gained popularity for its remarkable zero-shot capacity. Recent research has focused on developing efficient fine-tuning methods, such as prompt learning and adapter, to enhance CLIP's performance in downstream tasks. However, these methods still require additional training time and computational resources, which is undesirable for devices with limited resources. In this paper, we revisit a classical algorithm, Gaussian Discriminant Analysis (GDA), and apply it to the downstream classification of CLIP. Typically, GDA assumes that features of each class follow Gaussian distributions with identical covariance. By leveraging Bayes' formula, the classifier can be expressed in terms of the class means and covariance, which can be estimated from the data without the need for training. To integrate knowledge from both visual and textual modalities, we ensemble it with the original zero-shot classifier within CLIP. Extensive results on 17 datasets validate that our method surpasses or achieves comparable results with state-of-the-art methods on few-shot classification, imbalanced learning, and out-of-distribution generalization. In addition, we extend our method to base-to-new generalization and unsupervised learning, once again demonstrating its superiority over competing approaches. Our code is publicly available at https://github.com/mrflogs/ICLR24.

Summary

The paper introduces a training-free CLIP adaptation that integrates Gaussian Discriminant Analysis, eliminating the need for costly fine-tuning.
It reports an average improvement of 2.82% over state-of-the-art training-free baselines in few-shot classification, with experiments spanning 17 diverse datasets.
The approach extends to base-to-new generalization and unsupervised settings, demonstrating robust performance on imbalanced and out-of-distribution tasks.

This paper presents a training-free method for adapting the Contrastive Language-Image Pretraining (CLIP) model using a classical algorithm, Gaussian Discriminant Analysis (GDA). The method requires no additional training, which cuts computational cost, while aiming for results comparable to or better than state-of-the-art trained approaches on downstream tasks. The authors validate it through extensive experiments on few-shot classification, imbalanced learning, and out-of-distribution generalization.

Methodology

The paper revisits Gaussian Discriminant Analysis (GDA), a classical probabilistic classifier that assumes the features of each class follow Gaussian distributions with identical covariance. The authors apply GDA to downstream classification with CLIP, estimating the class means and the shared covariance matrix directly from the data. By Bayes' formula, the classifier can then be expressed in terms of these statistics, which sidesteps resource-intensive optimization such as stochastic gradient descent. To integrate the visual and textual modalities, the paper ensembles the GDA-based classifier with CLIP's original zero-shot classifier.
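
The closed-form construction described above can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the function names, the ridge term added to keep the covariance invertible, the uniform default prior, and the `alpha` and `scale` ensemble weights are all assumptions made for illustration. Under the shared-covariance assumption, Bayes' formula gives a linear classifier with weights w_k = Σ⁻¹ μ_k and biases b_k = log p_k − ½ μ_kᵀ Σ⁻¹ μ_k.

```python
import numpy as np

def fit_gda(features, labels, num_classes, priors=None, ridge=1e-3):
    """Build a linear classifier from class means and a shared covariance.

    No gradient-based training is involved; everything is estimated in closed form.
    `ridge` is an illustrative regularizer (an assumption, not from the paper summary).
    """
    x = np.asarray(features, dtype=np.float64)
    y = np.asarray(labels)
    dim = x.shape[1]
    # Per-class means, shape (num_classes, dim).
    means = np.stack([x[y == k].mean(axis=0) for k in range(num_classes)])
    # Shared covariance: pool the deviations of all samples from their class mean.
    centered = x - means[y]
    cov = centered.T @ centered / len(x) + ridge * np.eye(dim)
    precision = np.linalg.inv(cov)
    if priors is None:
        # Assumption: uniform class prior.
        priors = np.full(num_classes, 1.0 / num_classes)
    weights = means @ precision                      # w_k = precision @ mu_k
    bias = np.log(priors) - 0.5 * np.sum(weights * means, axis=1)
    return weights, bias

def ensemble_logits(features, text_weights, weights, bias, alpha=1.0, scale=100.0):
    """Add GDA logits to zero-shot logits (image features times text embeddings).

    `alpha` and `scale` are hypothetical hyperparameters for this sketch.
    """
    x = np.asarray(features, dtype=np.float64)
    zero_shot = scale * x @ np.asarray(text_weights, dtype=np.float64).T
    gda = x @ weights.T + bias
    return zero_shot + alpha * gda
```

In practice the inputs would be L2-normalized CLIP image features and the per-class text embeddings; predictions are obtained by taking the argmax of the ensembled logits over classes.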

Two extended variants of the proposed approach are tailored towards base-to-new generalization and unsupervised learning. For base-to-new generalization, the authors use a K-Nearest-Neighbors (KNN) strategy to synthesize samples for novel classes based on statistical similarity and extend the GDA framework to those classes. In unsupervised settings, an Expectation-Maximization (EM) strategy is employed under the assumption of Gaussian mixture distributions, allowing estimation of means and covariances from the unlabeled data.
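
For the unsupervised variant, the EM idea can be sketched as below. This is a generic EM loop for a Gaussian mixture with one shared covariance, written under stated assumptions rather than taken from the paper: the function name, the `init_resp` argument (initial soft or hard assignments, for example pseudo-labels from some initial classifier), the iteration count, and the ridge term are all hypothetical choices.

```python
import numpy as np

def em_shared_cov(features, init_resp, n_iter=20, ridge=1e-3):
    """EM for a Gaussian mixture whose components share one covariance matrix.

    `init_resp` has shape (n_samples, n_classes) and holds the initial
    assignments; it is an assumption of this sketch.
    """
    x = np.asarray(features, dtype=np.float64)
    resp = np.asarray(init_resp, dtype=np.float64)
    n, dim = x.shape
    for _ in range(n_iter):
        # M-step: update priors, means, and the shared covariance.
        nk = resp.sum(axis=0) + 1e-12           # soft counts per class
        priors = nk / nk.sum()
        means = (resp.T @ x) / nk[:, None]
        cov = ridge * np.eye(dim)
        for j in range(resp.shape[1]):
            diff = x - means[j]
            cov += (resp[:, j, None] * diff).T @ diff / n
        # E-step: with a shared covariance the log-posterior is linear in x.
        w = means @ np.linalg.inv(cov)
        logits = x @ w.T - 0.5 * np.sum(w * means, axis=1) + np.log(priors)
        logits -= logits.max(axis=1, keepdims=True)   # numerical stability
        resp = np.exp(logits)
        resp /= resp.sum(axis=1, keepdims=True)
    return means, cov, priors, resp
```

The returned means, covariance, and priors can be plugged into the same linear GDA form used in the supervised case, and the returned responsibilities serve as soft class predictions for the unlabeled samples.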

Experimental Results

The proposed GDA-based method performs robustly across 17 datasets, improving on CLIP's out-of-the-box zero-shot classification while remaining competitive with fine-tuned models. In few-shot classification, it exceeds state-of-the-art training-free baselines by 2.82% on average and achieves results comparable to methods that require training. In imbalanced learning, it improves performance on medium-shot and few-shot classes, outperforming even fully fine-tuned models. The extensions to new classes and unlabeled data further support the versatility of the approach.

Implications and Future Directions

This study makes large-scale pretrained models like CLIP more accessible in resource-constrained settings by removing the need for additional training. This is of practical value for edge computing, where computational resources are limited. More broadly, the results suggest that such adaptation can generalize well across diverse datasets without exhaustive tuning of model weights.

Future work may apply this method to dense prediction tasks such as segmentation or detection, where adapting pretrained models is commonly needed. In addition, more refined ways of estimating the covariance from limited data could further improve the model's performance.

The paper presents an efficient way to leverage pretrained architectures, extending their capabilities while conserving computational resources. The results encourage further exploration of the statistical properties of features for model adaptation in AI and computer vision.
