CLIP Itself is a Strong Fine-tuner: Achieving 85.7% and 88.0% Top-1 Accuracy with ViT-B and ViT-L on ImageNet

Published 12 Dec 2022 in cs.CV and cs.LG | (2212.06138v1)

Abstract: Recent studies have shown that CLIP has achieved remarkable success in performing zero-shot inference while its fine-tuning performance is not satisfactory. In this paper, we identify that fine-tuning performance is significantly impacted by hyper-parameter choices. We examine various key hyper-parameters and empirically evaluate their impact in fine-tuning CLIP for classification tasks through a comprehensive study. We find that the fine-tuning performance of CLIP is substantially underestimated. Equipped with hyper-parameter refinement, we demonstrate CLIP itself is better or at least competitive in fine-tuning compared with large-scale supervised pre-training approaches or latest works that use CLIP as prediction targets in Masked Image Modeling. Specifically, CLIP ViT-Base/16 and CLIP ViT-Large/14 can achieve 85.7% and 88.0% fine-tuning Top-1 accuracy on the ImageNet-1K dataset. These observations challenge the conventional conclusion that CLIP is not suitable for fine-tuning, and motivate us to rethink recently proposed improvements based on CLIP. We will release our code publicly at https://github.com/LightDXY/FT-CLIP.

Authors (10)

Citations (29)

Summary

The paper shows that fine-tuning CLIP with ViT-B/16 and ViT-L/14 reaches 85.7% and 88.0% Top-1 accuracy on ImageNet.
The study employs precise hyper-parameter tuning, including low learning rates, layer-wise decay, and exponential moving averages to optimize performance.
The results challenge the assumption of CLIP as only a zero-shot model, highlighting its potential as a robust fine-tuning baseline for vision tasks.

Evaluation of CLIP's Fine-Tuning Capabilities on ImageNet

The paper "CLIP Itself is a Strong Fine-tuner: Achieving 85.7% and 88.0% Top-1 Accuracy with ViT-B and ViT-L on ImageNet" investigates the fine-tuning capabilities of CLIP, a vision-language model best known for its zero-shot performance. The research challenges the prevailing opinion that CLIP is unsuitable for fine-tuning by showing that adjusting a small set of hyper-parameters substantially improves its fine-tuned accuracy.

Key Contributions and Findings

This research focuses on fine-tuning CLIP's Vision Transformer (ViT) image encoders (ViT-Base/16 and ViT-Large/14) on the ImageNet-1K dataset. The paper examines the fine-tuning process step by step and evaluates several strategies for improving performance. The key ingredients are a well-chosen learning rate, an exponential moving average (EMA) of the model weights, and layer-wise learning rate decay (LLRD). The paper identifies hyper-parameter tuning as the pivotal factor: different hyper-parameter configurations largely determine whether CLIP reaches high fine-tuning accuracy.
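As a rough illustration of the EMA idea mentioned above, the following minimal sketch keeps a smoothed copy of a parameter list alongside the "live" parameters. The function name `ema_update`, the decay value 0.9, and the toy parameter updates are all assumptions made for illustration; they are not taken from the paper or its code release.

```python
from typing import List


def ema_update(ema: List[float], params: List[float], decay: float) -> List[float]:
    """Blend the current parameters into the running average.

    Each EMA entry moves a fraction (1 - decay) of the way toward
    the corresponding live parameter.
    """
    return [decay * e + (1.0 - decay) * p for e, p in zip(ema, params)]


# Toy "model" with two scalar parameters (hypothetical values).
params = [0.0, 1.0]
ema = list(params)  # the EMA copy starts as a snapshot of the weights

for step in range(3):
    # Stand-in for an optimizer step: every parameter increases by 1.
    params = [p + 1.0 for p in params]
    # Illustrative decay; real training runs typically use values much closer to 1.
    ema = ema_update(ema, params, decay=0.9)

print("live params:", params)
print("ema params: ", ema)
```

The EMA copy lags behind the live parameters and changes more smoothly; in a setup like this one, the averaged weights rather than the live ones would usually be used for evaluation.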

Hyper-Parameter Tuning: Effective fine-tuning is found to heavily rely on the choice of hyper-parameters, particularly learning rates. The strategy of using a small base learning rate coupled with LLRD proved to be crucial for maintaining the robustness of lower layers while adapting the higher layers more extensively.
Performance Benchmarks: The paper reports 85.7% Top-1 accuracy on ImageNet-1K for CLIP with ViT-B/16 and 88.0% with ViT-L/14. These results are competitive with, and in some cases better than, approaches based on large-scale supervised pre-training and recent masked image modeling (MIM) methods that use CLIP as the prediction target.
Role of Data Augmentation: The findings indicate that weaker augmentations lead to better fine-tuning results. Removing strong augmentations such as MixUp and CutMix improves accuracy, which suggests that CLIP’s pre-trained representations are already robust and do not need heavy input transformations during fine-tuning.
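The layer-wise learning rate decay described in the list above can be sketched as a simple schedule over layer depth. This is a minimal sketch under stated assumptions: the function name `layerwise_lrs`, the grouping into one embedding layer, `num_blocks` transformer blocks, and one classification head, and the values `base_lr=3e-5` and `decay=0.6` are illustrative choices, not settings confirmed by this summary.

```python
from typing import List


def layerwise_lrs(base_lr: float, decay: float, num_blocks: int) -> List[float]:
    """Return one learning rate per layer group.

    Index 0 is the embedding layer, indices 1..num_blocks are the
    transformer blocks, and index num_blocks + 1 is the head.
    The head uses base_lr; each step toward the input multiplies
    the rate by `decay`.
    """
    top = num_blocks + 1
    return [base_lr * decay ** (top - i) for i in range(top + 1)]


# Hypothetical configuration for a 12-block ViT-B-style encoder.
lrs = layerwise_lrs(base_lr=3e-5, decay=0.6, num_blocks=12)

for i, lr in enumerate(lrs):
    print(f"group {i:2d}: lr = {lr:.3e}")
```

With a decay below 1, the lowest layers receive a learning rate several orders of magnitude smaller than the head, so the pre-trained low-level features change little while the upper layers adapt to the classification task.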

Implications and Future Outlook

The study’s outcome, that CLIP reaches accuracy competitive with or better than recent CLIP-based methods through fine-tuning alone, has significant implications for how pre-trained models are understood and used. It shifts the narrative from using CLIP only for zero-shot tasks to also considering it for supervised benchmarks. This insight may inform the development of future vision-language models and calls for re-examining assumptions behind MIM methods that use CLIP as a teacher.

In practical terms, the paper's recipe may encourage compute-efficient pipelines in which fine-tuning a pre-trained model takes the place of extensive supervised training, potentially reducing the resources needed to build and deploy models.

Moreover, the demonstration of CLIP's refined potential invites further exploration into extending these fine-tuning strategies to other foundational models like Florence and OmniVL. The corroboration of CLIP’s capabilities on different resolutions further sets a precedent for re-evaluating the balance between model size, resolution, and data scale in training regimes.

Overall, this careful study of CLIP's fine-tuning provides a methodology that can serve as a baseline for future work, and it prompts reconsideration of recent improvement frameworks built on top of CLIP.