Robust fine-tuning of zero-shot models

(arXiv:2109.01903)
Published Sep 4, 2021 in cs.CV and cs.LG

Abstract

Large pre-trained models such as CLIP or ALIGN offer consistent accuracy across a range of data distributions when performing zero-shot inference (i.e., without fine-tuning on a specific dataset). Although existing fine-tuning methods substantially improve accuracy on a given target distribution, they often reduce robustness to distribution shifts. We address this tension by introducing a simple and effective method for improving robustness while fine-tuning: ensembling the weights of the zero-shot and fine-tuned models (WiSE-FT). Compared to standard fine-tuning, WiSE-FT provides large accuracy improvements under distribution shift, while preserving high accuracy on the target distribution. On ImageNet and five derived distribution shifts, WiSE-FT improves accuracy under distribution shift by 4 to 6 percentage points (pp) over prior work while increasing ImageNet accuracy by 1.6 pp. WiSE-FT achieves similarly large robustness gains (2 to 23 pp) on a diverse set of six further distribution shifts, and accuracy gains of 0.8 to 3.3 pp compared to standard fine-tuning on seven commonly used transfer learning datasets. These improvements come at no additional computational cost during fine-tuning or inference.

WiSE-FT improves robustness and accuracy on distribution shifts and the reference distribution via model interpolation.

Overview

  • The paper introduces Weight-space Ensembling Fine-tuning (WiSE-FT), which involves standard fine-tuning on the target distribution followed by linearly interpolating the weights of the zero-shot model and the fine-tuned model to preserve robustness while incorporating specificity.

  • Experiments on ImageNet and other data distributions show that WiSE-FT improves accuracy under distribution shift by up to 6 percentage points and increases ImageNet accuracy by 1.6 percentage points without additional computational costs.

  • WiSE-FT shows performance improvements not only for CLIP models but also for ALIGN, BASIC, and ViT models. Future research could explore automated selection of interpolation coefficients, application in other domains beyond image classification, and advanced ensembling techniques.

An Expert Analysis of "Robust fine-tuning of zero-shot models"

The paper "Robust fine-tuning of zero-shot models" addresses a significant challenge in leveraging large pre-trained models such as CLIP, ALIGN, and BASIC, which exhibit robust performance across diverse data distributions in their zero-shot settings. The research investigates a novel and practical methodology for fine-tuning these zero-shot models to improve their robustness while maintaining high target distribution accuracy. This is achieved through a technique termed Weight-space Ensembling Fine-tuning (WiSE-FT).

The primary motivation stems from the observation that although current fine-tuning techniques enhance model accuracy on specific target distributions, this often comes at the cost of reduced robustness to distributional shifts. The paper introduces WiSE-FT as a method to preserve robustness by combining the weights of the zero-shot and fine-tuned models via a linear interpolation mechanism.

Key Contributions and Findings:

Methodology of WiSE-FT:

  • The process consists of two stages: standard fine-tuning on the target distribution followed by linearly interpolating the weights of the zero-shot model and the fine-tuned model. This approach aims to harness the robustness of the pre-trained model while incorporating the specificity learned during fine-tuning.
  • Although neural networks are highly non-linear, the paper's analysis and empirical studies show that interpolating in weight space (rather than ensembling model outputs) is effective; a minimal sketch of the interpolation follows this list.
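
The core operation is a per-parameter linear interpolation between the two sets of weights. The sketch below assumes two PyTorch models with identical architectures (for example, a CLIP image encoder plus classification head before and after fine-tuning); the function name and structure are illustrative, not the authors' released code.

```python
import copy

import torch


def wise_ft(zero_shot_model: torch.nn.Module,
            fine_tuned_model: torch.nn.Module,
            alpha: float = 0.5) -> torch.nn.Module:
    """Return a model whose weights are (1 - alpha) * zero-shot + alpha * fine-tuned."""
    theta_0 = zero_shot_model.state_dict()
    theta_1 = fine_tuned_model.state_dict()
    assert theta_0.keys() == theta_1.keys(), "models must share an architecture"

    # Element-wise interpolation of every parameter (and buffer) tensor.
    theta = {k: (1 - alpha) * theta_0[k] + alpha * theta_1[k] for k in theta_0}

    # Load the merged weights into a copy so the input models are left untouched.
    merged = copy.deepcopy(fine_tuned_model)
    merged.load_state_dict(theta)
    return merged
```

Setting α = 0 recovers the zero-shot model and α = 1 the standard fine-tuned model; intermediate values trade off the two. The result is a single merged model evaluated as usual, which is why no extra compute is required during fine-tuning or inference.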

Empirical Performance:

  • On ImageNet and five derived distribution shifts (ImageNet-V2, ImageNet-R, ImageNet Sketch, ObjectNet, and ImageNet-A), WiSE-FT improves accuracy under distribution shift by 4 to 6 percentage points (pp) over prior methods while increasing ImageNet accuracy by 1.6 pp.
  • Similar robustness improvements (ranging from 2 to 23 pp) were observed across a diverse set of further distribution shifts, including geographic shifts in satellite imagery and wildlife recognition, as well as temporal perturbations in videos.
  • These enhancements are achieved without additional computational costs during either fine-tuning or inference.

Hyperparameter Sensitivity:

  • WiSE-FT also addresses the hyperparameter brittleness of standard fine-tuning: variations in learning rate, number of epochs, and regularization strongly affect robustness under distribution shift, and WiSE-FT effectively mitigates this sensitivity.

Broader Applicability:

  • Beyond CLIP models, WiSE-FT shows strong performance improvements when applied to other zero-shot models, including ALIGN, BASIC, and a ViT model pre-trained on JFT. These findings indicate the generalizability and robustness of the proposed method.

Improved Accuracy in Low-Data Regime:

  • WiSE-FT not only demonstrates robustness but also shows improvements in accuracy on the reference distribution. Even in scenarios with scarce fine-tuning data, WiSE-FT outperforms standard fine-tuning.

Implications and Future Directions:

The implications of this research are manifold. Practically, WiSE-FT offers a straightforward and computationally efficient strategy for enhancing the robustness of fine-tuned zero-shot models, potentially improving their deployment in real-world applications where data distributions vary significantly. Theoretically, the study provides insights into the effectiveness of interpolating model weights, connecting to broader observations about neural network loss landscapes, such as linear mode connectivity.

The research opens several avenues for future exploration:

Automated Selection of Interpolation Coefficient (α):

Developing methods to automatically select or adapt the mixing coefficient α based on the characteristics of the target data or during training could enhance the practicality of WiSE-FT.
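
A simple baseline consistent with this direction is to sweep α on held-out data and keep the best value. The sketch below assumes an `evaluate_fn(model) -> accuracy` callback on a validation split and reuses the `wise_ft` helper from the earlier sketch; both are illustrative assumptions rather than an API from the paper.

```python
import numpy as np


def select_alpha(zero_shot_model, fine_tuned_model, evaluate_fn,
                 alphas=np.linspace(0.0, 1.0, 21)):
    """Grid-search the mixing coefficient alpha on a held-out split."""
    scores = {}
    for alpha in alphas:
        merged = wise_ft(zero_shot_model, fine_tuned_model, alpha=float(alpha))
        scores[float(alpha)] = evaluate_fn(merged)  # e.g., top-1 accuracy
    best_alpha = max(scores, key=scores.get)
    return best_alpha, scores
```

Note that selecting α on reference-distribution data alone may not maximize accuracy under distribution shift, which is part of why automated or adaptive selection remains an open direction.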

Applicability Across Domains:

Investigating the applicability of WiSE-FT beyond image classification, such as in natural language processing or other domains, can demonstrate its versatility and broader impact.

Complex Weight-space Ensembling Techniques:

Exploring more sophisticated ensembling techniques beyond simple linear interpolation, such as adaptive or learned interpolation strategies, could further refine the balance between robustness and accuracy.

Overall, the findings of this paper provide a strong foundation for improving robustness in fine-tuned models, leveraging the strengths of zero-shot pre-trained representations. This research is a step towards creating more reliable and adaptable machine learning systems capable of performing well across varied and unpredictable data distributions.
