ExPLoRA: Parameter-Efficient Extended Pre-Training to Adapt Vision Transformers under Domain Shifts

Published 16 Jun 2024 in cs.CV and cs.AI | (2406.10973v4)

Abstract: Parameter-efficient fine-tuning (PEFT) techniques such as low-rank adaptation (LoRA) can effectively adapt large pre-trained foundation models to downstream tasks using only a small fraction (0.1%-10%) of the original trainable weights. An under-explored question of PEFT is in extending the pre-training phase without supervised labels; that is, can we adapt a pre-trained foundation model to a new domain via efficient self-supervised pre-training on this domain? In this work, we introduce ExPLoRA, a highly effective technique to improve transfer learning of pre-trained vision transformers (ViTs) under domain shifts. Initializing a ViT with pre-trained weights on large, natural-image datasets such as from DinoV2 or MAE, ExPLoRA continues the unsupervised pre-training objective on a new domain, unfreezing 1-2 pre-trained ViT blocks and tuning all other layers with LoRA. We then fine-tune the resulting model only with LoRA on this new domain for supervised learning. Our experiments demonstrate state-of-the-art results on satellite imagery, even outperforming fully pre-training and fine-tuning ViTs. Using the DinoV2 training objective, we demonstrate up to 8% improvement in linear probing top-1 accuracy on downstream tasks while using <10% of the number of parameters that are used in prior fully-tuned state-of-the-art approaches. Our ablation studies confirm the efficacy of our approach over other baselines such as PEFT. Code is available on the project website: https://samar-khanna.github.io/ExPLoRA/


Authors (4)

Summary

The paper introduces ExPLoRA, a parameter-efficient method that extends pre-training through selective unfreezing and low-rank adaptation for Vision Transformers.
It achieves state-of-the-art performance on satellite benchmarks while using only 6% of the parameters of fully tuned approaches, with an 8.2% boost in linear probing top-1 accuracy.
The study underscores the importance of tuning deeper ViT layers to capture global semantic features while reducing computational costs in domain shift scenarios.

An Evaluation of ExPLoRA: Parameter-Efficient Adaptation of Vision Transformers

Whether pre-training can be extended efficiently across domain shifts is an open question in the computer vision community. The paper under review introduces ExPLoRA, a method devised to address this question for Vision Transformers (ViTs). The method builds upon existing work in Parameter-Efficient Fine-Tuning (PEFT) by extending unsupervised pre-training to a new domain without supervised labels, focusing primarily on domains such as satellite imagery.

Vision Transformers have risen to prominence for their ability to learn rich representations from large datasets through self-supervised learning strategies like DinoV2 and MAE. However, applying these models directly to domains that diverge significantly from their original training data often results in suboptimal performance. ExPLoRA addresses this performance gap by efficiently adapting these pre-trained models to new domains, leveraging the existing capacity of the foundation model while minimizing computational overhead.

The core innovation of ExPLoRA lies in fully unfreezing 1-2 ViT blocks and applying Low-Rank Adaptation (LoRA) to all other layers while continuing the self-supervised pre-training objective on the new domain. This selective unfreezing, complemented by LoRA's parameter-efficient tuning, keeps the set of trainable weights compact during the adaptation phase, since the remaining pre-trained weights stay frozen. Notably, the method achieves state-of-the-art accuracy on downstream tasks while using substantially fewer parameters than full pre-training alternatives.
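To make the LoRA part of this recipe concrete, here is a minimal pure-Python sketch of a linear layer with a low-rank update: the pre-trained weight W stays frozen and only two small factors A and B are trained, so the effective weight is W + (alpha / r) * B A. The class name `LoRALinear`, the `alpha` scaling, and the initialization scheme (small random A, zero B) follow common LoRA practice and are assumptions for illustration, not details taken from the paper or its code.

```python
import random


def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector (list of floats)."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]


class LoRALinear:
    """Illustrative frozen linear layer with a trainable low-rank update.

    Hypothetical helper for this sketch; not the ExPLoRA implementation.
    """

    def __init__(self, weight, rank, alpha=1.0, seed=0):
        rng = random.Random(seed)
        d_out, d_in = len(weight), len(weight[0])
        self.weight = weight  # frozen pre-trained weight, shape (d_out, d_in)
        # A starts small and random, B starts at zero, so the update B @ A
        # is exactly zero before any training step (assumed convention).
        self.A = [[rng.gauss(0.0, 0.01) for _ in range(d_in)] for _ in range(rank)]
        self.B = [[0.0] * rank for _ in range(d_out)]
        self.scale = alpha / rank

    def forward(self, x):
        base = matvec(self.weight, x)                 # frozen path: W x
        update = matvec(self.B, matvec(self.A, x))    # low-rank path: B (A x)
        return [b + self.scale * u for b, u in zip(base, update)]

    def num_trainable(self):
        # Only A and B would receive gradients; W is excluded.
        return len(self.A) * len(self.A[0]) + len(self.B) * len(self.B[0])


identity = [[1.0 if i == j else 0.0 for j in range(4)] for i in range(4)]
layer = LoRALinear(identity, rank=2)
output = layer.forward([1.0, 2.0, 3.0, 4.0])
print(output, layer.num_trainable())
```

Because B is zero at initialization, the layer reproduces the frozen model's output before training, which is why the adapted model starts from the pre-trained behaviour rather than from scratch. The 4x4 layer here trains 16 values, the same as the full matrix; the savings only appear when the rank is much smaller than the layer width.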

Results on satellite imagery datasets demonstrate ExPLoRA's competitive edge. For instance, on the fMoW-RGB benchmark, it surpasses fully tuned models while using only 6% of their parameters. Moreover, an 8.2% increase in linear probing top-1 accuracy over existing methods points to stronger feature extraction across domain shifts.
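As a rough sanity check on the 6% figure, the following back-of-the-envelope sketch counts trainable parameters for a ViT-L-sized encoder (width 1024, 24 blocks, MLP ratio 4) with one fully unfrozen block and LoRA on the remaining blocks. The LoRA rank of 64 and the choice to adapt only the query and value projections are assumptions made for this illustration; the count also ignores the patch embedding, positional embeddings, class token, and final norm.

```python
def vit_block_params(dim, mlp_ratio=4):
    """Parameter count of one standard ViT block, biases included."""
    attention = 4 * dim * dim + 4 * dim       # fused qkv plus output projection
    hidden = mlp_ratio * dim
    mlp = 2 * dim * hidden + hidden + dim     # two linear layers with biases
    norms = 2 * 2 * dim                       # two LayerNorms (scale and shift)
    return attention + mlp + norms


def lora_params(dim, rank, n_matrices=2):
    """LoRA factors for n_matrices square (dim x dim) projections in a block."""
    return n_matrices * rank * (dim + dim)


def explora_trainable_fraction(dim, depth, rank, n_unfrozen=1):
    """Trainable vs. total block parameters under the assumed recipe."""
    block = vit_block_params(dim)
    total = depth * block
    trainable = n_unfrozen * block + (depth - n_unfrozen) * lora_params(dim, rank)
    return trainable, total, trainable / total


# Assumed setting: ViT-L dimensions, rank 64, LoRA on Q and V, one block unfrozen.
trainable, total, fraction = explora_trainable_fraction(dim=1024, depth=24, rank=64)
print(f"{trainable:,} of {total:,} block parameters trainable ({fraction:.1%})")
```

Under these assumptions the trainable share comes out at roughly 6%, and most of it belongs to the single unfrozen block rather than to the LoRA factors. Setting `n_unfrozen=2` shows how quickly a second unfrozen block raises the budget.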

The paper's investigation of varied cross-domain scenarios highlights the adaptability of ExPLoRA. Through methodical ablations and analysis of what individual ViT layers encode, the authors establish the importance of tuning deeper layers in the ViT hierarchy to capture more global, semantic information. This analysis not only validates ExPLoRA’s strategy of selective unfreezing and LoRA-tuning but also supports its applicability to a broad spectrum of visual domains beyond natural images.

The broader implication of ExPLoRA is its demonstration that computational expense can be reduced while retaining, or even improving, performance on domain-specific tasks. Practically, this means that researchers and practitioners can transfer strong pre-trained models from one domain to another using only a fraction of the resources that traditional methods would require.

Future research may explore the nuanced relationship between the low-rank updates and the feature representations learned during this efficient transfer process. Moreover, the applicability of ExPLoRA's principles to other modalities or architectures remains a fertile ground for exploration. Thus, ExPLoRA not only contributes a robust method for ViTs under domain shifts but also spurs further inquiry into efficient deep learning paradigms.
