Convolutional Bypasses Are Better Vision Transformer Adapters

Published 14 Jul 2022 in cs.CV | (2207.07039v3)

Abstract: The pretrain-then-finetune paradigm has been widely adopted in computer vision. But as the size of Vision Transformer (ViT) grows exponentially, the full finetuning becomes prohibitive in view of the heavier storage overhead. Motivated by parameter-efficient transfer learning (PETL) on language transformers, recent studies attempt to insert lightweight adaptation modules (e.g., adapter layers or prompt tokens) to pretrained ViT and only finetune these modules while the pretrained weights are frozen. However, these modules were originally proposed to finetune language models and did not take into account the prior knowledge specifically for visual tasks. In this paper, we propose to construct Convolutional Bypasses (Convpass) in ViT as adaptation modules, introducing only a small amount (less than 0.5% of model parameters) of trainable parameters to adapt the large ViT. Different from other PETL methods, Convpass benefits from the hard-coded inductive bias of convolutional layers and thus is more suitable for visual tasks, especially in the low-data regime. Experimental results on VTAB-1K benchmark and few-shot learning datasets show that Convpass outperforms current language-oriented adaptation modules, demonstrating the necessity to tailor vision-oriented adaptation modules for adapting vision models.

Authors (2)

Citations (116)

Summary

The paper presents Convpass, an adapter that adds convolutional bypasses to ViT models and trains fewer than 0.5% of the model's parameters.
It places convolutional layers alongside the existing ViT blocks to inject the spatial inductive biases that language-oriented PETL methods lack.
Experimental results on the VTAB-1K benchmark and few-shot tasks show that Convpass outperforms existing methods, particularly when training data is scarce.

This paper addresses parameter-efficient transfer learning (PETL) for adapting large Vision Transformer (ViT) models to downstream visual tasks. With ViT models growing exponentially in size, full fine-tuning becomes impractical because of the storage overhead it incurs. PETL strategies that originated in NLP have been applied to ViT, but they do not encode the inductive biases that visual tasks benefit from, which the authors argue is a limitation.

The authors propose Convolutional Bypasses (Convpass), an adaptation module tailored to ViT models. Convpass inserts lightweight convolutional modules into ViT, introducing a hard-coded inductive bias that is better aligned with visual tasks. Its trainable parameters amount to less than 0.5% of the full model, which keeps adaptation efficient even in data-constrained scenarios.
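To make the "less than 0.5%" figure concrete, here is a rough back-of-the-envelope count. It is only a sketch under assumed settings that do not appear in the summary above: a ViT-B/16-like backbone (12 layers, width 768, roughly 86 million parameters), two bypasses per layer, and each bypass built as a 1x1 convolution down to 8 channels, a 3x3 convolution, and a 1x1 convolution back up, all with biases.

```python
# Hypothetical ViT-B/16-like settings; none of these values come from the
# summary text, they are assumptions chosen for illustration.
EMBED_DIM = 768                # token width of the backbone
NUM_LAYERS = 12                # number of transformer layers
BOTTLENECK = 8                 # assumed hidden channels inside each bypass
MODULES_PER_LAYER = 2          # one bypass next to MHSA, one next to the MLP
BACKBONE_PARAMS = 86_000_000   # rough size of the frozen backbone


def conv_params(c_in, c_out, kernel):
    """Weights plus biases of a 2D convolution with a square kernel."""
    return c_in * c_out * kernel * kernel + c_out


def bypass_params(dim, hidden):
    """Parameters of one assumed bypass: 1x1 down, 3x3 mix, 1x1 up."""
    down = conv_params(dim, hidden, 1)
    mix = conv_params(hidden, hidden, 3)
    up = conv_params(hidden, dim, 1)
    return down + mix + up


trainable = NUM_LAYERS * MODULES_PER_LAYER * bypass_params(EMBED_DIM, BOTTLENECK)
fraction = trainable / BACKBONE_PARAMS
print(f"{trainable:,} trainable parameters, {fraction:.2%} of the backbone")
```

Under these assumptions the script prints `327,552 trainable parameters, 0.38% of the backbone`, which is consistent with the stated bound. The count leaves out the task-specific classification head, and a different bottleneck width or backbone would change the result.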

Key Insights and Methodology

Building on the observed limitations of language-oriented PETL methods, the paper emphasizes the need for visual inductive biases in ViT adaptation. Existing PETL strategies such as Adapters, LoRA, and VPT are useful, but they were designed for language models and do not exploit the spatial properties that are crucial for visual recognition.

The proposed Convpass modules run in parallel with the Multi-Head Self-Attention (MHSA) and Multi-Layer Perceptron (MLP) blocks within ViT layers. Because ViT flattens an image into a 1D token sequence, Convpass restores the 2D spatial layout of the image tokens before applying convolution, which introduces spatial locality into the model. The [cls] token has no position in that grid, so it is treated as a separate single-token image and convolved on its own.
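The index bookkeeping behind this reshaping can be sketched in a few lines. This is an illustration only, not the authors' code: the helper names are hypothetical, and it assumes the [cls] token sits at sequence position 0, the image tokens follow in row-major order, and the 3x3 convolution uses zero padding at the borders.

```python
import math


def grid_side(seq_len):
    """Side length of the square patch grid for 1 [cls] token + side*side image tokens."""
    num_image_tokens = seq_len - 1
    side = math.isqrt(num_image_tokens)
    if side * side != num_image_tokens:
        raise ValueError("image tokens do not form a square grid")
    return side


def conv3x3_receptive_field(token_idx, side):
    """Sequence positions that a 3x3 kernel mixes into the token at `token_idx`.

    Position 0 is the [cls] token, which is convolved alone as a 1x1 image,
    so it only sees itself. Neighbours outside the grid are zero padding
    and are therefore omitted.
    """
    if token_idx == 0:
        return [0]
    row, col = divmod(token_idx - 1, side)
    positions = []
    for d_row in (-1, 0, 1):
        for d_col in (-1, 0, 1):
            r, c = row + d_row, col + d_col
            if 0 <= r < side and 0 <= c < side:
                positions.append(1 + r * side + c)
    return positions


side = grid_side(197)                     # e.g. a 14 x 14 grid plus [cls]
print(side)                               # 14
print(conv3x3_receptive_field(0, side))   # [0]
print(conv3x3_receptive_field(1, side))   # [1, 2, 15, 16]
print(conv3x3_receptive_field(16, side))  # [1, 2, 3, 15, 16, 17, 29, 30, 31]
```

The last call shows the point of the bypass: tokens 16 and 30 are 14 positions apart in the flattened sequence, yet they are vertical neighbours in the image, and the convolution mixes them directly.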

In extensive experiments on the VTAB-1K benchmark and few-shot learning datasets, Convpass outperforms language-oriented PETL methods. The gains are most pronounced in low-data regimes, where only a limited number of training samples is available.

Experimental Results

Empirical results on the VTAB-1K benchmark show that Convpass consistently outperforms state-of-the-art language-oriented PETL methods across various visual tasks. Convpass achieves a significant improvement in average results, outperforming over 75% of existing methods on the benchmark. Similarly, in few-shot learning experiments on fine-grained datasets, Convpass provides robust improvements, reinforcing its data-efficiency advantage.

The authors also compare against Swin Transformer and ConvNeXt, architectures that build visual inductive biases into the backbone itself. The comparison suggests that Convpass narrows the inductive bias gap for ViT, lifting it to competitive performance that often surpasses full fine-tuning of these counterparts.

Implications and Future Directions

The introduction of vision-oriented adaptation modules sets a precedent for customizing transfer learning techniques to vision transformers. Convpass reduces the storage cost of adapting ViT and also points to further exploration of convolution-based modules that enhance spatial representation.

As ViT and other transformer models continue to proliferate in visual domains, integrating structured inductive biases will be crucial for handling diverse datasets and tasks efficiently. Future work could explore hybrid architectures that refine the balance between efficiency and inductive bias, possibly with architectural adjustments that depend on the task.

In conclusion, the paper offers a thorough investigation and a compelling case for Convpass as a PETL method tailored specifically to vision transformers. The approach broadens the scope of vision-oriented adaptation and provides a basis for later work on efficient fine-tuning of large-scale visual models.
