Visual Prompt Tuning

Published 23 Mar 2022 in cs.CV | (2203.12119v2)

Abstract: The current modus operandi in adapting pre-trained models involves updating all the backbone parameters, i.e., full fine-tuning. This paper introduces Visual Prompt Tuning (VPT) as an efficient and effective alternative to full fine-tuning for large-scale Transformer models in vision. Taking inspiration from recent advances in efficiently tuning LLMs, VPT introduces only a small amount (less than 1% of model parameters) of trainable parameters in the input space while keeping the model backbone frozen. Via extensive experiments on a wide variety of downstream recognition tasks, we show that VPT achieves significant performance gains compared to other parameter-efficient tuning protocols. Most importantly, VPT even outperforms full fine-tuning in many cases across model capacities and training data scales, while reducing per-task storage cost.

Authors (7)

Citations (1,228)

Summary

The paper shows that Visual Prompt Tuning, which injects a small set of trainable visual prompts amounting to less than 1% of model parameters, surpasses full fine-tuning in 20 of 24 tasks.
It introduces two variants—VPT-Shallow and VPT-Deep—that insert prompts at different layers, enabling efficient adaptation while keeping the backbone frozen.
It demonstrates robustness in low-data settings and scalability across various Transformer architectures and pre-training objectives.

The paper "Visual Prompt Tuning" by Menglin Jia et al. introduces a novel approach termed Visual Prompt Tuning (VPT) for parameter-efficient fine-tuning of large Transformer models in vision. This method is proposed as an alternative to conventional full fine-tuning, which is resource-intensive as it requires updating all model parameters to adapt to new tasks.

Summary

VPT draws inspiration from prompt tuning in NLP and achieves comparable or even superior performance to full fine-tuning while training only a fraction of the parameters. The essence of VPT lies in introducing a small number of trainable parameters into the input space, which allows the model backbone to remain frozen during fine-tuning. These parameters, referred to as "prompts", are prepended to the input sequence of a Transformer.

The effectiveness of VPT is validated through extensive experiments across 24 downstream tasks, including fine-grained visual classification and a diverse set of tasks from the VTAB-1k benchmark. Results show that VPT can outperform full fine-tuning in 20 out of 24 tasks, with less than 1% of the model parameters being trainable. Additionally, VPT exhibits remarkable performance in low-data regimes and maintains its efficacy across various data scales. The study also demonstrates VPT's applicability to different Transformer architectures, such as ViT and Swin, and its effectiveness across various pre-training objectives and model scales.

Interestingly, VPT departs from what has been observed in NLP, where prompt tuning with a smaller parameter footprint only matches, but does not exceed, the performance of full fine-tuning. The paper shows that visual prompts can surpass full fine-tuning, which makes VPT a promising direction for adapting vision Transformers.

Methodology

VPT operates in two main variants:

VPT-Shallow: Prompts are introduced only at the input of the first Transformer layer.
VPT-Deep: Prompts are introduced at the input of every Transformer layer.

In both variants, only the additional parameters (the prompts) are learned, while the entire pre-trained Transformer backbone stays frozen. This leads to substantial reductions in the storage cost and computational resources needed for adapting large-scale models to new tasks.
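The difference between the two variants can be illustrated with a minimal sketch in plain Python. It assumes the sequence layout [CLS, prompts, patch embeddings] and assumes that, in the deep variant, the prompt slots are overwritten with a fresh set of learnable prompts at the input of every layer. The names `frozen_layer`, `vpt_forward`, and `make_tokens` are hypothetical, and `frozen_layer` is only a toy stand-in for a real Transformer layer, not an attention implementation.

```python
import random


def make_tokens(count, dim, rng):
    """Return `count` random vectors of length `dim` (stand-ins for embeddings)."""
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(count)]


def frozen_layer(tokens):
    """Toy stand-in for a frozen Transformer layer.

    Each token is mixed with the mean of all tokens so that prompts can
    influence the other positions. It has no trainable parameters.
    """
    dim = len(tokens[0])
    mean = [sum(t[i] for t in tokens) / len(tokens) for i in range(dim)]
    return [[0.5 * t[i] + 0.5 * mean[i] for i in range(dim)] for t in tokens]


def vpt_forward(cls_token, patches, prompts_per_layer, num_layers, deep):
    """Pass [CLS, prompts, patches] through `num_layers` frozen layers.

    Shallow: `prompts_per_layer` holds one prompt set, inserted before the
             first layer only; later layers see whatever the previous layer
             produced in the prompt slots.
    Deep:    `prompts_per_layer` holds one prompt set per layer; the prompt
             slots are overwritten at the input of every layer.
    """
    p = len(prompts_per_layer[0])
    seq = [cls_token] + prompts_per_layer[0] + patches
    for layer_idx in range(num_layers):
        if deep and layer_idx > 0:
            # Discard the previous prompt outputs and insert this layer's prompts.
            seq = [seq[0]] + prompts_per_layer[layer_idx] + seq[1 + p:]
        seq = frozen_layer(seq)
    return seq[0]  # final [CLS] representation, which a task head would consume


# Illustrative sizes only; real models are far larger.
rng = random.Random(0)
dim, num_layers, num_prompts, num_patches = 8, 3, 4, 16
cls_token = make_tokens(1, dim, rng)[0]
patches = make_tokens(num_patches, dim, rng)

shallow_prompts = [make_tokens(num_prompts, dim, rng)]
deep_prompts = [make_tokens(num_prompts, dim, rng) for _ in range(num_layers)]

shallow_out = vpt_forward(cls_token, patches, shallow_prompts, num_layers, deep=False)
deep_out = vpt_forward(cls_token, patches, deep_prompts, num_layers, deep=True)
print(len(shallow_out), len(deep_out))
```

In a real training loop only the prompt vectors (and typically a task-specific head) would receive gradients; the sketch leaves optimization out and shows only how the prompts enter the sequence.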

Results and Implications

The paper provides rigorous empirical evidence that supports the efficacy of VPT. Key findings include:

Performance Gains: VPT-Deep surpasses full fine-tuning in 20 out of 24 tasks and achieves an average accuracy improvement across multiple benchmarks. It is particularly effective in settings with limited training data, maintaining advantages across different data scales.
Parameter Efficiency: Both VPT-Shallow and VPT-Deep utilize less than 1% of the model's parameters, highlighting their parameter efficiency compared to full fine-tuning.
Scalability: VPT is applicable to various Transformer scales (ViT-Base, Large, Huge) and maintains its benefits as the model size increases.
Robustness: VPT remains effective across different pre-training objectives (supervised and self-supervised) and model types (ViT, Swin).
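To make the "less than 1%" figure concrete, the prompt parameter count can be estimated with simple arithmetic: one prompt set of size (number of prompts × embedding dimension) for the shallow variant, and one such set per layer for the deep variant. The configuration below is an assumption for illustration (a ViT-Base-like model with 12 layers, embedding dimension 768, and roughly 86 million backbone parameters), the prompt length of 50 is hypothetical, and any task-specific classification head is excluded.

```python
def prompt_param_count(num_prompts, embed_dim, num_layers, deep):
    """Number of trainable prompt parameters for the shallow or deep variant."""
    layers_with_prompts = num_layers if deep else 1
    return layers_with_prompts * num_prompts * embed_dim


# Assumed ViT-Base-like configuration; these values are not taken from the text above.
EMBED_DIM = 768
NUM_LAYERS = 12
BACKBONE_PARAMS = 86_000_000  # rough, assumed backbone size
NUM_PROMPTS = 50              # hypothetical prompt length

for deep in (False, True):
    count = prompt_param_count(NUM_PROMPTS, EMBED_DIM, NUM_LAYERS, deep)
    name = "VPT-Deep" if deep else "VPT-Shallow"
    share = 100 * count / BACKBONE_PARAMS
    print(f"{name}: {count:,} prompt parameters ({share:.3f}% of backbone)")
```

Under these assumptions the shallow variant adds 38,400 parameters and the deep variant 460,800, both well under 1% of the backbone.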

Future Directions

The promising results of VPT open several avenues for future research:

Broader Application in Vision Tasks: Exploring the applicability of VPT to more complex vision tasks such as object detection and segmentation.
Better Understanding of Prompting Mechanisms: Investigating the fundamental differences between visual and textual prompts and why visual prompts can surpass full fine-tuning.
Optimizing Computational Efficiency: Developing more advanced techniques to reduce computational overhead during inference, especially for VPT with large prompt lengths.
Combining with Other Efficient Tuning Protocols: Exploring hybrid methods that incorporate VPT with other fine-tuning strategies, such as adapter tuning, to further improve performance and efficiency.
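The concern about long prompts can be made concrete with a back-of-the-envelope estimate. Self-attention cost grows roughly with the square of the sequence length, and prompts lengthen the sequence from 1 + n tokens to 1 + p + n. The sketch below assumes a 224×224 input split into 16×16 patches (n = 196) and ignores the parts of a layer that scale only linearly with sequence length, so it overstates the total overhead; the function name is hypothetical.

```python
def attention_cost_ratio(num_patches, num_prompts):
    """Approximate self-attention cost with prompts, relative to no prompts.

    Assumes cost proportional to the square of the sequence length, where the
    sequence is one [CLS] token, the prompts, and the patch tokens.
    """
    base_len = 1 + num_patches
    prompted_len = 1 + num_prompts + num_patches
    return (prompted_len / base_len) ** 2


NUM_PATCHES = (224 // 16) ** 2  # assumed input resolution and patch size

for p in (1, 10, 50, 100, 200):
    ratio = attention_cost_ratio(NUM_PATCHES, p)
    print(f"{p:>3} prompts: about {ratio:.2f}x the attention cost")
```

Short prompts add little under this estimate, while a prompt length comparable to the number of patches roughly quadruples the attention term.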

Conclusion

Visual Prompt Tuning is a significant step towards efficient adaptation of large vision Transformer models. By learning a small set of trainable parameters in the input space and keeping the backbone frozen, VPT achieves competitive or superior performance compared to full fine-tuning. Its robustness across different data regimes, model scales, and pre-training objectives underscores its versatility and potential as a fine-tuning strategy for large-scale vision models.
