Training data-efficient image transformers & distillation through attention

Published 23 Dec 2020 in cs.CV | (2012.12877v2)

Abstract: Recently, neural networks purely based on attention were shown to address image understanding tasks such as image classification. However, these visual transformers are pre-trained with hundreds of millions of images using an expensive infrastructure, thereby limiting their adoption. In this work, we produce a competitive convolution-free transformer by training on Imagenet only. We train them on a single computer in less than 3 days. Our reference vision transformer (86M parameters) achieves top-1 accuracy of 83.1% (single-crop evaluation) on ImageNet with no external data. More importantly, we introduce a teacher-student strategy specific to transformers. It relies on a distillation token ensuring that the student learns from the teacher through attention. We show the interest of this token-based distillation, especially when using a convnet as a teacher. This leads us to report results competitive with convnets for both Imagenet (where we obtain up to 85.2% accuracy) and when transferring to other tasks. We share our code and models.


Authors (6)

Citations (5,855)


Summary

The paper presents a novel teacher-student distillation strategy using a distillation token to transfer knowledge from convolutional networks to vision transformers.
It demonstrates that DeiT achieves 83.1% top-1 accuracy on ImageNet with an 86M parameter model, reducing training time to under three days on a single machine.
The approach enables efficient transformer deployment in vision tasks and promotes further research into token-based learning mechanisms and hybrid architectures.

Training Data-Efficient Image Transformers and Distillation Through Attention

Overview

The paper makes Vision Transformers (ViTs) substantially more data-efficient, so that they can be trained without large-scale infrastructure. The authors introduce "Data-efficient image Transformers" (DeiT), which reach accuracy competitive with convnets while training on ImageNet only, unlike earlier ViT models that rely on massive private datasets. The central technical addition is a teacher-student training strategy built around a distillation token: the student transformer learns from a teacher, typically a convolutional neural network (CNN), through attention.

Key Contributions

Training Efficiency: DeiT achieves 83.1% top-1 accuracy on ImageNet (single-crop) with an 86M parameter model trained in under three days on a single machine. This is a significant reduction in training resources compared to earlier ViT models.
Distillation Strategy: The authors introduce a token-based distillation strategy specific to transformers. A distillation token is added to the input sequence, and the student transformer learns from the teacher through attention. The benefit is most pronounced when the teacher is a convnet.
Competitive Performance: The distilled DeiT model achieves up to 85.2% top-1 accuracy on ImageNet, making it competitive with state-of-the-art CNNs both on ImageNet and when transferred to other popular tasks.
Open-Source Contribution: The authors provide access to their code and models, facilitating the reproduction of results and further exploration by other researchers.

Technical Insights

Vision Transformers (ViT)

ViTs treat image classification as a sequence modeling problem, much as in natural language processing. An image is divided into fixed-size patches, each patch is embedded as a token, and the resulting sequence is processed by a standard transformer architecture. Despite their strong performance, ViTs typically require very large pre-training datasets (e.g., JFT-300M) to reach their full potential, which makes them expensive to train.
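The patch-splitting step can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' code: the function name `patchify`, the 224x224 input size, and the 16x16 patch size are assumptions chosen for the example, and the learned linear projection that normally follows is omitted.

```python
import numpy as np

def patchify(image: np.ndarray, patch_size: int = 16) -> np.ndarray:
    """Split an (H, W, C) image into a sequence of flattened patches.

    Hypothetical helper for illustration; patch_size=16 is an assumption.
    """
    h, w, c = image.shape
    if h % patch_size or w % patch_size:
        raise ValueError("image height and width must be divisible by patch_size")
    gh, gw = h // patch_size, w // patch_size
    # (gh, p, gw, p, c) -> (gh, gw, p, p, c) -> (gh * gw, p * p * c)
    patches = image.reshape(gh, patch_size, gw, patch_size, c)
    patches = patches.transpose(0, 2, 1, 3, 4)
    return patches.reshape(gh * gw, patch_size * patch_size * c)

image = np.zeros((224, 224, 3), dtype=np.float32)  # assumed input resolution
tokens = patchify(image)
print(tokens.shape)  # (196, 768): 14 x 14 patches, each 16 * 16 * 3 values
```

Each row of `tokens` is one patch in raster order; a transformer would then project every row to its embedding dimension and add position information.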

Distillation Strategy

The distillation token method builds knowledge distillation into the transformer architecture itself. The distillation token is an extra token that interacts with both the class token and the patch tokens inside the transformer's attention layers. At the output of the final layer, it is trained to reproduce the labels predicted by the teacher network, which gives the student a supervision signal from the pre-trained convnet teacher in addition to the ground-truth labels.
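One way to write this objective down is a hard-label distillation loss: the class-token head is trained on the ground-truth labels, and the distillation-token head is trained on the teacher's predicted labels. The sketch below is an assumption-laden illustration in NumPy, not the authors' implementation; the function names and the equal 0.5/0.5 weighting of the two terms are assumptions not stated in this summary.

```python
import numpy as np

def cross_entropy(logits: np.ndarray, targets: np.ndarray) -> float:
    """Mean cross-entropy of integer class targets under softmax(logits)."""
    shifted = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    return float(-log_probs[np.arange(len(targets)), targets].mean())

def hard_distillation_loss(cls_logits, dist_logits, teacher_logits, labels):
    """Hypothetical hard-label distillation objective.

    cls_logits:     student output read from the class token
    dist_logits:    student output read from the distillation token
    teacher_logits: output of the frozen teacher (e.g. a convnet)
    labels:         ground-truth integer labels
    """
    teacher_labels = teacher_logits.argmax(axis=1)  # teacher's hard decision
    loss_cls = cross_entropy(cls_logits, labels)
    loss_dist = cross_entropy(dist_logits, teacher_labels)
    return 0.5 * loss_cls + 0.5 * loss_dist  # equal weighting is an assumption

rng = np.random.default_rng(0)
batch, num_classes = 4, 10
loss = hard_distillation_loss(
    rng.normal(size=(batch, num_classes)),
    rng.normal(size=(batch, num_classes)),
    rng.normal(size=(batch, num_classes)),
    rng.integers(0, num_classes, size=batch),
)
print(loss)
```

Because the two heads read from different tokens, the distillation head can follow the teacher even on examples where the teacher disagrees with the ground-truth label.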

Hyperparameter Optimization and Augmentation

The paper emphasizes rigorous hyperparameter optimization and strong data augmentation techniques, employing methods like Rand-Augment, Mixup, CutMix, and random erasing. These augmentations are crucial for improving the model’s capacity to generalize from limited data.
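As an illustration of one of these augmentations, Mixup blends pairs of images and their labels with a mixing coefficient drawn from a Beta distribution. The sketch below is a minimal NumPy version under stated assumptions: the function name `mixup`, the batch shapes, and the Beta parameter `alpha = 0.8` are illustrative choices, not values taken from this summary.

```python
import numpy as np

def mixup(images, labels_onehot, lam, permutation):
    """Blend each example with a partner chosen by `permutation`.

    lam is the weight of the original example; (1 - lam) goes to the partner.
    """
    mixed_images = lam * images + (1.0 - lam) * images[permutation]
    mixed_labels = lam * labels_onehot + (1.0 - lam) * labels_onehot[permutation]
    return mixed_images, mixed_labels

rng = np.random.default_rng(0)
images = rng.random((4, 8, 8, 3))      # toy batch of 4 small images
labels = np.eye(10)[[1, 3, 5, 7]]      # one-hot labels for 10 classes
alpha = 0.8                            # hypothetical Beta parameter
lam = rng.beta(alpha, alpha)
perm = rng.permutation(len(images))

mixed_images, mixed_labels = mixup(images, labels, lam, perm)
print(mixed_images.shape, mixed_labels.sum(axis=1))  # label rows still sum to 1
```

The soft labels produced this way discourage the model from memorizing individual training images, which matters when the dataset is small relative to the model.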

Experimental Validation

The paper reports extensive experiments to validate the proposed approach. Some key findings include:

Efficiency Gains: DeiT models are trained efficiently on standard hardware setups. For example, the largest DeiT model (DeiT-B) completes training in around 53 hours on a single machine.
Superior Performance with Distillation: The distillation token leads to substantial accuracy gains. Notably, a convnet teacher proves to be more effective than a transformer teacher, possibly due to the inductive biases provided by convolutional layers.
Flexibility in Resolution: The approach supports training at one resolution and fine-tuning at higher resolutions, which can further boost performance.
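Changing the input resolution changes the number of patch tokens, so the position embeddings learned at the training resolution have to be adapted before fine-tuning. One common way to do this is to interpolate them on the 2D patch grid. The sketch below is illustrative only: the 16x16 patch size, the 224 and 384 resolutions, the helper names, and the use of bilinear interpolation are assumptions, not details taken from this summary.

```python
import numpy as np

def num_patches(resolution: int, patch_size: int = 16) -> int:
    """Number of patch tokens for a square image (patch_size is assumed)."""
    return (resolution // patch_size) ** 2

def resize_pos_embed(pos_embed: np.ndarray, new_grid: int) -> np.ndarray:
    """Bilinearly resize (old_grid * old_grid, dim) position embeddings.

    Only patch-token embeddings are handled; embeddings for special tokens
    (class, distillation) would be kept unchanged and are not included here.
    """
    old_grid = int(round(pos_embed.shape[0] ** 0.5))
    dim = pos_embed.shape[1]
    grid = pos_embed.reshape(old_grid, old_grid, dim)
    # Sample positions in the old grid, with corners aligned.
    coords = np.linspace(0.0, old_grid - 1, new_grid)
    lo = np.floor(coords).astype(int)
    hi = np.minimum(lo + 1, old_grid - 1)
    frac = coords - lo
    # Interpolate along rows first, then along columns.
    rows = grid[lo] * (1 - frac)[:, None, None] + grid[hi] * frac[:, None, None]
    out = rows[:, lo] * (1 - frac)[None, :, None] + rows[:, hi] * frac[None, :, None]
    return out.reshape(new_grid * new_grid, dim)

print(num_patches(224), num_patches(384))  # 196 576

rng = np.random.default_rng(0)
pos_224 = rng.normal(size=(num_patches(224), 768))
pos_384 = resize_pos_embed(pos_224, new_grid=384 // 16)
print(pos_384.shape)  # (576, 768)
```

The rest of the transformer is unchanged by the resolution switch, since attention layers accept sequences of any length.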

Implications and Future Perspectives

This work has implications both for the understanding of transformers in computer vision and for their practical use:

Practical Deployment: The reduced need for extensive datasets and computational power makes ViTs more accessible for real-world applications, particularly where resources are limited.
Theoretical Insights: The success of distillation tokens suggests avenues for further research into token-based learning mechanisms and their applications across various domains.
Future Developments: Future research could explore data augmentation strategies tailored to transformers, or hybrid architectures incorporating both convolutional and transformer elements, potentially leading to even more efficient and robust models.

Conclusion

The authors' contributions to the development of data-efficient vision transformers mark a significant advancement in the field. By leveraging innovative training techniques and rigorous experimentation, DeiT models demonstrate impressive performance, efficiency, and practicality. This work is a critical step towards democratizing the use of transformers in vision tasks, presenting substantial opportunities for further research and application.
