DeiT-LT: Distillation Strikes Back for Vision Transformer Training on Long-Tailed Datasets (2404.02900v1)
Abstract: Vision Transformer (ViT) has emerged as a prominent architecture for various computer vision tasks. In ViT, the input image is divided into patch tokens and processed through a stack of self-attention blocks. However, unlike Convolutional Neural Networks (CNN), the simple architecture of ViT has no informative inductive bias (e.g., locality). As a result, ViT requires a large amount of data for pre-training. Various data-efficient approaches (e.g., DeiT) have been proposed to train ViT effectively on balanced datasets, but limited literature discusses the use of ViT for datasets with long-tailed imbalances. In this work, we introduce DeiT-LT to tackle the problem of training ViTs from scratch on long-tailed datasets. In DeiT-LT, we introduce an efficient and effective way of distilling from a CNN via a distillation (DIST) token, by using out-of-distribution images and re-weighting the distillation loss to enhance focus on tail classes. This leads to the learning of local, CNN-like features in early ViT blocks, improving generalization for tail classes. Further, to mitigate overfitting, we propose distilling from a flat CNN teacher, which leads to learning low-rank, generalizable features for DIST tokens across all ViT blocks. With the proposed DeiT-LT scheme, the distillation DIST token becomes an expert on the tail classes, and the classifier CLS token becomes an expert on the head classes. These experts help to effectively learn features corresponding to both the majority and minority classes using distinct sets of tokens within the same ViT architecture. We show the effectiveness of DeiT-LT for training ViT from scratch on datasets ranging from small-scale CIFAR-10 LT to large-scale iNaturalist-2018.
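The abstract describes re-weighting the distillation loss so that the DIST token focuses on tail classes. The sketch below illustrates the general idea in plain Python, under stated assumptions: effective-number class weights (Cui et al., 2019) stand in for whatever re-weighting scheme DeiT-LT actually uses, and the DeiT-style "hard" distillation target (cross-entropy against the teacher's argmax label) is assumed; both function names and the `beta` parameter are illustrative, not from the paper.

```python
import math

def tail_weights(class_counts, beta=0.9999):
    # Effective-number re-weighting (Cui et al., 2019), used here as a
    # stand-in: rarer classes get larger weights. DeiT-LT's exact
    # re-weighting scheme may differ.
    eff = [(1.0 - beta ** n) / (1.0 - beta) for n in class_counts]
    w = [1.0 / e for e in eff]
    s = sum(w)
    return [wi * len(w) / s for wi in w]  # normalize to mean 1

def softmax(logits):
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    t = sum(exps)
    return [e / t for e in exps]

def reweighted_hard_distill_loss(student_logits, teacher_logits, weights):
    # DeiT-style hard distillation: the DIST token is supervised with
    # cross-entropy against the CNN teacher's argmax (pseudo) label.
    # Each sample's term is scaled by the weight of that pseudo-label's
    # class, so tail classes contribute more to the gradient.
    total = 0.0
    for s_log, t_log in zip(student_logits, teacher_logits):
        t_label = max(range(len(t_log)), key=t_log.__getitem__)
        p = softmax(s_log)
        total += -weights[t_label] * math.log(p[t_label])
    return total / len(student_logits)
```

For a head class with 1000 samples and a tail class with 10, `tail_weights([1000, 10])` assigns the tail class a much larger weight, so distillation terms whose teacher pseudo-label falls in the tail dominate the averaged loss.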