GTA: Guided Transfer of Spatial Attention from Object-Centric Representations (2401.02656v1)
Abstract: Utilizing well-trained representations in transfer learning often yields superior performance and faster convergence compared to training from scratch. However, even when such good representations are transferred, a model can easily overfit a limited training dataset and lose the valuable properties of the transferred representations. This phenomenon is more severe in ViT due to its low inductive bias. Through experimental analysis using attention maps in ViT, we observe that these rich representations deteriorate when the model is trained on a small dataset. Motivated by this finding, we propose a novel and simple regularization method for ViT called Guided Transfer of spatial Attention (GTA). Our proposed method regularizes the self-attention maps between the source and target models. Through this explicit regularization, the target model can fully exploit the knowledge related to the object-localization properties of the source model. Our experimental results show that the proposed GTA consistently improves accuracy across five benchmark datasets, especially when the amount of training data is small.
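The abstract describes GTA as an explicit regularizer that keeps the fine-tuned (target) model's self-attention maps close to those of the frozen pre-trained (source) model. Below is a minimal sketch of that idea in PyTorch; the function names (`gta_loss`, `training_step`), the `return_attention=True` interface, the choice of MSE as the distance, and the weight `lambda_gta` are illustrative assumptions, not the paper's exact formulation.

```python
import torch
import torch.nn.functional as F

def gta_loss(target_attn, source_attn):
    """Illustrative attention-guidance term: penalize the discrepancy between
    the target model's self-attention maps and the frozen source model's maps.

    target_attn / source_attn: lists of per-layer attention tensors,
    each shaped [batch, heads, tokens, tokens].
    """
    loss = 0.0
    for t_attn, s_attn in zip(target_attn, source_attn):
        # Source maps are treated as fixed guidance, so gradients are blocked.
        loss = loss + F.mse_loss(t_attn, s_attn.detach())
    return loss / len(target_attn)

def training_step(target_model, source_model, images, labels, lambda_gta=0.5):
    """Hypothetical fine-tuning step: task loss plus weighted attention guidance."""
    logits, target_attn = target_model(images, return_attention=True)   # assumed API
    with torch.no_grad():
        _, source_attn = source_model(images, return_attention=True)    # frozen source
    task_loss = F.cross_entropy(logits, labels)
    return task_loss + lambda_gta * gta_loss(target_attn, source_attn)
```

In this sketch, the only change to a standard fine-tuning loop is the extra guidance term; the source model performs a forward pass but is never updated.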