
DeiT-LT Distillation Strikes Back for Vision Transformer Training on Long-Tailed Datasets

(2404.02900)
Published Apr 3, 2024 in cs.CV, cs.AI, and cs.LG

Abstract

Vision Transformer (ViT) has emerged as a prominent architecture for various computer vision tasks. In ViT, we divide the input image into patch tokens and process them through a stack of self-attention blocks. However, unlike Convolutional Neural Networks (CNN), ViT's simple architecture has no informative inductive bias (e.g., locality, etc.). Due to this, ViT requires a large amount of data for pre-training. Various data-efficient approaches (DeiT) have been proposed to train ViT effectively on balanced datasets. However, limited literature discusses the use of ViT for datasets with long-tailed imbalances. In this work, we introduce DeiT-LT to tackle the problem of training ViTs from scratch on long-tailed datasets. In DeiT-LT, we introduce an efficient and effective way of distillation from CNN via a distillation DIST token, using out-of-distribution images and re-weighting the distillation loss to enhance focus on tail classes. This leads to the learning of local CNN-like features in early ViT blocks, improving generalization for tail classes. Further, to mitigate overfitting, we propose distilling from a flat CNN teacher, which leads to learning low-rank generalizable features for DIST tokens across all ViT blocks. With the proposed DeiT-LT scheme, the distillation DIST token becomes an expert on the tail classes, and the classifier CLS token becomes an expert on the head classes. The experts help to effectively learn features corresponding to both the majority and minority classes using a distinct set of tokens within the same ViT architecture. We show the effectiveness of DeiT-LT for training ViT from scratch on datasets ranging from small-scale CIFAR-10 LT to large-scale iNaturalist-2018.

Figure: DeiT-LT overview. The Head Expert (CLS token) is trained with CE loss; the Tail Expert (DIST token) is trained with a DRW distillation loss from the ResNet teacher.

Overview

  • Introduces DeiT-LT, an innovative training framework for Vision Transformers (ViTs) on long-tailed datasets.

  • Features a novel distillation technique and re-weighting of the distillation loss to improve ViT performance on underrepresented classes.

  • Highlights the dual expertise within ViT: distinct tokens specialize in head and tail classes, balancing class representation.

  • Demonstrates significant performance improvements on various datasets, particularly on those with severe class imbalances.

DeiT-LT: Enhancing Vision Transformer Training on Long-Tailed Datasets with Distillation

Introduction

"DeiT-LT: Distillation Strikes Back for Vision Transformer Training on Long-Tailed Datasets" presents an innovative approach to training Vision Transformers (ViTs) effectively on long-tailed datasets by leveraging distillation techniques. The premise of the work is rooted in the challenge posed by long-tailed distributions prevalent in real-world datasets, where a small number of classes (the "head") possess a large number of examples, while a larger number of classes (the "tail") have relatively few examples. The authors propose DeiT-LT, a novel training framework designed to enhance the performance of ViTs on such imbalanced datasets without necessitating large-scale pre-training.

Key Contributions

  1. Distillation DIST Token: DeiT-LT introduces an efficient method of knowledge distillation via a dedicated DIST token trained on out-of-distribution images, significantly improving ViT performance on tail classes by encouraging the model to learn local, CNN-like features in the early blocks.
  2. Re-Weighting the Distillation Loss: DeiT-LT re-weights the distillation loss to focus on tail classes, which is essential for mitigating the imbalance challenge (see the loss sketch after this list).
  3. Dual Expertise within ViT: The distilled ViT embodies dual expertise, where the classifier (CLS) token becomes proficient on head classes and the distillation (DIST) token excels on tail classes. This dual expertise is crucial for addressing the disparity in class representation within the training data.
  4. Generalization through Low-Rank Features: DeiT-LT further distills from flat CNN teachers trained via Sharpness-Aware Minimization (SAM) to promote the learning of low-rank, and hence more generalizable, features across all ViT blocks (a minimal SAM step is sketched below).
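
To make the training objective concrete, the following is a minimal PyTorch-style sketch of a two-token loss in the spirit of DeiT-LT: cross-entropy on the CLS token plus a hard, per-class re-weighted distillation loss on the DIST token against a CNN teacher's predictions. The function name `deit_lt_loss`, the equal 0.5/0.5 weighting, and the `class_weights` argument are illustrative assumptions, not the authors' exact implementation.

```python
# Illustrative sketch (not the authors' code) of a DeiT-LT-style two-token objective.
# Assumptions: the student ViT exposes separate CLS and DIST heads, the teacher is a
# pre-trained ResNet providing logits, and `class_weights` follows a DRW-style schedule.
import torch
import torch.nn.functional as F

def deit_lt_loss(cls_logits, dist_logits, teacher_logits, targets, class_weights=None):
    """CE on the CLS token plus re-weighted hard distillation on the DIST token."""
    # Head expert: plain cross-entropy against the ground-truth labels.
    loss_cls = F.cross_entropy(cls_logits, targets)

    # Tail expert: hard-label distillation from the CNN teacher's predictions,
    # with per-class weights so that tail classes contribute more (DRW-style).
    teacher_labels = teacher_logits.argmax(dim=-1)
    loss_dist = F.cross_entropy(dist_logits, teacher_labels, weight=class_weights)

    # Equal weighting of the two terms is an assumption made for this sketch.
    return 0.5 * loss_cls + 0.5 * loss_dist

# Toy usage: batch of 8 samples, 10 classes.
B, C = 8, 10
loss = deit_lt_loss(
    cls_logits=torch.randn(B, C),
    dist_logits=torch.randn(B, C),
    teacher_logits=torch.randn(B, C),
    targets=torch.randint(0, C, (B,)),
    class_weights=torch.ones(C),  # e.g., inverse effective frequency after DRW warm-up
)
print(loss.item())
```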

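Contribution 4 relies on a flat (SAM-trained) CNN teacher. Below is a minimal sketch of one generic Sharpness-Aware Minimization update, assuming the standard two-pass formulation (an ascent step to perturbed weights within radius `rho`, followed by a descent step applied to the original weights); the paper's actual teacher training recipe and hyperparameters may differ.

```python
# Minimal sketch of one Sharpness-Aware Minimization (SAM) update for the CNN teacher.
# This assumes the standard two-pass SAM formulation; the exact teacher training
# recipe, optimizer, and hyperparameters used in the paper may differ.
import torch

def sam_step(model, loss_fn, x, y, base_optimizer, rho=0.05):
    # First pass: gradients at the current weights.
    loss_fn(model(x), y).backward()

    # Ascent step: perturb each parameter toward higher loss, within radius rho.
    grads = [p.grad for p in model.parameters() if p.grad is not None]
    grad_norm = torch.norm(torch.stack([g.norm(p=2) for g in grads]))
    perturbations = {}
    with torch.no_grad():
        for p in model.parameters():
            if p.grad is None:
                continue
            e = rho * p.grad / (grad_norm + 1e-12)
            p.add_(e)
            perturbations[p] = e
    model.zero_grad()

    # Second pass: gradients at the perturbed weights drive the actual update.
    loss_fn(model(x), y).backward()
    with torch.no_grad():
        for p, e in perturbations.items():
            p.sub_(e)  # restore the original weights before stepping
    base_optimizer.step()
    base_optimizer.zero_grad()
```

Descending on the gradient computed at the perturbed weights biases the teacher toward flatter minima, which is what DeiT-LT leverages to obtain low-rank, more generalizable features for distillation.
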
Numerical Results

The effectiveness of DeiT-LT is underscored by its performance across a range of benchmarks, from small-scale (e.g., CIFAR-10 LT and CIFAR-100 LT) to large-scale (e.g., ImageNet-LT and iNaturalist-2018). A notable result is the marked improvement on CIFAR-100 LT, showcasing DeiT-LT's ability to learn effectively under severe class imbalance.

Implications and Future Work

  • Practical Implications: The DeiT-LT framework holds considerable promise for practical applications, especially in domains where data imbalance is a perennial challenge. It mitigates the need for large-scale pre-training, making it a cost-effective solution for deploying ViTs in specialized areas such as medical imaging and satellite imagery analysis.
  • Theoretical Implications: From a theoretical viewpoint, this paper prompts further inquiry into the mechanisms through which distillation and feature re-weighting influence the learning dynamics of transformers vis-à-vis CNNs, particularly in the context of imbalanced datasets.
  • Speculation on Future Developments: The study points to potential refinements in distillation techniques and their integration within transformer architectures. Future work could explore the synergy between different architectural modifications and distillation strategies to further close the performance gap between head and tail classes.

In conclusion, "DeiT-LT: Distillation Strikes Back for Vision Transformer Training on Long-Tailed Datasets" presents a compelling methodology for enhancing the performance of Vision Transformers on imbalanced datasets. By introducing a novel distillation scheme and re-weighting the distillation loss, the authors set a precedent for training ViTs more effectively and efficiently in the face of long-tailed distributions.
