- The paper introduces DINO, a self-distillation framework that trains Vision Transformers to learn visual representations without explicit labels.
- The paper reveals that self-supervised ViTs generate attention maps that clearly capture scene layouts and object boundaries, aiding unsupervised segmentation.
- The paper outlines critical implementation details like EMA updates, multi-crop strategies, and temperature tuning to achieve state-of-the-art performance.
This paper explores the application of self-supervised learning (SSL) to Vision Transformer (ViT) architectures and identifies several interesting properties that emerge without any explicit supervision. The primary framework used is DINO, which stands for self-distillation with no labels. DINO frames self-supervised learning as a knowledge distillation process where a student network learns to match the output of a teacher network for different augmented views of the same image. Crucially, the teacher network is an exponential moving average (EMA) of the student network's weights, providing a stable, evolving target.
DINO Framework Implementation
- Architecture: Both the student (g_θs) and teacher (g_θt) networks share the same architecture (typically a ViT like DeiT or the original ViT, potentially with modifications like using a register token) but have different parameters. The teacher's parameters θ_t are updated using an EMA of the student's parameters θ_s:
θ_t ← λ θ_t + (1 − λ) θ_s
where λ follows a cosine schedule from 0.996 to 1 during training.
- Input: The core idea is to feed different distorted views (crops) of an image to the student and teacher. A set of "local" views (smaller crops, typically 96×96) is passed through the student only, while the "global" views (larger crops, typically 224×224) are passed through both the student and the teacher. This multi-crop strategy is crucial for performance (a training-step sketch after this list illustrates it).
- Output Processing: Both networks map each view to a K-dimensional output using a projection head (a 3-layer MLP with hidden dimension 2048, followed by ℓ2 normalization and a weight-normalized linear layer with K outputs; K = 65,536 by default). The output is converted into a probability distribution over the K dimensions ("prototypes") via a softmax with a temperature parameter τ.
P_s(x) = softmax(g_θs(x) / τ_s)
P_t(x) = softmax((g_θt(x) − c) / τ_t)
- Loss Function: The objective is to minimize the cross-entropy between the teacher's output distribution for one view x_1 and the student's output distribution for a different view x_2 of the same image. The loss is summed over all pairs in which the teacher sees a global view and the student sees a different view (global or local), so the student must capture information that is consistent across views, as interpreted by the teacher.
L = − Σ_{x ∈ {x_1^g, x_2^g}} Σ_{x′ ∈ V, x′ ≠ x} P_t(x) log P_s(x′)
where x_1^g and x_2^g are the two global views and V is the set of all views (global and local) of the image. Note that gradients are only backpropagated through the student network (g_θs).
- Avoiding Collapse: Two techniques prevent the model from collapsing to trivial solutions (e.g., outputting a uniform distribution or predicting a single dimension):
- Centering: The teacher's outputs are centered by subtracting a center c, g_θt(x) ← g_θt(x) − c, where c is an exponential moving average of the teacher's batch-mean outputs.
- Sharpening: A lower temperature τ_t is used for the teacher's softmax than for the student's τ_s. This sharpens the teacher's output distribution and keeps it from collapsing to the uniform distribution.
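Putting the pieces above together, the sketch below illustrates the multi-crop augmentation and one DINO training step. It is an illustration of the recipe rather than the official implementation: it assumes `student` and `teacher` are identically structured backbone + projection-head modules producing K-dimensional logits, that the first two entries of `views` are the batched global crops, and it uses fixed, representative constants (τ_s = 0.1, τ_t = 0.04, crop scale ranges, EMA and center momenta) where the released code applies schedules and additional color, blur, and solarization augmentations.

```python
import torch
import torch.nn.functional as F
import torchvision.transforms as T

# Multi-crop augmentation: 2 global 224x224 crops + several local 96x96 crops per image.
# Scale ranges are representative; color/blur augmentations are omitted for brevity.
global_crop = T.Compose([T.RandomResizedCrop(224, scale=(0.4, 1.0)),
                         T.RandomHorizontalFlip(), T.ToTensor()])
local_crop = T.Compose([T.RandomResizedCrop(96, scale=(0.05, 0.4)),
                        T.RandomHorizontalFlip(), T.ToTensor()])

def multi_crop(image, num_local=8):
    """Return [global_1, global_2, local_1, ..., local_n] views for one PIL image."""
    return ([global_crop(image), global_crop(image)]
            + [local_crop(image) for _ in range(num_local)])

def dino_step(student, teacher, views, center, optimizer,
              tau_s=0.1, tau_t=0.04, ema_momentum=0.996, center_momentum=0.9):
    """One DINO update. `views` is a list of batched crops; views[0] and views[1]
    are the global crops, the rest are local crops."""
    # The student sees every view; the teacher sees only the two global views.
    student_out = [student(v) for v in views]                 # list of [B, K] logits
    with torch.no_grad():
        teacher_out = [teacher(v) for v in views[:2]]

    # Student: softmax with temperature tau_s. Teacher: centered and sharpened (tau_t < tau_s).
    student_logp = [F.log_softmax(s / tau_s, dim=-1) for s in student_out]
    teacher_p = [F.softmax((t - center) / tau_t, dim=-1) for t in teacher_out]

    # Cross-entropy over all (teacher global view, student view) pairs of different views.
    loss, n_terms = 0.0, 0
    for t_idx, t_probs in enumerate(teacher_p):
        for s_idx, s_logp in enumerate(student_logp):
            if s_idx == t_idx:                                # skip identical global views
                continue
            loss = loss + (-(t_probs * s_logp).sum(dim=-1)).mean()
            n_terms += 1
    loss = loss / n_terms

    optimizer.zero_grad()
    loss.backward()                                           # gradients only reach the student
    optimizer.step()

    with torch.no_grad():
        # EMA update of the teacher: theta_t <- lambda * theta_t + (1 - lambda) * theta_s
        for param_s, param_t in zip(student.parameters(), teacher.parameters()):
            param_t.mul_(ema_momentum).add_((1.0 - ema_momentum) * param_s)
        # EMA update of the center from the teacher's batch-mean output
        batch_mean = torch.cat(teacher_out, dim=0).mean(dim=0, keepdim=True)
        center = center * center_momentum + batch_mean * (1.0 - center_momentum)

    return loss.item(), center
```

Feeding 96×96 local crops and 224×224 global crops through the same ViT relies on interpolating the positional embeddings to each crop's patch grid, which the released DINO models handle internally.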
Emerging Properties and Applications
The key finding is that self-supervised ViTs trained with DINO exhibit remarkable properties, particularly in their self-attention mechanisms:
- Scene Layout and Object Boundaries: The self-attention maps from the `[CLS]` token of the final layer explicitly capture the scene layout and segment objects within the image, even distinguishing between multiple instances of the same class. This happens without any explicit segmentation labels during training.
- Implementation: This can be visualized by extracting the attention weights from the last layer's `[CLS]` token to all patch tokens. Reshaping these weights back to the image's patch grid (and upsampling to the original spatial dimensions) reveals object masks.
- Application: This suggests potential for unsupervised or weakly-supervised object discovery, segmentation, and salient object detection. One could use these attention maps directly or as initial proposals for more complex segmentation pipelines (see the thresholding sketch after the code below).
```python
# Extracting [CLS] attention maps from a DINO-trained ViT
import torch
import torch.nn.functional as F

# Assume 'model' is a ViT from the official DINO repo (which exposes
# get_last_selfattention) and 'img_tensor' is a preprocessed input of
# shape [1, 3, H, W] with H and W divisible by the patch size.
model.eval()
with torch.no_grad():
    # Attention of the last block: [batch, num_heads, num_patches + 1, num_patches + 1]
    attentions = model.get_last_selfattention(img_tensor)

# Keep the attention from the [CLS] token (index 0) to all patch tokens,
# averaged over heads -> shape [num_patches]
cls_attentions = attentions[0, :, 0, 1:].mean(dim=0)

# Reshape to the patch grid (assuming square patches)
patch_size = model.patch_embed.patch_size
w_featmap = img_tensor.shape[-2] // patch_size
h_featmap = img_tensor.shape[-1] // patch_size
attention_map = cls_attentions.reshape(w_featmap, h_featmap)

# Upsample back to the input resolution for visualization
attention_map_resized = F.interpolate(
    attention_map[None, None], size=img_tensor.shape[-2:], mode="nearest"
)[0, 0].cpu().numpy()
# attention_map_resized can now be overlaid on the original image
```
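Building on the attention map extracted above, a rough unsupervised object mask can be obtained by keeping the smallest set of patches that accounts for most of the `[CLS]` attention mass. The helper below, including its 60% threshold, is an illustrative sketch in the spirit of the paper's mask visualizations rather than the exact procedure.

```python
import torch

def attention_to_mask(cls_attentions, keep_mass=0.6):
    """Binary patch mask keeping the patches that hold `keep_mass` of the total
    [CLS] attention. `cls_attentions` is the 1-D tensor of length num_patches
    computed above (before reshaping)."""
    probs = cls_attentions / cls_attentions.sum()
    sorted_probs, order = torch.sort(probs, descending=True)
    cumulative = torch.cumsum(sorted_probs, dim=0)
    keep = cumulative <= keep_mass
    keep[0] = True                                   # always keep the strongest patch
    mask = torch.zeros_like(probs, dtype=torch.bool)
    mask[order[keep]] = True
    return mask
```

The returned mask can be reshaped to `(w_featmap, h_featmap)` and upsampled exactly like `attention_map` above to serve as a coarse segmentation proposal.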
- High-Quality Features: The features learned by DINO ViTs perform exceptionally well on downstream tasks with simple classifiers such as k-Nearest Neighbors (k-NN). On ImageNet, DINO ViT features reach 78.3% top-1 accuracy with just a k-NN classifier, outperforming previous SSL methods and even supervised baselines in some settings.
- Implementation: Extract the `[CLS]` token embedding from the trained DINO ViT backbone for each image in the dataset, then classify with a standard k-NN vote over these embeddings (a library such as FAISS helps with efficient nearest-neighbor search at scale); a minimal sketch follows this list.
- Application: This makes DINO ViTs excellent off-the-shelf feature extractors for tasks where labeled data is scarce but a large unlabeled dataset is available for pre-training. They are suitable for image retrieval, clustering, and few-shot learning.
- Performance Comparison:
- DINO ViTs significantly outperform SSL methods applied to convolutional networks (CNNs) like ResNets.
- They also surpass supervised ViTs when evaluated using k-NN or linear probing, suggesting SSL is particularly well-suited for the ViT architecture.
- The performance gap widens with smaller ViT models (e.g., DeiT-S), indicating DINO is effective even with less capacity.
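A minimal sketch of this k-NN evaluation is shown below. It assumes the backbone's forward pass returns the `[CLS]` embedding (as the official DINO models do) and uses a plain majority vote for simplicity; the paper's protocol uses a weighted vote over the nearest neighbors (k = 20 works well), and at ImageNet scale the similarity matrix should be computed in chunks or with FAISS.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def extract_cls_features(model, loader, device="cuda"):
    """Collect L2-normalized [CLS] embeddings and labels for every image in a DataLoader."""
    model.eval()
    feats, labels = [], []
    for images, targets in loader:
        out = model(images.to(device))        # assumed to return the [CLS] embedding
        feats.append(F.normalize(out, dim=-1).cpu())
        labels.append(targets)
    return torch.cat(feats), torch.cat(labels)

def knn_predict(train_feats, train_labels, test_feats, k=20):
    """Cosine-similarity k-NN classifier with a simple majority vote."""
    sims = test_feats @ train_feats.T          # [N_test, N_train] cosine similarities
    _, nn_idx = sims.topk(k, dim=-1)           # indices of the k nearest training samples
    nn_labels = train_labels[nn_idx]           # [N_test, k]
    return torch.mode(nn_labels, dim=-1).values
```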
Implementation Considerations
- Compute Requirements: Training DINO ViTs requires significant computational resources, typically multiple GPUs (e.g., 8-16 V100/A100 GPUs) for several days, especially on large datasets like ImageNet. Mixed-precision training and large batch sizes (e.g., 1024) are often necessary.
- Hyperparameters: Careful tuning of the EMA decay (λ), the temperature parameters (τ_s, τ_t), and the optimizer settings (AdamW with weight decay and a learning-rate schedule) is important. The multi-crop augmentation strategy (number and size of crops) also affects results; a representative configuration is sketched after this list.
- Code Availability: The authors released code based on PyTorch, facilitating reproduction and application.
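For orientation, the sketch below lists one plausible training configuration in the ballpark of the released code's defaults; every value here is an assumption to verify against the official repository rather than ground truth.

```python
# Representative DINO hyperparameters (approximate; verify against the official repo)
config = dict(
    arch="vit_small",
    patch_size=16,
    out_dim=65536,                     # number of output dimensions K ("prototypes")
    batch_size=1024,
    epochs=100,
    optimizer="adamw",
    base_lr=0.0005 * 1024 / 256,       # linear scaling with batch size, cosine decay
    weight_decay=0.04,                 # increased toward ~0.4 over training
    warmup_epochs=10,
    student_temp=0.1,                  # tau_s
    teacher_temp=0.04,                 # tau_t (optionally warmed up toward ~0.07)
    momentum_teacher=0.996,            # EMA decay lambda, cosine schedule to 1.0
    center_momentum=0.9,
    global_crops_scale=(0.4, 1.0),     # two 224x224 global crops
    local_crops_scale=(0.05, 0.4),     # 96x96 local crops
    local_crops_number=8,
)
```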
In summary, the paper demonstrates that self-supervised pre-training with the DINO self-distillation recipe enables Vision Transformers to learn powerful visual representations. These representations exhibit emergent object-segmentation structure in their attention maps and achieve state-of-the-art performance on a range of downstream tasks, even under simple evaluation protocols such as k-NN classification, which makes them attractive for practical applications with limited supervision.