- The paper introduces DINO, a self-distillation framework that trains Vision Transformers to learn visual representations without explicit labels.
- The paper reveals that self-supervised ViTs generate attention maps that clearly capture scene layouts and object boundaries, aiding unsupervised segmentation.
- The paper outlines critical implementation details like EMA updates, multi-crop strategies, and temperature tuning to achieve state-of-the-art performance.
This paper explores the application of self-supervised learning (SSL) to Vision Transformer (ViT) architectures and identifies several interesting properties that emerge without any explicit supervision. The primary framework used is DINO, which stands for self-distillation with no labels. DINO frames self-supervised learning as a knowledge distillation process where a student network learns to match the output of a teacher network for different augmented views of the same image. Crucially, the teacher network is an exponential moving average (EMA) of the student network's weights, providing a stable, evolving target.
DINO Framework Implementation
- Architecture: Both the student (g_θs) and teacher (g_θt) networks share the same architecture (typically a ViT like DeiT or the original ViT, potentially with modifications like using a register token) but have different parameters. The teacher's parameters θ_t are updated using an EMA of the student's parameters θ_s:
θ_t ← λ θ_t + (1 − λ) θ_s
where λ follows a cosine schedule from 0.996 to 1 during training.
- Input: The core idea is to feed different distorted views (crops) of an image to the student and teacher. A set of "local" views (smaller crops, typically 96×96) is passed through the student only, while the "global" views (larger crops, typically 224×224) are passed through both the student and the teacher. This multi-crop strategy is crucial for performance (a training-step sketch after this list illustrates it).
- Output Processing: Both networks map each view to a K-dimensional output using a projection head (a 3-layer MLP with hidden dimension 2048, followed by ℓ2 normalization and a weight-normalized linear layer with K outputs; K = 65,536 by default). The output is converted into a probability distribution over the K dimensions ("prototypes") via a softmax with a temperature parameter τ.
P_s(x) = softmax(g_θs(x) / τ_s)
P_t(x) = softmax((g_θt(x) − c) / τ_t)
- Loss Function: The objective is to minimize the cross-entropy between the teacher's output distribution for one view x_1 and the student's output distribution for a different view x_2 of the same image. The loss is summed over all pairs in which the teacher sees a global view and the student sees a different view (global or local), so the student must capture information that is consistent across views, as interpreted by the teacher.
L = − Σ_{x ∈ {x_1^g, x_2^g}} Σ_{x′ ∈ V, x′ ≠ x} P_t(x) log P_s(x′)
where x_1^g and x_2^g are the two global views and V is the set of all views (global and local) of the image. Note that gradients are only backpropagated through the student network (g_θs).
- Avoiding Collapse: Two techniques prevent the model from collapsing to trivial solutions (e.g., outputting a uniform distribution or predicting a single dimension):
- Centering: The teacher's outputs are centered by subtracting a center c, g_θt(x) ← g_θt(x) − c, where c is an exponential moving average of the teacher's batch-mean outputs.
- Sharpening: A lower temperature τ_t is used for the teacher's softmax than for the student's τ_s. This sharpens the teacher's output distribution and keeps it from collapsing to the uniform distribution.
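Putting the pieces above together, the sketch below illustrates the multi-crop augmentation and one DINO training step. It is an illustration of the recipe rather than the official implementation: it assumes `student` and `teacher` are identically structured backbone + projection-head modules producing K-dimensional logits, that the first two entries of `views` are the batched global crops, and it uses fixed, representative constants (τ_s = 0.1, τ_t = 0.04, crop scale ranges, EMA and center momenta) where the released code applies schedules and additional color, blur, and solarization augmentations.

```python
import torch
import torch.nn.functional as F
import torchvision.transforms as T

# Multi-crop augmentation: 2 global 224x224 crops + several local 96x96 crops per image.
# Scale ranges are representative; color/blur augmentations are omitted for brevity.
global_crop = T.Compose([T.RandomResizedCrop(224, scale=(0.4, 1.0)),
                         T.RandomHorizontalFlip(), T.ToTensor()])
local_crop = T.Compose([T.RandomResizedCrop(96, scale=(0.05, 0.4)),
                        T.RandomHorizontalFlip(), T.ToTensor()])

def multi_crop(image, num_local=8):
    """Return [global_1, global_2, local_1, ..., local_n] views for one PIL image."""
    return ([global_crop(image), global_crop(image)]
            + [local_crop(image) for _ in range(num_local)])

def dino_step(student, teacher, views, center, optimizer,
              tau_s=0.1, tau_t=0.04, ema_momentum=0.996, center_momentum=0.9):
    """One DINO update. `views` is a list of batched crops; views[0] and views[1]
    are the global crops, the rest are local crops."""
    # The student sees every view; the teacher sees only the two global views.
    student_out = [student(v) for v in views]                 # list of [B, K] logits
    with torch.no_grad():
        teacher_out = [teacher(v) for v in views[:2]]

    # Student: softmax with temperature tau_s. Teacher: centered and sharpened (tau_t < tau_s).
    student_logp = [F.log_softmax(s / tau_s, dim=-1) for s in student_out]
    teacher_p = [F.softmax((t - center) / tau_t, dim=-1) for t in teacher_out]

    # Cross-entropy over all (teacher global view, student view) pairs of different views.
    loss, n_terms = 0.0, 0
    for t_idx, t_probs in enumerate(teacher_p):
        for s_idx, s_logp in enumerate(student_logp):
            if s_idx == t_idx:                                # skip identical global views
                continue
            loss = loss + (-(t_probs * s_logp).sum(dim=-1)).mean()
            n_terms += 1
    loss = loss / n_terms

    optimizer.zero_grad()
    loss.backward()                                           # gradients only reach the student
    optimizer.step()

    with torch.no_grad():
        # EMA update of the teacher: theta_t <- lambda * theta_t + (1 - lambda) * theta_s
        for param_s, param_t in zip(student.parameters(), teacher.parameters()):
            param_t.mul_(ema_momentum).add_((1.0 - ema_momentum) * param_s)
        # EMA update of the center from the teacher's batch-mean output
        batch_mean = torch.cat(teacher_out, dim=0).mean(dim=0, keepdim=True)
        center = center * center_momentum + batch_mean * (1.0 - center_momentum)

    return loss.item(), center
```

Feeding 96×96 local crops and 224×224 global crops through the same ViT relies on interpolating the positional embeddings to each crop's patch grid, which the released DINO models handle internally.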
Emerging Properties and Applications
The key finding is that self-supervised ViTs trained with DINO exhibit remarkable properties, particularly in their self-attention mechanisms:
- Scene Layout and Object Boundaries: The self-attention maps from the `[CLS]` token of the final layer explicitly capture the scene layout and segment objects within the image, even distinguishing between multiple instances of the same class. This happens without any explicit segmentation labels during training.
- Implementation: This can be visualized by extracting the attention weights from the last layer's `[CLS]` token to all patch tokens. Reshaping these weights back to the image's patch grid (and upsampling to the original spatial dimensions) reveals object masks.
- Application: This suggests potential for unsupervised or weakly-supervised object discovery, segmentation, and salient object detection. One could use these attention maps directly or as initial proposals for more complex segmentation pipelines (see the thresholding sketch after the code below).
```python
# Extracting [CLS] attention maps from a DINO-trained ViT
import torch
import torch.nn.functional as F

# Assume 'model' is a ViT from the official DINO repo (which exposes
# get_last_selfattention) and 'img_tensor' is a preprocessed input of
# shape [1, 3, H, W] with H and W divisible by the patch size.
model.eval()
with torch.no_grad():
    # Attention of the last block: [batch, num_heads, num_patches + 1, num_patches + 1]
    attentions = model.get_last_selfattention(img_tensor)

# Keep the attention from the [CLS] token (index 0) to all patch tokens,
# averaged over heads -> shape [num_patches]
cls_attentions = attentions[0, :, 0, 1:].mean(dim=0)

# Reshape to the patch grid (assuming square patches)
patch_size = model.patch_embed.patch_size
w_featmap = img_tensor.shape[-2] // patch_size
h_featmap = img_tensor.shape[-1] // patch_size
attention_map = cls_attentions.reshape(w_featmap, h_featmap)

# Upsample back to the input resolution for visualization
attention_map_resized = F.interpolate(
    attention_map[None, None], size=img_tensor.shape[-2:], mode="nearest"
)[0, 0].cpu().numpy()
# attention_map_resized can now be overlaid on the original image
```
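Building on the attention map extracted above, a rough unsupervised object mask can be obtained by keeping the smallest set of patches that accounts for most of the `[CLS]` attention mass. The helper below, including its 60% threshold, is an illustrative sketch in the spirit of the paper's mask visualizations rather than the exact procedure.

```python
import torch

def attention_to_mask(cls_attentions, keep_mass=0.6):
    """Binary patch mask keeping the patches that hold `keep_mass` of the total
    [CLS] attention. `cls_attentions` is the 1-D tensor of length num_patches
    computed above (before reshaping)."""
    probs = cls_attentions / cls_attentions.sum()
    sorted_probs, order = torch.sort(probs, descending=True)
    cumulative = torch.cumsum(sorted_probs, dim=0)
    keep = cumulative <= keep_mass
    keep[0] = True                                   # always keep the strongest patch
    mask = torch.zeros_like(probs, dtype=torch.bool)
    mask[order[keep]] = True
    return mask
```

The returned mask can be reshaped to `(w_featmap, h_featmap)` and upsampled exactly like `attention_map` above to serve as a coarse segmentation proposal.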
- High-Quality Features: The features learned by DINO ViTs perform exceptionally well on downstream tasks with simple classifiers such as k-Nearest Neighbors (k-NN). On ImageNet, DINO ViT features reach 78.3% top-1 accuracy with just a k-NN classifier, outperforming previous SSL methods and even supervised baselines in some settings.
- Implementation: Extract the `[CLS]` token embedding from the trained DINO ViT backbone for each image in the dataset, then classify with a standard k-NN vote over these embeddings (a library such as FAISS helps with efficient nearest-neighbor search at scale); a minimal sketch follows this list.
- Application: This makes DINO ViTs excellent off-the-shelf feature extractors for tasks where labeled data is scarce but a large unlabeled dataset is available for pre-training. They are suitable for image retrieval, clustering, and few-shot learning.
- Performance Comparison:
- DINO ViTs significantly outperform SSL methods applied to convolutional networks (CNNs) like ResNets.
- They also surpass supervised ViTs when evaluated using k-NN or linear probing, suggesting SSL is particularly well-suited for the ViT architecture.
- The performance gap widens with smaller ViT models (e.g., DeiT-S), indicating DINO is effective even with less capacity.
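A minimal sketch of this k-NN evaluation is shown below. It assumes the backbone's forward pass returns the `[CLS]` embedding (as the official DINO models do) and uses a plain majority vote for simplicity; the paper's protocol uses a weighted vote over the nearest neighbors (k = 20 works well), and at ImageNet scale the similarity matrix should be computed in chunks or with FAISS.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def extract_cls_features(model, loader, device="cuda"):
    """Collect L2-normalized [CLS] embeddings and labels for every image in a DataLoader."""
    model.eval()
    feats, labels = [], []
    for images, targets in loader:
        out = model(images.to(device))        # assumed to return the [CLS] embedding
        feats.append(F.normalize(out, dim=-1).cpu())
        labels.append(targets)
    return torch.cat(feats), torch.cat(labels)

def knn_predict(train_feats, train_labels, test_feats, k=20):
    """Cosine-similarity k-NN classifier with a simple majority vote."""
    sims = test_feats @ train_feats.T          # [N_test, N_train] cosine similarities
    _, nn_idx = sims.topk(k, dim=-1)           # indices of the k nearest training samples
    nn_labels = train_labels[nn_idx]           # [N_test, k]
    return torch.mode(nn_labels, dim=-1).values
```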
Implementation Considerations
- Compute Requirements: Training DINO ViTs requires significant computational resources, typically multiple GPUs (e.g., 8-16 V100/A100 GPUs) for several days, especially on large datasets like ImageNet. Mixed-precision training and large batch sizes (e.g., 1024) are often necessary.
- Hyperparameters: Careful tuning of the EMA decay (λ), the temperature parameters (τ_s, τ_t), and the optimizer settings (AdamW with weight decay and a learning-rate schedule) is important. The multi-crop augmentation strategy (number and size of crops) also affects results; a representative configuration is sketched after this list.
- Code Availability: The authors released code based on PyTorch, facilitating reproduction and application.
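For orientation, the sketch below lists one plausible training configuration in the ballpark of the released code's defaults; every value here is an assumption to verify against the official repository rather than ground truth.

```python
# Representative DINO hyperparameters (approximate; verify against the official repo)
config = dict(
    arch="vit_small",
    patch_size=16,
    out_dim=65536,                     # number of output dimensions K ("prototypes")
    batch_size=1024,
    epochs=100,
    optimizer="adamw",
    base_lr=0.0005 * 1024 / 256,       # linear scaling with batch size, cosine decay
    weight_decay=0.04,                 # increased toward ~0.4 over training
    warmup_epochs=10,
    student_temp=0.1,                  # tau_s
    teacher_temp=0.04,                 # tau_t (optionally warmed up toward ~0.07)
    momentum_teacher=0.996,            # EMA decay lambda, cosine schedule to 1.0
    center_momentum=0.9,
    global_crops_scale=(0.4, 1.0),     # two 224x224 global crops
    local_crops_scale=(0.05, 0.4),     # 96x96 local crops
    local_crops_number=8,
)
```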
In summary, the paper demonstrates that self-supervised pre-training with the DINO self-distillation recipe enables Vision Transformers to learn powerful visual representations. These representations exhibit emergent object-segmentation structure in their attention maps and achieve state-of-the-art performance on a range of downstream tasks, even under simple evaluation protocols such as k-NN classification, which makes them attractive for practical applications with limited supervision.