Vision Transformers with Patch Diversification

Published 26 Apr 2021 in cs.CV and cs.LG | (2104.12753v3)

Abstract: Vision transformer has demonstrated promising performance on challenging computer vision tasks. However, directly training the vision transformers may yield unstable and sub-optimal results. Recent works propose to improve the performance of the vision transformers by modifying the transformer structures, e.g., incorporating convolution layers. In contrast, we investigate an orthogonal approach to stabilize the vision transformer training without modifying the networks. We observe the instability of the training can be attributed to the significant similarity across the extracted patch representations. More specifically, for deep vision transformers, the self-attention blocks tend to map different patches into similar latent representations, yielding information loss and performance degradation. To alleviate this problem, in this work, we introduce novel loss functions in vision transformer training to explicitly encourage diversity across patch representations for more discriminative feature extraction. We empirically show that our proposed techniques stabilize the training and allow us to train wider and deeper vision transformers. We further show the diversified features significantly benefit the downstream tasks in transfer learning. For semantic segmentation, we enhance the state-of-the-art (SOTA) results on Cityscapes and ADE20k. Our code is available at https://github.com/ChengyueGongR/PatchVisionTransformer.

Authors (5)

Citations (59)

Summary

The paper introduces patch-wise cosine, contrastive, and mixing loss functions to reduce patch similarity and enhance feature discrimination.
It demonstrates that diversified patch representations stabilize training and improve performance, with accuracy gains on models like DeiT-Base24 and SWIN-Transformer.
The methods also improve semantic segmentation, achieving 83.6 mIoU on Cityscapes and 54.5 mIoU on ADE20k, highlighting the practical benefits of patch diversification.

The paper "Vision Transformers with Patch Diversification" addresses two problems that arise when vision transformers are trained directly: instability and sub-optimal results, especially for large models. Although vision transformers perform well on many computer vision tasks, their patch representations tend to become highly similar to one another, which causes information loss and degrades performance. The authors encourage patch diversity through novel loss functions rather than changes to the network structure, which stabilizes training and yields more discriminative features.

Key Contributions

The authors highlight three principal contributions to the field of vision transformer research:

Patch-wise Cosine Loss: A regularizer that minimizes the pairwise cosine similarity between patch representations. It directly targets the over-smoothing observed in deep vision transformers, where self-attention blocks map different patches to similar representations, and so keeps the patch features more varied and the model more expressive.
Patch-wise Contrastive Loss: Based on the observation that early-layer representations are naturally more diverse, this loss uses a contrastive objective that keeps each patch's deep-layer representation consistent with its own early-layer representation while pushing it away from the representations of other patches in the same image. Different patches are thereby encouraged to capture different aspects of the input.
Patch-wise Mixing Loss: Inspired by the CutMix augmentation strategy, this technique mixes patches from different images and trains each patch to predict the class label of the image it came from. This pushes the self-attention layers to attend to patches relevant to their own category, improving feature discrimination and robustness.
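The first two losses can be illustrated with a minimal sketch. This is not the authors' implementation: the function names `patch_cosine_loss` and `patch_contrastive_loss`, the `temperature` parameter, and the InfoNCE-style form of the contrastive term are assumptions made for illustration, and the paper's exact normalization may differ. Plain Python lists stand in for the batched tensors a real training loop would use.

```python
import math


def _dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def _normalize(v, eps=1e-8):
    """Scale a vector to (approximately) unit length."""
    norm = math.sqrt(_dot(v, v))
    return [x / (norm + eps) for x in v]


def patch_cosine_loss(patches):
    """Mean cosine similarity over all ordered pairs of distinct patches.

    `patches` is a list of n patch vectors from one image (hypothetical
    layout). Minimizing the returned value pushes patches apart.
    """
    n = len(patches)
    if n < 2:
        return 0.0
    unit = [_normalize(p) for p in patches]
    total = 0.0
    for i in range(n):
        for j in range(n):
            if i != j:
                total += _dot(unit[i], unit[j])
    return total / (n * (n - 1))


def patch_contrastive_loss(early, late, temperature=1.0):
    """InfoNCE-style sketch (an assumed form, not taken from the paper).

    For each patch i, the early-layer vector should be most similar to the
    late-layer vector of the same patch, and less similar to the late-layer
    vectors of the other patches.
    """
    n = len(early)
    e = [_normalize(p) for p in early]
    l = [_normalize(p) for p in late]
    loss = 0.0
    for i in range(n):
        logits = [_dot(e[i], l[j]) / temperature for j in range(n)]
        # Numerically stable log-sum-exp over all candidate patches.
        m = max(logits)
        log_z = m + math.log(sum(math.exp(x - m) for x in logits))
        loss += log_z - logits[i]
    return loss / n


if __name__ == "__main__":
    diverse = [[1.0, 0.0], [0.0, 1.0]]
    collapsed = [[1.0, 0.0], [1.0, 0.0]]
    print(patch_cosine_loss(diverse), patch_cosine_loss(collapsed))
    print(patch_contrastive_loss(diverse, diverse),
          patch_contrastive_loss(diverse, collapsed))
```

In this sketch, orthogonal patches give a cosine loss near 0 and identical patches give a value near 1, so adding the term to the training objective penalizes collapsed representations. The contrastive term is likewise lower when the late-layer patches stay distinct than when they collapse to the same vector.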

Empirical Validation

The experiments indicate that the proposed losses stabilize vision transformer training. On image classification, DeiT-Base24 improves from 82.1% to 83.3% top-1 accuracy when patch diversification is applied. Gains are also reported for the SWIN-Transformer on ImageNet.

The diversity losses also help semantic segmentation. Using backbones trained with patch diversification improves on the previous state-of-the-art results on Cityscapes and ADE20k, reaching 83.6 and 54.5 mIoU, respectively.

Implications and Future Work

These results suggest a practical route to better vision transformer training: deeper and wider models can be trained stably and with higher accuracy without adding architectural complexity. Whether the techniques extend to other transformer paradigms and to other types of data remains open for future investigation.

In conclusion, the proposed diversification strategies improve the practical utility and performance of vision transformers. As the debate over the relative strengths and weaknesses of transformers and CNNs continues, the paper contributes empirically backed methods for a known difficulty in vision transformer training.
