Self-Supervised Learning with Swin Transformers

Published 10 May 2021 in cs.CV | (2105.04553v2)

Abstract: We are witnessing a modeling shift from CNN to Transformers in computer vision. In this work, we present a self-supervised learning approach called MoBY, with Vision Transformers as its backbone architecture. The approach basically has no new inventions, which is combined from MoCo v2 and BYOL and tuned to achieve reasonably high accuracy on ImageNet-1K linear evaluation: 72.8% and 75.0% top-1 accuracy using DeiT-S and Swin-T, respectively, by 300-epoch training. The performance is slightly better than recent works of MoCo v3 and DINO which adopt DeiT as the backbone, but with much lighter tricks. More importantly, the general-purpose Swin Transformer backbone enables us to also evaluate the learnt representations on downstream tasks such as object detection and semantic segmentation, in contrast to a few recent approaches built on ViT/DeiT which only report linear evaluation results on ImageNet-1K due to ViT/DeiT not tamed for these dense prediction tasks. We hope our results can facilitate more comprehensive evaluation of self-supervised learning methods designed for Transformer architectures. Our code and models are available at https://github.com/SwinTransformer/Transformer-SSL, which will be continually enriched.

Authors (7)

Citations (169)

Summary

The paper introduces MoBY, a self-supervised learning approach that combines MoCo v2 and BYOL rather than adding new inventions, with Vision Transformers (DeiT and Swin) as the backbone.
It reports 75.0% top-1 accuracy with Swin-T on ImageNet-1K linear evaluation and also evaluates the learnt representations on downstream tasks such as object detection and semantic segmentation.
The study underscores the efficacy of Swin Transformers in SSL while highlighting the need for further optimizations to surpass supervised methods.


The paper studies self-supervised learning (SSL) with Transformer backbones in computer vision. It presents MoBY, an approach that combines mechanisms from MoCo v2 and BYOL rather than introducing new ones. Evaluation covers not only linear evaluation on ImageNet-1K but also the downstream tasks of object detection and semantic segmentation.

Overview of Techniques

MoBY uses Swin Transformers as a general-purpose backbone: their hierarchical architecture and efficient attention computation make them viable for a range of computer vision tasks, including dense prediction. The approach itself combines existing SSL methods and tunes their hyper-parameters instead of proposing new components. MoBY achieves a top-1 accuracy of 72.8% with DeiT-S and 75.0% with Swin-T on ImageNet-1K linear evaluation after 300 epochs of training. These results are slightly better than those of MoCo v3 and DINO, which adopt DeiT as the backbone, while relying on much lighter tricks.
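
The two ingredients named above can be illustrated with a minimal sketch. It assumes that the MoCo v2 side contributes a momentum-updated key encoder, a queue of negative keys, and a contrastive (InfoNCE) loss, and that the BYOL side contributes a cosine momentum schedule. The function names, the toy list-of-floats representation, and the default values (base_m=0.99, temperature=0.2) are illustrative assumptions and are not taken from the text above or from the released code.

```python
import math

def momentum_update(online, target, m):
    """EMA update of target (key) encoder weights: target <- m*target + (1-m)*online."""
    return [m * t + (1.0 - m) * o for o, t in zip(online, target)]

def cosine_momentum(step, total_steps, base_m=0.99):
    """Cosine schedule (assumed, BYOL-style): momentum rises from base_m to 1.0."""
    return 1.0 - (1.0 - base_m) * (math.cos(math.pi * step / total_steps) + 1.0) / 2.0

def l2_normalize(v):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def contrastive_loss(query, key, queue, temperature=0.2):
    """InfoNCE loss: the key from the other view is the positive,
    and the keys stored in the queue act as negatives."""
    q = l2_normalize(query)
    logits = [dot(q, l2_normalize(key)) / temperature]
    logits += [dot(q, l2_normalize(k)) / temperature for k in queue]
    # Numerically stable log-sum-exp over positive and negative logits.
    max_logit = max(logits)
    log_sum_exp = max_logit + math.log(sum(math.exp(l - max_logit) for l in logits))
    # Cross-entropy with the positive at index 0.
    return log_sum_exp - logits[0]

if __name__ == "__main__":
    # Toy weights and embeddings, purely for illustration.
    print(momentum_update([1.0, 2.0], [0.0, 0.0], m=0.99))
    print(cosine_momentum(step=150, total_steps=300))
    print(contrastive_loss([1.0, 0.0], [1.0, 0.0], queue=[[0.0, 1.0]]))
```

In a real training loop the query would come from the online encoder, the key from the momentum encoder applied to a second augmented view, and the queue would be refreshed with recent keys. The sketch only shows how the pieces fit together.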

Performance and Evaluations

The experiments evaluate MoBY in two settings:

ImageNet-1K Linear Evaluation: With MoBY pre-training, Swin-T reaches 75.0% top-1 accuracy versus 72.8% for DeiT-S, a gain of 2.2 percentage points that points to the benefit of the Swin architecture.
Downstream Tasks: Because Swin Transformer is a general-purpose backbone, the learnt representations can also be evaluated on COCO object detection and ADE20K semantic segmentation. There MoBY performs roughly on par with supervised pre-training, which indicates that the learnt features transfer. However, unlike previous approaches using ResNet backbones, MoBY does not surpass supervised pre-training, which leaves room for further research.
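
The 2.2 gap between Swin-T and DeiT-S is an absolute difference in top-1 accuracy (percentage points), not a relative improvement. The short calculation below makes the distinction explicit, using only the two accuracies reported in the abstract. The helper name accuracy_gain is hypothetical.

```python
def accuracy_gain(baseline_top1, improved_top1):
    """Return (absolute gain in percentage points, relative gain in percent)."""
    absolute = improved_top1 - baseline_top1
    relative = 100.0 * absolute / baseline_top1
    return round(absolute, 1), round(relative, 2)

if __name__ == "__main__":
    # Top-1 accuracies reported for MoBY with DeiT-S and Swin-T.
    absolute, relative = accuracy_gain(72.8, 75.0)
    print(f"absolute gain: {absolute} percentage points")
    print(f"relative gain: {relative}%")
```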

Implications and Future Directions

The results present two key implications:

Architectural Efficacy of Swin Transformers: Using Swin Transformer as the backbone of a self-supervised framework broadens the evaluation scope beyond ImageNet classification to downstream tasks such as detection and segmentation.
Further Optimizations Needed: MoBY does not outperform supervised pre-training with Transformer backbones on downstream tasks, which suggests that SSL techniques tailored to Transformer-based architectures still need further work.

Moving forward, researchers should investigate the incorporation of advanced augmentation techniques, optimization algorithms, or architectural tweaks to exploit the full potential of self-supervised learning in Transformer-based systems. Additionally, exploring how the integration of other learning paradigms may enhance the efficacy of SSL could further contribute to the development of more generalized and effective vision models.

In conclusion, the paper shows that competitive self-supervised results with Transformer architectures are attainable by combining and tuning existing methods rather than through new inventions. Its findings provide a basis for more comprehensive evaluation of self-supervised methods designed for Transformer backbones across computer vision applications.