Swin Transformer: Hierarchical Vision Transformer using Shifted Windows

(2103.14030)
Published Mar 25, 2021 in cs.CV and cs.LG

Abstract

This paper presents a new vision Transformer, called Swin Transformer, that capably serves as a general-purpose backbone for computer vision. Challenges in adapting Transformer from language to vision arise from differences between the two domains, such as large variations in the scale of visual entities and the high resolution of pixels in images compared to words in text. To address these differences, we propose a hierarchical Transformer whose representation is computed with Shifted windows. The shifted windowing scheme brings greater efficiency by limiting self-attention computation to non-overlapping local windows while also allowing for cross-window connection. This hierarchical architecture has the flexibility to model at various scales and has linear computational complexity with respect to image size. These qualities of Swin Transformer make it compatible with a broad range of vision tasks, including image classification (87.3 top-1 accuracy on ImageNet-1K) and dense prediction tasks such as object detection (58.7 box AP and 51.1 mask AP on COCO test-dev) and semantic segmentation (53.5 mIoU on ADE20K val). Its performance surpasses the previous state-of-the-art by a large margin of +2.7 box AP and +2.6 mask AP on COCO, and +3.2 mIoU on ADE20K, demonstrating the potential of Transformer-based models as vision backbones. The hierarchical design and the shifted window approach also prove beneficial for all-MLP architectures. The code and models are publicly available at https://github.com/microsoft/Swin-Transformer.

Figure: The shifted window approach for computing self-attention in the Swin Transformer, which enhances connections between windows across successive layers.

Overview

  • The Swin Transformer model introduces a hierarchical structure and shifted window approach for efficient self-attention computation in vision tasks, allowing for effective management of varying visual scales.

  • The model achieves state-of-the-art performance in image classification, object detection, and semantic segmentation, significantly outperforming previous models like ResNet and DeiT.

  • The hierarchical design and novel window-based self-attention not only enhance vision-specific applications but also suggest potential benefits for multi-modal learning and cross-domain Transformer architectures.

Swin Transformer: Hierarchical Vision Transformer using Shifted Windows

The paper introduces the Swin Transformer, a novel vision Transformer designed to serve as a general-purpose backbone for a wide range of computer vision tasks. The model addresses the challenges inherent in adapting Transformers, originally developed for NLP, to vision problems. These challenges stem primarily from differences in domain characteristics, such as the large scale variability of visual entities and the high pixel resolution of images compared with the relatively low resolution and fixed scale of textual tokens.

Key Characteristics of the Swin Transformer

The Swin Transformer is distinguished by its hierarchical structure and the use of shifted windows for computing self-attention.

  1. Hierarchical Representation: The model partitions the image into non-overlapping patches and progressively merges neighboring patches in deeper stages, producing a feature hierarchy that can describe visual entities at multiple scales. This pyramid of features allows the backbone to handle the large variations in the scale of visual entities and to plug directly into dense-prediction techniques such as Feature Pyramid Networks (FPN), much as convolutional backbones do (a minimal patch-merging sketch follows this list).
  2. Shifted Windows: Self-attention is computed within non-overlapping local windows, reducing its complexity from quadratic to linear with respect to image size. The window partition alternates between a regular configuration and a shifted configuration in consecutive layers, so features propagate across window boundaries; these cross-window connections are critical for effective feature representation. This window-shifting strategy is shown to improve accuracy significantly at negligible extra cost, yielding a better balance between accuracy and efficiency (see the shifted-window sketch after this list).
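
To make the hierarchical downsampling concrete, here is a minimal PyTorch-style sketch of a patch-merging step between stages: each 2x2 group of neighboring patch tokens is concatenated (giving 4C channels) and linearly projected to 2C, halving the spatial resolution while doubling the channel width. The class and variable names are illustrative rather than taken from the official repository, and details such as padding for odd feature-map sizes are omitted.

```python
import torch
import torch.nn as nn

class PatchMerging(nn.Module):
    """Sketch of hierarchical downsampling between Swin stages.

    Concatenates each 2x2 group of neighboring patch tokens (4C channels)
    and projects them to 2C, halving spatial resolution and doubling width.
    Names and layout here are illustrative, not the official implementation.
    """
    def __init__(self, dim):
        super().__init__()
        self.norm = nn.LayerNorm(4 * dim)
        self.reduction = nn.Linear(4 * dim, 2 * dim, bias=False)

    def forward(self, x):
        # x: (B, H, W, C) patch tokens; H and W are assumed even here
        B, H, W, C = x.shape
        x = x.view(B, H // 2, 2, W // 2, 2, C)                 # split into 2x2 groups
        x = x.permute(0, 1, 3, 2, 4, 5).reshape(B, H // 2, W // 2, 4 * C)
        return self.reduction(self.norm(x))                    # (B, H/2, W/2, 2C)

# Toy usage with hypothetical sizes: a 56x56 map of 96-dim tokens.
tokens = torch.randn(2, 56, 56, 96)
merged = PatchMerging(96)(tokens)
print(merged.shape)  # torch.Size([2, 28, 28, 192])
```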
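The shifted-window mechanism itself can also be sketched in a few lines. Regular blocks partition the feature map into M x M windows and attend within each window, so the attention cost is roughly 4hwC^2 + 2M^2·hwC rather than the 4hwC^2 + 2(hw)^2·C of global attention; the next block first applies a cyclic shift (torch.roll) so that the new windows straddle the previous window boundaries. The helper names and sizes below are illustrative, and the attention mask used in the paper to separate wrapped-around regions is omitted for brevity.

```python
import torch

def window_partition(x, window_size):
    """Split a (B, H, W, C) feature map into non-overlapping windows.

    Returns (num_windows * B, window_size, window_size, C); self-attention
    is then applied independently inside each window, so the cost grows
    linearly with H*W instead of quadratically.
    """
    B, H, W, C = x.shape
    x = x.view(B, H // window_size, window_size, W // window_size, window_size, C)
    return x.permute(0, 1, 3, 2, 4, 5).reshape(-1, window_size, window_size, C)

def shift_then_partition(x, window_size, shift_size):
    """Illustrative sketch of the shifted-window step between consecutive blocks.

    A cyclic shift by (shift_size, shift_size) moves features across the old
    window boundaries, so the next round of window attention mixes information
    between previously separate windows. (The real model also applies an
    attention mask so non-adjacent wrapped regions do not attend to each other,
    which is omitted here.)
    """
    shifted = torch.roll(x, shifts=(-shift_size, -shift_size), dims=(1, 2))
    return window_partition(shifted, window_size)

# Toy usage with hypothetical sizes: a 56x56 map, 7x7 windows, shift of 3.
x = torch.randn(1, 56, 56, 96)
regular = window_partition(x, window_size=7)                     # (64, 7, 7, 96)
shifted = shift_then_partition(x, window_size=7, shift_size=3)   # (64, 7, 7, 96)
print(regular.shape, shifted.shape)
```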

Performance and Comparisons

The Swin Transformer demonstrates superior performance across several vision tasks, outperforming previous state-of-the-art models in both efficiency and accuracy.

  1. Image Classification: On the ImageNet-1K dataset, the Swin-T variant achieves a top-1 accuracy of 81.3%, significantly higher than comparable convolutional and Transformer-based models (e.g., ResNet and DeiT). When extended to ImageNet-22K pre-training and subsequent fine-tuning, the Swin-B and Swin-L models reach top-1 accuracies of 86.4% and 87.3%, respectively.
  2. Object Detection and Segmentation: Used as a backbone in frameworks such as Cascade Mask R-CNN and ATSS, the Swin Transformer consistently outperforms ResNe(X)t and DeiT backbones on the COCO dataset. With the HTC++ framework, the Swin-L model reaches 58.7 box AP and 51.1 mask AP on COCO test-dev, surpassing the previous best results by +2.7 box AP and +2.6 mask AP.
  3. Semantic Segmentation: On the ADE20K val set, the Swin-L backbone achieves 53.5 mIoU, surpassing the previous best-performing model by +3.2 mIoU.

Implications and Future Directions

The Swin Transformer's success in vision tasks suggests that the model's hierarchical design and shifted window approach are effective strategies for adapting Transformer architectures to computer vision. This not only opens avenues for further optimization in vision-specific models but also signals potential for unified models that span both vision and language domains. Such unified models can leverage the strengths of both fields, fostering advancements in multi-modal learning and joint visual-textual tasks.

Moreover, the principles underlying the Swin Transformer, particularly the shifted window-based self-attention, could be explored for efficiency enhancements in NLP applications. This approach presents a promising direction for future research, aiming to harmonize the application of Transformers across varying data modalities while preserving computational efficiency.

Conclusion

The Swin Transformer represents a significant advancement in the utilization of Transformer models for computer vision, achieving state-of-the-art results and offering new insights into efficient self-attention mechanisms. Its hierarchical and shifted window-based design not only addresses the unique challenges of vision tasks but also sets a foundation for future cross-domain Transformer architectures. As both practical and theoretical implications unfold, the Swin Transformer is poised to influence ongoing developments in AI and machine learning, driving forward innovation in model design and application scope.
