Swin-Unet: Unet-like Pure Transformer for Medical Image Segmentation (2105.05537v1)

Published 12 May 2021 in eess.IV and cs.CV

Abstract: In the past few years, convolutional neural networks (CNNs) have achieved milestones in medical image analysis. Especially, the deep neural networks based on U-shaped architecture and skip-connections have been widely applied in a variety of medical image tasks. However, although CNN has achieved excellent performance, it cannot learn global and long-range semantic information interaction well due to the locality of the convolution operation. In this paper, we propose Swin-Unet, which is an Unet-like pure Transformer for medical image segmentation. The tokenized image patches are fed into the Transformer-based U-shaped Encoder-Decoder architecture with skip-connections for local-global semantic feature learning. Specifically, we use hierarchical Swin Transformer with shifted windows as the encoder to extract context features. And a symmetric Swin Transformer-based decoder with patch expanding layer is designed to perform the up-sampling operation to restore the spatial resolution of the feature maps. Under the direct down-sampling and up-sampling of the inputs and outputs by 4x, experiments on multi-organ and cardiac segmentation tasks demonstrate that the pure Transformer-based U-shaped Encoder-Decoder network outperforms those methods with full-convolution or the combination of transformer and convolution. The codes and trained models will be publicly available at https://github.com/HuCaoFighting/Swin-Unet.

Citations (2,258)

Summary

  • The paper introduces a pure transformer-based U-shaped architecture that leverages hierarchical Swin Transformer blocks to overcome CNN limitations in capturing global context.
  • It employs a novel patch expanding layer and optimized skip connections for efficient up-sampling, outperforming traditional interpolation methods.
  • Experimental evaluations on Synapse and ACDC datasets show high Dice scores and improved edge prediction, validating its superior performance in medical segmentation.

Swin-Unet: Unet-like Pure Transformer for Medical Image Segmentation

Introduction

The paper presents Swin-Unet, a transformer-based U-shaped architecture designed for medical image segmentation. Traditional segmentation methods often rely on convolutional neural networks (CNNs) with U-shaped architectures because of their ability to learn discriminative features. However, these CNN architectures struggle to capture global context due to the intrinsic locality of convolution operations. Swin-Unet addresses this limitation with a purely transformer-based network, using hierarchical Swin Transformers with shifted windows for both encoding and decoding to improve semantic learning and segmentation accuracy.

Architecture Overview

Swin-Unet builds upon Swin Transformers and incorporates a U-shaped encoder-decoder architecture with skip connections, similar to the classical U-Net framework but without any convolution operations (Figure 1).

Figure 1: The architecture of Swin-Unet, consisting of encoder, bottleneck, decoder, and skip connections—constructed using Swin Transformer blocks.

The process begins with tokenizing the input images into non-overlapping patches, which are then processed through Swin Transformer blocks in the encoder to capture hierarchical feature representations. A novel patch expanding layer in the decoder performs up-sampling, restoring resolution by reshaping features rather than interpolating them. Together, the hierarchical Swin Transformer stages enable efficient local-global feature learning.
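
To make the patch expanding operation concrete, below is a minimal PyTorch sketch of a layer that doubles spatial resolution while halving the channel count by reshaping an expanded token sequence instead of interpolating. The class name, tensor layout, and example sizes are illustrative assumptions, not the authors' released code.

import torch
import torch.nn as nn

class PatchExpand(nn.Module):
    """Doubles spatial resolution and halves channels without interpolation."""

    def __init__(self, dim: int):
        super().__init__()
        self.expand = nn.Linear(dim, 2 * dim, bias=False)  # C -> 2C
        self.norm = nn.LayerNorm(dim // 2)

    def forward(self, x: torch.Tensor, h: int, w: int) -> torch.Tensor:
        # x: (B, H*W, C) token sequence from the previous decoder stage
        x = self.expand(x)                        # (B, H*W, 2C)
        b, _, c = x.shape
        x = x.view(b, h, w, 2, 2, c // 4)         # split 2C into a 2x2 block of C/2 features
        x = x.permute(0, 1, 3, 2, 4, 5)           # (B, H, 2, W, 2, C/2)
        x = x.reshape(b, (2 * h) * (2 * w), c // 4)
        return self.norm(x)                       # (B, 2H*2W, C/2)

# Example: a 14x14 grid of 768-dim tokens becomes a 28x28 grid of 384-dim tokens.
tokens = torch.randn(1, 14 * 14, 768)
up = PatchExpand(dim=768)
print(up(tokens, 14, 14).shape)  # torch.Size([1, 784, 384])

A final expanding step of this kind restores the 4x down-sampled token grid to the input resolution before the pixel-level segmentation head.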

Swin Transformer Block

A key component of this architecture is the Swin Transformer block, which adopts window-based multi-head self-attention (W-MSA) and shifted-window multi-head self-attention (SW-MSA) mechanisms. This structure allows the network to compute self-attention efficiently across partitioned image patches, providing enhanced contextual and hierarchical feature extraction capabilities (Figure 2).

Figure 2: Swin Transformer block.
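
To illustrate how these two attention modes can be arranged, the following PyTorch sketch partitions a feature map into fixed-size windows, computes self-attention within each window, and cyclically shifts the map in alternating blocks so that information flows across window boundaries. Function names are illustrative, and the attention mask that real Swin blocks apply to shifted windows (to keep wrapped-around regions from attending to each other) is omitted for brevity.

import torch
import torch.nn as nn

def window_partition(x: torch.Tensor, ws: int) -> torch.Tensor:
    """Split a (B, H, W, C) feature map into (num_windows*B, ws*ws, C) token groups."""
    b, h, w, c = x.shape
    x = x.view(b, h // ws, ws, w // ws, ws, c)
    return x.permute(0, 1, 3, 2, 4, 5).reshape(-1, ws * ws, c)

def window_reverse(windows: torch.Tensor, ws: int, h: int, w: int) -> torch.Tensor:
    """Inverse of window_partition, back to (B, H, W, C)."""
    b = windows.shape[0] // ((h // ws) * (w // ws))
    x = windows.view(b, h // ws, w // ws, ws, ws, -1)
    return x.permute(0, 1, 3, 2, 4, 5).reshape(b, h, w, -1)

def windowed_attention(x, attn: nn.MultiheadAttention, ws: int = 7, shift: int = 0):
    """Self-attention within (optionally shifted) local windows; masking omitted."""
    b, h, w, c = x.shape
    if shift > 0:                                   # SW-MSA: cyclically shift the map
        x = torch.roll(x, shifts=(-shift, -shift), dims=(1, 2))
    win = window_partition(x, ws)                   # attention is computed per window
    out, _ = attn(win, win, win, need_weights=False)
    x = window_reverse(out, ws, h, w)
    if shift > 0:                                   # undo the cyclic shift
        x = torch.roll(x, shifts=(shift, shift), dims=(1, 2))
    return x

# W-MSA in one block, SW-MSA (shifted by half a window) in the next.
feat = torch.randn(1, 56, 56, 96)
mha = nn.MultiheadAttention(embed_dim=96, num_heads=3, batch_first=True)
feat = windowed_attention(feat, mha, ws=7, shift=0)   # regular windows
feat = windowed_attention(feat, mha, ws=7, shift=3)   # shifted windows
print(feat.shape)  # torch.Size([1, 56, 56, 96])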

Experimental Evaluation

The benchmark evaluation of Swin-Unet was conducted on the Synapse multi-organ CT and ACDC datasets, where it demonstrated superior performance over existing methods. Swin-Unet achieved a Dice-Similarity Coefficient (DSC) of 79.13% and a Hausdorff Distance (HD) of 21.55 on the Synapse dataset, with the lower HD reflecting improved edge prediction accuracy (Figure 3).

Figure 3: The segmentation results of different methods on the Synapse multi-organ CT dataset.

The architecture effectively mitigates over-segmentation common in CNN-based methods, underscoring Swin-Unet's capability in learning robust long-range semantic information.
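
For reference, the Dice-Similarity Coefficient reported above measures the overlap between a predicted mask and the ground truth for each organ class; the snippet below is a generic binary-mask formulation, not the paper's evaluation script.

import numpy as np

def dice_coefficient(pred: np.ndarray, target: np.ndarray, eps: float = 1e-6) -> float:
    """DSC = 2 * |P ∩ G| / (|P| + |G|) for binary masks."""
    pred = pred.astype(bool)
    target = target.astype(bool)
    intersection = np.logical_and(pred, target).sum()
    return (2.0 * intersection + eps) / (pred.sum() + target.sum() + eps)

# Example: a 4-pixel prediction overlapping a 6-pixel ground-truth region.
pred = np.zeros((4, 4))
pred[:2, :2] = 1
gt = np.zeros((4, 4))
gt[:2, :3] = 1
print(round(dice_coefficient(pred, gt), 3))  # 0.8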

Ablation Studies

Various ablation studies were performed to assess the impact of different design choices on the model's performance:

  • Up-sampling Strategy: The patch expanding layer outperformed bilinear interpolation and transposed convolution in segmentation accuracy, highlighting its efficacy in feature resolution recovery (see the sketch after this list).
  • Number of Skip Connections: The best performance was observed with three skip connections, ensuring robust spatial information recovery across scales.
  • Input Size and Model Scale: While increasing input resolution and model scale marginally improved accuracy, it significantly increased computational demands, leading to a trade-off between performance and efficiency.
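
As a rough illustration of the up-sampling comparison, the sketch below contrasts the three strategies on a 2D feature map. The PixelShuffle-based variant is only an approximate convolutional analogue of the token-level patch expanding layer, and all module choices here are assumptions rather than the experiment code.

import torch
import torch.nn as nn

x = torch.randn(1, 384, 28, 28)            # (B, C, H, W) decoder feature map

# 1) Bilinear interpolation followed by a 1x1 conv to halve the channels.
bilinear = nn.Sequential(
    nn.Upsample(scale_factor=2, mode="bilinear", align_corners=False),
    nn.Conv2d(384, 192, kernel_size=1),
)

# 2) Transposed convolution that learns the up-sampling kernel directly.
transposed = nn.ConvTranspose2d(384, 192, kernel_size=2, stride=2)

# 3) Patch-expanding analogue: a 1x1 conv widens channels, then a pixel-shuffle
#    reshaping trades channels for spatial resolution (no interpolation).
patch_expand = nn.Sequential(
    nn.Conv2d(384, 768, kernel_size=1, bias=False),   # plays the role of the linear expand
    nn.PixelShuffle(upscale_factor=2),                # (B, 768, H, W) -> (B, 192, 2H, 2W)
)

for name, up in [("bilinear", bilinear), ("transposed conv", transposed),
                 ("patch expand", patch_expand)]:
    print(name, tuple(up(x).shape))        # each yields (1, 192, 56, 56)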

Conclusion

Swin-Unet leverages the Swin Transformer for effective segmentation tasks, demonstrating that pure transformer architectures can surpass traditional convolution-based methods in medical image analysis. The integration of the hierarchical structure and attention mechanisms allows Swin-Unet to efficiently glean semantic information, establishing a promising approach for robust 2D medical image segmentation.

Future research may focus on adapting Swin-Unet for 3D medical image segmentation, and exploring pre-training strategies to further improve its applicability in medical imaging contexts.
