
Swin-Unet: Unet-like Pure Transformer for Medical Image Segmentation (2105.05537v1)

Published 12 May 2021 in eess.IV and cs.CV

Abstract: In the past few years, convolutional neural networks (CNNs) have achieved milestones in medical image analysis. Especially, the deep neural networks based on U-shaped architecture and skip-connections have been widely applied in a variety of medical image tasks. However, although CNN has achieved excellent performance, it cannot learn global and long-range semantic information interaction well due to the locality of the convolution operation. In this paper, we propose Swin-Unet, which is an Unet-like pure Transformer for medical image segmentation. The tokenized image patches are fed into the Transformer-based U-shaped Encoder-Decoder architecture with skip-connections for local-global semantic feature learning. Specifically, we use hierarchical Swin Transformer with shifted windows as the encoder to extract context features. And a symmetric Swin Transformer-based decoder with patch expanding layer is designed to perform the up-sampling operation to restore the spatial resolution of the feature maps. Under the direct down-sampling and up-sampling of the inputs and outputs by 4x, experiments on multi-organ and cardiac segmentation tasks demonstrate that the pure Transformer-based U-shaped Encoder-Decoder network outperforms those methods with full-convolution or the combination of transformer and convolution. The codes and trained models will be publicly available at https://github.com/HuCaoFighting/Swin-Unet.

Citations (2,258)

Summary

  • The paper introduces a pure transformer-based U-shaped architecture that leverages hierarchical Swin Transformer blocks to overcome CNN limitations in capturing global context.
  • It employs a novel patch expanding layer together with skip connections for up-sampling; in the ablation study this layer outperforms bilinear interpolation and transposed convolution.
  • Experimental evaluations on the Synapse and ACDC datasets show high Dice scores and improved edge prediction compared with the other methods tested.

Swin-Unet: Unet-like Pure Transformer for Medical Image Segmentation

Introduction

The paper presents Swin-Unet, a transformer-based U-shaped architecture designed for medical image segmentation. Traditional segmentation methods often rely on convolutional neural networks (CNNs) with U-shaped architectures because they learn discriminative features well. However, the intrinsic locality of the convolution operation makes it hard for these networks to capture global context. Swin-Unet addresses this limitation with a purely transformer-based network that uses hierarchical Swin Transformer blocks with shifted windows in both the encoder and the decoder, aiming at better semantic learning and segmentation accuracy.

Architecture Overview

Swin-Unet builds upon Swin Transformers and incorporates a U-shaped encoder-decoder architecture with skip connections, similar to the classical U-Net framework but without any convolution operations.

Figure 1: The architecture of Swin-Unet, consisting of encoder, bottleneck, decoder, and skip connections, all constructed using Swin Transformer blocks.
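To make the U-shape concrete, the sketch below traces feature-map shapes through an encoder, bottleneck, and decoder of this kind. It is only an illustration under stated assumptions: the function name `swin_unet_shapes` is hypothetical, and the defaults (224x224 input, patch size 4 to match the 4x down-sampling mentioned in the abstract, embedding width 96, 9 output classes, three down-sampling steps) are assumed values rather than figures taken from this summary.

```python
def swin_unet_shapes(height=224, width=224, patch_size=4, embed_dim=96, num_classes=9):
    """Trace (stage name, H, W, channels) through a Swin-Unet-style network.

    All defaults are illustrative assumptions, not values from the summary.
    """
    if height % (patch_size * 8) or width % (patch_size * 8):
        raise ValueError("height and width must be divisible by patch_size * 8")

    # Patch embedding: each patch_size x patch_size patch becomes one token.
    h, w, c = height // patch_size, width // patch_size, embed_dim
    shapes = [("patch embedding / encoder stage 1", h, w, c)]

    # Encoder: each down-sampling step halves resolution and doubles channels.
    for name in ("encoder stage 2", "encoder stage 3", "bottleneck"):
        h, w, c = h // 2, w // 2, c * 2
        shapes.append((name, h, w, c))

    # Decoder: each patch expanding step doubles resolution and halves channels.
    for name in ("decoder stage 1", "decoder stage 2", "decoder stage 3"):
        h, w, c = h * 2, w * 2, c // 2
        shapes.append((name, h, w, c))

    # A final expansion restores the input resolution; a linear head maps
    # each pixel's features to class scores.
    h, w = h * patch_size, w * patch_size
    shapes.append(("final patch expanding", h, w, c))
    shapes.append(("segmentation head", h, w, num_classes))
    return shapes


for name, h, w, c in swin_unet_shapes():
    print(f"{name:35s} {h:4d} x {w:4d} x {c}")
```

Under these assumptions, each of the three decoder stages has an encoder stage of matching resolution, which is where the skip connections would attach.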

The process begins by splitting the input image into non-overlapping patches (tokens), which the encoder's Swin Transformer blocks process to build hierarchical feature representations. In the decoder, a novel patch expanding layer performs the up-sampling: it raises spatial resolution by reshaping features rather than by interpolation. Using Swin Transformer blocks at every level of this hierarchy supports efficient local-global feature learning.
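The "reshaping without interpolation" idea can be sketched in plain Python. The version below assumes one particular layout: a bias-free linear layer first doubles each token's channels, and the result is split into a 2x2 block of neighbouring tokens with half the original channels each. The names `linear` and `patch_expand` are hypothetical, and details such as normalization are omitted, so treat this as a sketch rather than the authors' implementation.

```python
import random


def linear(token, weight):
    """Bias-free linear layer on one token: y[j] = sum_i token[i] * weight[i][j]."""
    return [sum(x * row[j] for x, row in zip(token, weight))
            for j in range(len(weight[0]))]


def patch_expand(feature_map, weight):
    """Up-sample an H x W x C nested list to 2H x 2W x C/2.

    `weight` has shape C x 2C (assumed layout). No interpolation is used:
    new spatial positions are filled by regrouping channels.
    """
    h, w, c = len(feature_map), len(feature_map[0]), len(feature_map[0][0])
    if c % 2 or len(weight) != c or len(weight[0]) != 2 * c:
        raise ValueError("expected even C and a weight of shape C x 2C")
    half = c // 2
    out = [[None] * (2 * w) for _ in range(2 * h)]
    for i in range(h):
        for j in range(w):
            expanded = linear(feature_map[i][j], weight)  # 2C channels
            # Split the 2C channels into a 2x2 block of tokens, C/2 channels each.
            for p1 in range(2):
                for p2 in range(2):
                    start = (p1 * 2 + p2) * half
                    out[2 * i + p1][2 * j + p2] = expanded[start:start + half]
    return out


random.seed(0)
h, w, c = 2, 3, 4
features = [[[random.random() for _ in range(c)] for _ in range(w)] for _ in range(h)]
weight = [[random.gauss(0, 1) for _ in range(2 * c)] for _ in range(c)]
expanded = patch_expand(features, weight)
print(len(expanded), len(expanded[0]), len(expanded[0][0]))  # 4 6 2
```

Because the linear layer is learned, the network decides how each coarse token is distributed over its four finer positions, which is the contrast with a fixed rule such as bilinear interpolation.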

Swin Transformer Block

A key component of this architecture is the Swin Transformer block, which uses window-based multi-head self-attention (W-MSA) and its shifted-window variant (SW-MSA). Self-attention is computed within non-overlapping local windows of tokens, which keeps the cost manageable, while shifting the window grid between consecutive blocks lets information flow across window boundaries. Together these provide contextual and hierarchical feature extraction.

Figure 2: Swin Transformer block.
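A small sketch can show why shifting the windows matters. The helper below (the name `window_index` is hypothetical) assigns each token on a grid to a window, optionally after a cyclic shift of the grid. The grid size, window size, and shift are illustrative values, and the attention masking that a full implementation applies to wrapped-around tokens is left out.

```python
def window_index(row, col, height, width, window_size, shift=0):
    """Return the index of the window containing token (row, col).

    With shift > 0 the grid is cyclically shifted before partitioning,
    so the window boundaries fall in different places.
    """
    if height % window_size or width % window_size:
        raise ValueError("grid must be divisible by window_size")
    r = (row - shift) % height
    c = (col - shift) % width
    windows_per_row = width // window_size
    return (r // window_size) * windows_per_row + (c // window_size)


# An 8x8 token grid with 4x4 windows; tokens (3, 3) and (4, 4) are
# diagonal neighbours that sit on opposite sides of a window boundary.
regular = [window_index(r, c, 8, 8, 4) for r, c in [(3, 3), (4, 4)]]
shifted = [window_index(r, c, 8, 8, 4, shift=2) for r, c in [(3, 3), (4, 4)]]
print(regular)  # [0, 3]
print(shifted)  # [0, 0]
```

In the regular partition the two tokens never attend to each other; after the shift they share a window, so alternating the two partitions connects neighbouring windows.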

Experimental Evaluation

Swin-Unet was evaluated on the Synapse multi-organ CT and ACDC datasets, where it outperformed the compared methods. On the Synapse dataset it achieved a Dice-Similarity Coefficient (DSC) of 79.13% and a Hausdorff Distance (HD) of 21.55, with the HD result reflecting improved edge prediction accuracy.

Figure 3: The segmentation results of different methods on the Synapse multi-organ CT dataset.

The architecture effectively mitigates over-segmentation common in CNN-based methods, underscoring Swin-Unet's capability in learning robust long-range semantic information.
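For readers unfamiliar with the two metrics reported above, the following sketch computes them on tiny binary masks. It uses the textbook definitions: Dice as twice the overlap divided by the total foreground size, and Hausdorff distance as the largest nearest-neighbour distance between the two foreground sets. The function names are hypothetical, and the exact HD variant used in the paper's evaluation (for example a percentile-based one) is not stated in this summary, so the plain maximum is an assumption.

```python
import math


def foreground(mask):
    """Set of (row, col) coordinates where a binary mask is non-zero."""
    return {(i, j) for i, row in enumerate(mask) for j, v in enumerate(row) if v}


def dice_coefficient(pred, target):
    """Dice = 2 * |A intersect B| / (|A| + |B|); 1.0 when both masks are empty."""
    a, b = foreground(pred), foreground(target)
    if not a and not b:
        return 1.0
    return 2 * len(a & b) / (len(a) + len(b))


def hausdorff_distance(pred, target):
    """Symmetric Hausdorff distance between the foreground pixel sets."""
    a, b = foreground(pred), foreground(target)
    if not a or not b:
        raise ValueError("both masks need at least one foreground pixel")

    def directed(src, dst):
        # Farthest any source pixel is from its nearest destination pixel.
        return max(min(math.dist(p, q) for q in dst) for p in src)

    return max(directed(a, b), directed(b, a))


pred = [[1, 1, 0],
        [0, 0, 0]]
target = [[0, 1, 1],
          [0, 0, 0]]
print(dice_coefficient(pred, target))    # 0.5
print(hausdorff_distance(pred, target))  # 1.0
```

Dice rewards overlap regardless of where the errors are, whereas the Hausdorff distance is driven by the worst boundary mismatch, which is why it is read as an indicator of edge quality.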

Ablation Studies

Various ablation studies were performed to assess the impact of different design choices on the model's performance:

  • Up-sampling Strategy: The patch expanding layer outperformed bilinear interpolation and transposed convolution in segmentation accuracy, highlighting its efficacy in feature resolution recovery.
  • Number of Skip Connections: The best performance was observed with three skip connections, ensuring robust spatial information recovery across scales.
  • Input Size and Model Scale: While increasing input resolution and model scale marginally improved accuracy, it significantly increased computational demands, leading to a trade-off between performance and efficiency.
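The cost side of the input-size trade-off can be estimated with a rough count of attention operations. The sketch below uses the standard complexity expressions for global and window-based self-attention from the Swin Transformer literature: 4hwC^2 + 2(hw)^2 C for global attention and 4hwC^2 + 2M^2 hwC for windows of size M. The input sizes (224 and 384), patch size 4, channel width 96, and window size 7 are hypothetical example values, and the count ignores MLP layers and whether the grid divides evenly into windows.

```python
def global_msa_cost(h, w, c):
    """Approximate operation count of global self-attention on an h x w token grid."""
    tokens = h * w
    return 4 * tokens * c**2 + 2 * tokens**2 * c


def window_msa_cost(h, w, c, window_size):
    """Approximate operation count of window-based self-attention (W-MSA)."""
    tokens = h * w
    return 4 * tokens * c**2 + 2 * window_size**2 * tokens * c


for input_size in (224, 384):
    side = input_size // 4  # tokens per side after 4x4 patch embedding (assumed)
    print(
        f"input {input_size}: {side * side} tokens, "
        f"W-MSA ~{window_msa_cost(side, side, 96, 7):.2e}, "
        f"global MSA ~{global_msa_cost(side, side, 96):.2e}"
    )
```

With a fixed window size the windowed cost grows linearly with the number of tokens, so going from 3136 to 9216 tokens roughly triples it, while global attention would grow quadratically.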

Conclusion

Swin-Unet applies the Swin Transformer to medical image segmentation, demonstrating that a pure transformer architecture can surpass convolution-based methods on the evaluated benchmarks. Its hierarchical structure and attention mechanisms let it capture semantic information efficiently, making it a promising approach for robust 2D medical image segmentation.

Future research may focus on adapting Swin-Unet for 3D medical image segmentation, and exploring pre-training strategies to further improve its applicability in medical imaging contexts.
