VM-UNet: Vision Mamba UNet for Medical Image Segmentation (2402.02491v2)

Published 4 Feb 2024 in cs.CV and eess.IV

Abstract: In the realm of medical image segmentation, both CNN-based and Transformer-based models have been extensively explored. However, CNNs exhibit limitations in long-range modeling capabilities, whereas Transformers are hampered by their quadratic computational complexity. Recently, State Space Models (SSMs), exemplified by Mamba, have emerged as a promising approach. They not only excel in modeling long-range interactions but also maintain a linear computational complexity. In this paper, leveraging state space models, we propose a U-shape architecture model for medical image segmentation, named Vision Mamba UNet (VM-UNet). Specifically, the Visual State Space (VSS) block is introduced as the foundation block to capture extensive contextual information, and an asymmetrical encoder-decoder structure is constructed with fewer convolution layers to save calculation cost. We conduct comprehensive experiments on the ISIC17, ISIC18, and Synapse datasets, and the results indicate that VM-UNet performs competitively in medical image segmentation tasks. To our best knowledge, this is the first medical image segmentation model constructed based on the pure SSM-based model. We aim to establish a baseline and provide valuable insights for the future development of more efficient and effective SSM-based segmentation systems. Our code is available at https://github.com/JCruan519/VM-UNet.

Summary

- The paper presents VM-UNet, a U-shaped segmentation model built purely from State Space Model (SSM) blocks, which models long-range dependencies at linear computational complexity where Transformers scale quadratically.
- The method uses Visual State Space (VSS) blocks within an asymmetrical encoder-decoder structure to capture extensive contextual information.
- Experiments on the ISIC17, ISIC18, and Synapse datasets, reported in mIoU, DSC, and HD95, indicate that VM-UNet performs competitively with CNN-based and Transformer-based baselines.

Overview of VM-UNet: Vision Mamba UNet for Medical Image Segmentation

The paper "VM-UNet: Vision Mamba UNet for Medical Image Segmentation" presents a novel architecture for medical image segmentation tasks, leveraging State Space Models (SSMs). The authors propose VM-UNet, a U-shaped pure SSM-based model aiming to overcome limitations associated with CNN and Transformer-based methods.

Introduction

The authors address the need to model long-range interactions efficiently, which is crucial for medical image segmentation. CNNs struggle with long-range dependencies because their receptive fields are limited, while Transformers are computationally expensive because self-attention scales quadratically with the number of tokens. VM-UNet is built on the Mamba SSM, which offers long-range modeling at linear computational complexity.
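The scaling argument can be illustrated with a little arithmetic. The sketch below counts tokens for a square image and compares the number of pairwise interactions in self-attention with the number of sequential steps in a scan. The patch size of 4 and the helper names are assumptions chosen for illustration, and the counts ignore constant factors such as channel width and state size.

```python
def token_count(height, width, patch=4):
    """Number of tokens when an image is split into patch x patch tiles."""
    return (height // patch) * (width // patch)

def attention_pairs(n_tokens):
    """Self-attention compares every token with every token: quadratic growth."""
    return n_tokens ** 2

def scan_steps(n_tokens):
    """A recurrent scan visits each token once: linear growth."""
    return n_tokens

# Doubling the image side quadruples the token count, so attention pairs
# grow 16x per doubling while scan steps grow only 4x.
for side in (256, 512, 1024):
    n = token_count(side, side)
    print(f"{side}x{side}: tokens={n}, attention pairs={attention_pairs(n)}, scan steps={scan_steps(n)}")
```

This gap widens with resolution, which is why linear-complexity models are attractive for large medical images.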

Methodology

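VM-UNet has a U-shaped architecture with an asymmetrical encoder-decoder structure. The Visual State Space (VSS) block, derived from VMamba, is the building block for both the encoder and the decoder. Its core is a state space model: a linear Ordinary Differential Equation (ODE) maps the input sequence to the output through a hidden state, and the ODE is discretized so that it can be applied to token sequences as a recurrence. Each token then requires a fixed amount of computation, which keeps the overall cost linear in sequence length while the hidden state carries context across the whole sequence.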
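To make the discretization step concrete, here is a minimal NumPy sketch of a diagonal linear SSM discretized with a zero-order hold and unrolled as a recurrence. It is an illustration under simplifying assumptions, not the VSS block itself: the parameters `a`, `b`, `c` and the step size `delta` are fixed, hypothetical values, whereas Mamba makes such parameters input-dependent and VMamba applies the scan to 2D feature maps. The function names are invented for this example.

```python
import numpy as np

def zoh_discretize(a, b, delta):
    """Zero-order-hold discretization of h'(t) = a*h(t) + b*x(t) with diagonal a.

    a: (N,) nonzero diagonal entries of the state matrix
    b: (N,) input weights
    delta: scalar step size
    """
    a_bar = np.exp(delta * a)        # discrete state transition
    b_bar = (a_bar - 1.0) / a * b    # discrete input weights
    return a_bar, b_bar

def ssm_scan(x, a, b, c, delta):
    """Run h_k = a_bar*h_{k-1} + b_bar*x_k, y_k = c . h_k over a 1D sequence."""
    a_bar, b_bar = zoh_discretize(a, b, delta)
    h = np.zeros_like(a)
    y = np.empty(len(x))
    for k, x_k in enumerate(x):      # one constant-cost update per token
        h = a_bar * h + b_bar * x_k
        y[k] = c @ h
    return y

# Hypothetical two-state system driven by a constant input.
a = np.array([-1.0, -0.5])
b = np.array([1.0, 1.0])
c = np.array([1.0, 1.0])
y = ssm_scan(np.ones(8), a, b, c, delta=0.1)
```

For a constant input the zero-order hold is exact, so `y[-1]` matches the closed-form solution of the ODE at time 0.8. The loop performs one update per token, which is where the linear scaling in sequence length comes from.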

Architecture Details:
- Encoder: Utilizes VSS blocks and patch merging for downsampling.
- Decoder: Comprises VSS blocks with patch expanding operations to restore feature dimensions.
- Skip Connections: Implemented as simple element-wise addition, which introduces no additional parameters.
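The data flow described by the list above can be traced with a shape-level sketch. It assumes Swin-style patch merging (halve the resolution, double the channels) and patch expanding (double the resolution, halve the channels). The projection weights are random placeholders, the VSS blocks and normalization layers are omitted, and all function names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

def patch_merge(x, w):
    """(H, W, C) -> (H/2, W/2, 2C): concatenate each 2x2 neighborhood, then project 4C -> 2C."""
    H, W, C = x.shape
    x = x.reshape(H // 2, 2, W // 2, 2, C).transpose(0, 2, 1, 3, 4)
    x = x.reshape(H // 2, W // 2, 4 * C)
    return x @ w                     # w has shape (4C, 2C)

def patch_expand(x, w):
    """(H, W, C) -> (2H, 2W, C/2): project C -> 2C, then spread channels over a 2x2 block."""
    H, W, C = x.shape
    x = x @ w                        # w has shape (C, 2C)
    x = x.reshape(H, W, 2, 2, C // 2).transpose(0, 2, 1, 3, 4)
    return x.reshape(2 * H, 2 * W, C // 2)

# One encoder stage, one decoder stage, and an additive skip connection.
x = rng.standard_normal((8, 8, 4))           # encoder feature map (H, W, C)
w_down = rng.standard_normal((16, 8))        # placeholder merge projection
w_up = rng.standard_normal((8, 16))          # placeholder expand projection

down = patch_merge(x, w_down)                # (4, 4, 8)
up = patch_expand(down, w_up)                # (8, 8, 4)
fused = up + x                               # skip connection: no extra parameters
```

Because the skip connection is a plain addition, the decoder feature map must match the encoder feature map in both resolution and channel count, which the merge and expand operations guarantee here.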

Empirical Evaluation

The authors ran experiments on ISIC17 and ISIC18 for skin lesion segmentation and on Synapse for organ segmentation. VM-UNet performed competitively and improved on the compared models in several key metrics:

- On ISIC17 and ISIC18, VM-UNet achieved higher Mean Intersection over Union (mIoU) and Dice Similarity Coefficient (DSC) than the CNN-based and Transformer-based models it was compared against.
- On Synapse, the model improved on existing models such as Swin-UNet in both DSC (higher is better) and 95% Hausdorff Distance (HD95, lower is better).
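For readers unfamiliar with the overlap metrics, the following sketch computes DSC and IoU for a single pair of binary masks. It is a generic illustration, not the paper's evaluation code: the helper name and the toy masks are invented, HD95 is omitted, and the function assumes at least one mask contains a foreground pixel.

```python
import numpy as np

def dice_and_iou(pred, target):
    """Dice Similarity Coefficient and Intersection over Union for binary masks."""
    pred = np.asarray(pred, dtype=bool)
    target = np.asarray(target, dtype=bool)
    tp = np.logical_and(pred, target).sum()    # predicted and true foreground
    fp = np.logical_and(pred, ~target).sum()   # predicted foreground, true background
    fn = np.logical_and(~pred, target).sum()   # missed foreground
    dice = 2 * tp / (2 * tp + fp + fn)
    iou = tp / (tp + fp + fn)
    return float(dice), float(iou)

# Toy masks: 2 overlapping pixels, 1 false positive, 1 false negative.
pred = np.array([[1, 1, 0],
                 [0, 1, 0]])
target = np.array([[1, 0, 0],
                   [0, 1, 1]])
dice, iou = dice_and_iou(pred, target)
```

The two scores are monotonically related through DSC = 2·IoU / (1 + IoU), so they rank a pair of predictions on a single image the same way, although averages over a dataset can differ.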

Implications and Future Directions

This work establishes a baseline for pure SSM-based models in medical image segmentation. The approach promises efficiency through its linear complexity while still capturing long-range dependencies. It also opens avenues for:

- Designing SSM-specific modules tailored to segmentation tasks.
- Exploring model compression techniques to improve applicability in resource-constrained medical settings.
- Investigating higher-resolution segmentation that exploits the ability of SSMs to process long sequences.
- Applying SSMs to other medical imaging tasks such as detection, registration, and reconstruction.

Conclusion

VM-UNet is an early exploration of SSM-based architectures for medical image segmentation. The paper lays a foundation and identifies directions for refining and extending the use of SSMs in medical imaging, with the goal of more efficient and effective medical image analysis.
