SA-Med2D-20M Dataset: Segment Anything in 2D Medical Imaging with 20 Million masks

Published 20 Nov 2023 in eess.IV and cs.CV | (2311.11969v1)

Abstract: Segment Anything Model (SAM) has achieved impressive results for natural image segmentation with input prompts such as points and bounding boxes. Its success largely owes to massive labeled training data. However, directly applying SAM to medical image segmentation cannot perform well because SAM lacks medical knowledge -- it does not use medical images for training. To incorporate medical knowledge into SAM, we introduce SA-Med2D-20M, a large-scale segmentation dataset of 2D medical images built upon numerous public and private datasets. It consists of 4.6 million 2D medical images and 19.7 million corresponding masks, covering almost the whole body and showing significant diversity. This paper describes all the datasets collected in SA-Med2D-20M and details how to process these datasets. Furthermore, comprehensive statistics of SA-Med2D-20M are presented to facilitate the better use of our dataset, which can help the researchers build medical vision foundation models or apply their models to downstream medical applications. We hope that the large scale and diversity of SA-Med2D-20M can be leveraged to develop medical artificial intelligence for enhancing diagnosis, medical image analysis, knowledge sharing, and education. The data with the redistribution license is publicly available at https://github.com/OpenGVLab/SAM-Med2D.



Summary

The paper introduces a large-scale medical segmentation dataset with 4.6M images and 19.7M masks.
It details preprocessing that normalizes voxel values and splits multi-label masks into binary masks.
The dataset spans diverse imaging modalities and anatomical regions, supporting SAM training targeted at clinical applications.

Overview of the SA-Med2D-20M Dataset for 2D Medical Imaging

The paper introduces the SA-Med2D-20M dataset, designed to improve how the Segment Anything Model (SAM) performs on medical image segmentation. SAM has performed well on natural images thanks to large-scale labeled training data, but its effectiveness in the medical domain is limited because it was not trained on medical images. SA-Med2D-20M addresses this limitation with a diverse collection of 4.6 million 2D medical images and 19.7 million segmentation masks, covering a wide range of anatomical structures across ten imaging modalities.

Dataset Composition and Properties

The dataset draws from numerous public and private sources to compile what is presented as the largest available dataset for medical image segmentation at the time of publication. Key features include:

Modality Diversity: Encompassing modalities such as CT, MR, and ultrasound, the dataset captures a comprehensive range of imaging techniques used in clinical settings.
Anatomical Coverage: The dataset categorizes images into anatomical regions, including head and neck, thorax, and abdomen. It also incorporates lesion-focused datasets, which support segmentation of pathological areas.
Label and Image Volume: The dataset categorizes over 219 labels, and each image can be associated with multiple segmentation masks, one per annotated object.
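
As a quick check on what "multiple segmentation masks" per image means at this scale, the headline figures imply an average number of masks per image. The sketch below uses only the rounded totals quoted above (4.6 million images, 19.7 million masks); the variable names are illustrative, and the result is a rough average, not a statistic reported in the paper.

```python
# Rounded totals quoted in the abstract; exact counts may differ slightly.
num_images = 4_600_000
num_masks = 19_700_000

# Rough average number of masks attached to each image.
avg_masks_per_image = num_masks / num_images
print(f"about {avg_masks_per_image:.2f} masks per image")
```

The average says nothing about the distribution: some images may carry a single mask while others carry many.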

Data Processing and Normalization

The dataset was built with preprocessing steps intended to make data from different sources consistent:

Normalization: Voxel values are normalized to a unified scale, so that images from different modalities can be stored in a standard format.
Mask Processing: Original multi-label masks are split into binary masks, one per category, and separate connected components within a category are further split into their own masks. This handles overlapping labels and keeps each mask tied to a single object.
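
The two steps above can be sketched in a few lines. This is a minimal illustration, not the authors' pipeline: the function names `normalize_to_uint8` and `split_mask` are hypothetical, the min-max rescaling to the range 0 to 255 and the use of 4-connectivity are assumptions made for the example, and images are plain nested lists so the sketch runs with the standard library alone.

```python
from collections import deque


def normalize_to_uint8(image):
    """Min-max rescale a 2D image (list of rows) to integers in [0, 255].

    Assumption: the 0-255 target range is chosen for illustration only.
    """
    flat = [v for row in image for v in row]
    lo, hi = min(flat), max(flat)
    if hi == lo:
        # Constant image: map everything to 0 to avoid dividing by zero.
        return [[0 for _ in row] for row in image]
    return [[int(round((v - lo) * 255.0 / (hi - lo))) for v in row]
            for row in image]


def split_mask(label_mask, background=0):
    """Split a multi-label mask into binary masks.

    Returns a list of (label, binary_mask) pairs, one pair per connected
    component of each non-background label (4-connectivity, an assumption).
    """
    h, w = len(label_mask), len(label_mask[0])
    seen = [[False] * w for _ in range(h)]
    components = []
    for r in range(h):
        for c in range(w):
            label = label_mask[r][c]
            if label == background or seen[r][c]:
                continue
            # Breadth-first flood fill over pixels sharing this label.
            binary = [[0] * w for _ in range(h)]
            queue = deque([(r, c)])
            seen[r][c] = True
            while queue:
                y, x = queue.popleft()
                binary[y][x] = 1
                for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
                    if (0 <= ny < h and 0 <= nx < w and not seen[ny][nx]
                            and label_mask[ny][nx] == label):
                        seen[ny][nx] = True
                        queue.append((ny, nx))
            components.append((label, binary))
    return components


# Toy mask: label 1 appears as two separate regions, label 2 as one.
toy_mask = [
    [1, 1, 0, 2],
    [0, 0, 0, 2],
    [1, 0, 0, 0],
]
parts = split_mask(toy_mask)
print([label for label, _ in parts])  # prints [1, 2, 1]
```

In practice an array library would normally replace the nested lists, for example `scipy.ndimage.label` for the connected-component step; the pure-Python version is used here only to keep the sketch dependency-free.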

Implications for Medical AI and Future Work

The SA-Med2D-20M dataset has clear implications for medical AI, particularly medical image segmentation. It can support the development of medical vision foundation models that adapt across diverse clinical tasks. Because large-scale multimodal medical datasets are scarce, the collection is a useful resource for both supervised training and self-supervised learning.

Future work may address data imbalance and incomplete labels, for example through pseudo-labeling, and may expand the dataset further. Collaborative efforts to broaden what the dataset represents could aid the development and validation of robust medical AI models.

Conclusion

SA-Med2D-20M is a large, structured dataset aimed at bridging the gap between natural-image and medical-image segmentation in AI models. Its scale and diversity provide a foundation for future work in medical image analysis and diagnosis support systems.
