Emergent Mind

Abstract

Language-supervised pre-training has proven to be a valuable method for extracting semantically meaningful features from images, serving as a foundational element in multimodal systems within the computer vision and medical imaging domains. However, the resulting features are limited by the information contained in the text. This is particularly problematic in medical imaging, where radiologists' written findings focus on specific observations, a challenge compounded by the scarcity of paired imaging–text data due to concerns over leakage of personal health information. In this work, we fundamentally challenge the prevailing reliance on language supervision for learning general-purpose biomedical image encoders. We introduce RAD-DINO, a biomedical image encoder pre-trained solely on unimodal biomedical imaging data that obtains similar or greater performance than state-of-the-art biomedical language-supervised models on a diverse range of benchmarks. Specifically, the quality of the learned representations is evaluated on standard imaging tasks (classification and semantic segmentation) and on a vision–language alignment task (text report generation from images). To further demonstrate the drawbacks of language supervision, we show that features from RAD-DINO correlate with other patient information (e.g., sex or age) that is generally not mentioned in radiology reports better than features from language-supervised models do. Finally, we conduct a series of ablations determining the factors behind RAD-DINO's performance; notably, we observe that RAD-DINO's downstream performance scales well with the quantity and diversity of training data, demonstrating that image-only supervision is a scalable approach for training a foundational biomedical image encoder.

Figure: Comparison of segmentation results among BiomedCLIP, BioViL-T, and RAD-DINO encoders using a linear decoder head.

Overview

  • RAD-DINO introduces a new biomedical image encoder that is trained using only medical image data, without relying on text data for pre-training.

  • The model performs on par or better than existing language-supervised models in medical imaging tasks such as classification, segmentation, and vision–language alignment.

  • Through ablation studies, factors such as pre-training on general-domain datasets and masked image modeling were found to be significant contributors to RAD-DINO’s performance.

  • RAD-DINO was benchmarked against state-of-the-art models and showed an exceptional ability to correlate features with patient metadata not included in radiology reports.

  • The study encourages further exploration of self-supervised learning in medical imaging AI, suggesting a shift from text-supervised training to methods that utilize rich imaging data.

Introduction to RAD-DINO

The ongoing evolution of the AI field continues to deliver significant improvements in deep learning models, particularly in sectors such as medical imaging. A common approach is to train these models using language-supervised pre-training, which uses text to teach AI systems how to understand and classify images. While this has had considerable success, it also presents challenges, especially when detailed textual data is unavailable or when personal health information must be protected. Here, we introduce and evaluate RAD-DINO, a new biomedical image encoder that breaks away from the norm by using only unimodal biomedical imaging data for pre-training.

Beyond Text Supervision

RAD-DINO challenges the traditional reliance on language supervision in the biomedical imaging domain. It presents an alternative approach in which medical images are used to train an AI model without accompanying text data. In assessments on various medical imaging tasks, including classification, semantic segmentation, and vision–language alignment, RAD-DINO was found to perform similarly to or better than existing language-supervised models.
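RAD-DINO's image-only training follows DINOv2-style self-distillation: a student network learns to match the output of an exponential-moving-average teacher across augmented views of the same image, with output centering to avoid collapse. The sketch below is a minimal, toy illustration of that loop, not the actual implementation; the linear "encoders", dimensions, temperatures, and the noise-based `augment` stand in for the ViT backbone and real image augmentations.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(x, temp):
    z = x / temp
    z -= z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

# Toy "encoders": a single linear layer stands in for the ViT backbone.
dim_in, dim_out = 16, 8
student_w = rng.normal(size=(dim_in, dim_out)) * 0.1
teacher_w = student_w.copy()      # teacher starts as a copy of the student
center = np.zeros(dim_out)        # running output center, to avoid collapse
momentum, lr, s_temp, t_temp = 0.99, 0.1, 0.1, 0.04

def augment(x):
    # Stand-in for the two image "views" (crops/augmentations) used in DINO.
    return x + rng.normal(scale=0.05, size=x.shape)

for step in range(200):
    x = rng.normal(size=(4, dim_in))            # a toy batch of "images"
    v1, v2 = augment(x), augment(x)             # two augmented views

    s_out = softmax(v1 @ student_w, s_temp)               # student on view 1
    t_out = softmax(v2 @ teacher_w - center, t_temp)      # teacher: centered, sharpened

    # Cross-entropy gradient: the student matches the teacher's (fixed) targets.
    grad = v1.T @ (s_out - t_out) / (s_temp * len(x))
    student_w -= lr * grad

    # Teacher weights track the student via an exponential moving average.
    teacher_w = momentum * teacher_w + (1 - momentum) * student_w
    center = 0.9 * center + 0.1 * (v2 @ teacher_w).mean(axis=0)

loss = -(t_out * np.log(s_out + 1e-9)).sum(axis=-1).mean()
print(float(loss))
```

The key design points this illustrates are that no labels or text enter the loop, and that the teacher is never trained directly, only updated as a moving average of the student.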

Interestingly, RAD-DINO also showed an enhanced ability to correlate its features with additional patient information, such as age and sex, that is generally absent from radiology reports. This suggests that RAD-DINO can offer a broader, more holistic representation of clinical imagery than its text-supervised counterparts.

A Deeper Analysis

The researchers conducted comprehensive ablation studies to determine the factors contributing to RAD-DINO's performance. These studies examined how the image encoder responds to elements such as initialization from weights pre-trained on general-domain datasets, the role of masked image modeling, and the impact of image resolution.

Their results established that domain transfer from general image datasets laid a solid foundation for RAD-DINO's success. They also revealed that masked image modeling is particularly beneficial for semantic segmentation, and that downstream performance scales with the quantity and diversity of domain-specific training data.
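The masked-image-modeling objective mentioned above can be illustrated in a few lines: hide a random subset of patch embeddings and score the model only on how well it predicts the hidden ones. The sketch below is a heavily simplified stand-in with made-up dimensions and random linear maps in place of a trained network; in DINOv2/iBOT the backbone's self-attention is what lets masked positions draw on the visible context.

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy setup: an "image" as a 4x4 grid of 8-dim patch embeddings, as in a ViT.
n_patches, dim = 16, 8
patches = rng.normal(size=(n_patches, dim))

# 1. Randomly mask a fraction of the patches.
mask_ratio = 0.4
mask = rng.random(n_patches) < mask_ratio
mask[0] = True                     # ensure at least one masked patch in this toy

# 2. Replace masked patches with a shared (normally learnable) mask token.
mask_token = np.zeros(dim)
corrupted = np.where(mask[:, None], mask_token, patches)

# 3. A stand-in "encoder" and prediction head process the corrupted grid;
#    in practice this is the ViT backbone plus a projection head.
encoder = rng.normal(size=(dim, dim)) * 0.1
head = rng.normal(size=(dim, dim)) * 0.1
features = np.tanh(corrupted @ encoder)
predicted = features @ head

# 4. The MIM loss is computed only at masked positions: the model must infer
#    the hidden patches' content from the visible context.
loss = ((predicted - patches) ** 2)[mask].mean()
print(int(mask.sum()), float(loss))
```

Because the loss touches only masked positions, the encoder is pushed to model local structure and spatial context, which is a plausible reason the ablations find MIM especially helpful for dense tasks like segmentation.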

Benchmarking RAD-DINO

RAD-DINO’s effectiveness was benchmarked against a series of state-of-the-art models across multiple medical datasets. From image classification to the more complex task of generating text reports from medical images, RAD-DINO held its own. It particularly excelled at correlating with patient metadata such as age and sex, which is typically not detailed in text reports. This marks a step towards AI systems that generalize better across real-world medical imaging applications.
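A standard way to measure whether a frozen encoder has captured such metadata is a linear probe: train only a linear classifier on the fixed embeddings and check whether the attribute is decodable. Below is a self-contained toy version; the synthetic features, dimensions, and the way the label shifts the feature distribution are all assumptions made up for illustration, standing in for frozen RAD-DINO embeddings and real patient records.

```python
import numpy as np

rng = np.random.default_rng(2)

# Toy stand-in for frozen image embeddings: 64-dim features where a binary
# metadata attribute (e.g., patient sex) shifts the distribution slightly.
n, dim = 400, 64
labels = rng.integers(0, 2, size=n)
direction = rng.normal(size=dim)
features = rng.normal(size=(n, dim)) + np.outer(labels - 0.5, direction)

# Linear probe: logistic regression trained on the frozen features only.
w, b, lr = np.zeros(dim), 0.0, 0.1
for _ in range(300):
    p = 1 / (1 + np.exp(-(features @ w + b)))   # predicted probabilities
    g = p - labels                              # logistic-loss gradient signal
    w -= lr * features.T @ g / n
    b -= lr * g.mean()

acc = float(((p > 0.5) == labels).mean())
print(acc)
```

If the probe reaches high accuracy, the attribute is linearly decodable from the embeddings; the paper's finding is that this holds more strongly for RAD-DINO's image-only features than for language-supervised ones.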

Conclusion and Future Implications

The findings suggest a paradigm shift in how foundational biomedical image encoders can be trained. By leveraging vast amounts of imaging data while bypassing the restrictions of language supervision, RAD-DINO opens up possibilities for medical AI applications that are more versatile, scalable, and perhaps more attuned to the nuanced needs of healthcare diagnostics. The study makes a compelling argument for the AI community to further explore self-supervised learning, particularly in the crucial field of medical imaging.
