
Abstract

To handle the large scale of whole slide images in computational pathology, most approaches first tessellate the images into smaller patches, extract features from these patches, and finally aggregate the feature vectors with weakly-supervised learning. The performance of this workflow strongly depends on the quality of the extracted features. Recently, foundation models in computer vision showed that leveraging huge amounts of data through supervised or self-supervised learning improves feature quality and generalizability for a variety of tasks. In this study, we benchmark the most popular vision foundation models as feature extractors for histopathology data. We evaluate the models in two settings: slide-level classification and patch-level classification. We show that foundation models are a strong baseline. Our experiments demonstrate that by finetuning a foundation model on a single GPU for only two hours or three days, depending on the dataset, we can match or outperform state-of-the-art feature extractors for computational pathology. These findings imply that even with limited resources one can finetune a feature extractor tailored towards a specific downstream task and dataset. This is a considerable shift from the current state, where only a few institutions with large amounts of resources and data are able to train a feature extractor. We publish all code used for training and evaluation as well as the finetuned models.
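To make the workflow described in the abstract concrete, here is a minimal sketch of how precomputed patch features are typically aggregated into a slide-level prediction with attention-based multiple-instance learning. This is an illustrative example assuming PyTorch, not the paper's exact architecture; the class name, layer sizes, and feature dimension are assumptions.

```python
import torch
import torch.nn as nn

class AttentionMIL(nn.Module):
    """Aggregates one bag of patch embeddings into a slide-level prediction."""
    def __init__(self, feat_dim=384, hidden_dim=128, n_classes=2):
        super().__init__()
        # Scores how much each patch contributes to the slide representation.
        self.attention = nn.Sequential(
            nn.Linear(feat_dim, hidden_dim),
            nn.Tanh(),
            nn.Linear(hidden_dim, 1),
        )
        self.classifier = nn.Linear(feat_dim, n_classes)

    def forward(self, patch_feats):
        # patch_feats: (n_patches, feat_dim) precomputed embeddings for one slide
        weights = torch.softmax(self.attention(patch_feats), dim=0)  # (n_patches, 1)
        slide_feat = (weights * patch_feats).sum(dim=0)              # (feat_dim,)
        return self.classifier(slide_feat)                           # slide-level logits

# Example: 1,000 patch embeddings of dimension 384 (e.g., a ViT-S output size).
logits = AttentionMIL()(torch.randn(1000, 384))
```

Because only this small aggregator is trained on slide labels while the feature extractor stays fixed, the quality of the patch embeddings largely determines downstream performance, which is exactly what the benchmark measures.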

Overview

  • The study analyzes the performance of foundation models in histopathological image analysis and their effectiveness after being fine-tuned with low resources.

  • Two classification levels, slide-level and patch-level, were tested using models such as ResNet50, ImageBind, SAM, BEiT and DINOv2 against histopathology-specific models like CTransPath and RetCCL.

  • The research utilizes three colorectal cancer datasets—TCGA, CPTAC, and NCT-CRC—to evaluate model efficiency and efficacy.

  • DINOv2, particularly its smaller ViT-S variant, demonstrated superior performance while requiring significantly fewer computational resources and less training time than domain-specific models.

  • Fine-tuning foundation models with minimal resources showed promise as an accessible route to state-of-the-art diagnostic tools in medical imaging.

Introduction

The power of AI in medical image analysis is rapidly evolving, especially in histopathology, where the assessment of tissue slides can lead to crucial discoveries and diagnoses. This post explores research showing the substantial potential of leveraging foundation models, commonly used in computer vision, for histopathological data analysis. By fine-tuning these models with minimal resources, impressive results can be achieved, matching or surpassing current state-of-the-art methods in the domain.

Experiments & Methods

This research evaluates prominent vision foundation models as feature extractors for histopathological data. In two settings, slide-level and patch-level classification, the study assesses popular models such as ResNet50, ImageBind, SAM, BEiT, and DINOv2 against histopathology-specific models like CTransPath and RetCCL. The models are tested on three colorectal cancer datasets (TCGA and CPTAC for slide-level classification, NCT-CRC for patch-level classification) to compare their efficiency and efficacy.
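As a concrete illustration of the feature-extraction step being benchmarked, the sketch below embeds individual tissue patches with a frozen DINOv2 ViT-S backbone. The torch.hub call is assumed to follow the public facebookresearch/dinov2 release; the preprocessing choices and file name are illustrative, not necessarily the study's exact setup.

```python
import torch
from torchvision import transforms
from PIL import Image

# Load DINOv2 ViT-S/14 from the public release and freeze it for inference.
model = torch.hub.load("facebookresearch/dinov2", "dinov2_vits14")
model.eval()

# ImageNet-style preprocessing; 224 is divisible by the ViT's 14-pixel patches.
preprocess = transforms.Compose([
    transforms.Resize(224),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                         std=[0.229, 0.224, 0.225]),
])

@torch.no_grad()
def embed_patch(path):
    patch = preprocess(Image.open(path).convert("RGB")).unsqueeze(0)
    return model(patch).squeeze(0)  # 384-dimensional embedding for ViT-S

# feature = embed_patch("patch_0001.png")  # hypothetical patch file
```

The resulting embeddings (one per patch) are what the slide-level aggregator consumes, so swapping the backbone is the only change needed to compare different feature extractors.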

An intriguing aspect of this study is its focus on DINOv2, a self-supervised teacher-student model originally trained on a large dataset of natural images, in the realm of medical imaging. Evaluating it after fine-tuning on task-specific pathology datasets demonstrates its viability for medical applications.
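For context on the teacher-student idea, the following simplified sketch shows the core of DINO-style self-distillation: the student is trained to match the teacher's output on a differently augmented view of the same image, while the teacher is updated as an exponential moving average (EMA) of the student. This is a conceptual approximation, not the full DINOv2 recipe, which additionally uses output centering, multi-crop augmentation, and a masked-image (iBOT) objective; all names and hyperparameter values here are illustrative.

```python
import torch
import torch.nn.functional as F

def dino_train_step(student, teacher, optimizer, view1, view2,
                    temp_student=0.1, temp_teacher=0.04, ema=0.996):
    # The teacher sees one augmented view; no gradients flow through it.
    with torch.no_grad():
        targets = F.softmax(teacher(view1) / temp_teacher, dim=-1)

    # The student sees the other view and is trained to match the teacher.
    log_preds = F.log_softmax(student(view2) / temp_student, dim=-1)
    loss = -(targets * log_preds).sum(dim=-1).mean()

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

    # The teacher's weights are an exponential moving average of the student's.
    with torch.no_grad():
        for p_t, p_s in zip(teacher.parameters(), student.parameters()):
            p_t.mul_(ema).add_(p_s, alpha=1 - ema)

    return loss.item()

# The teacher is typically initialized as a frozen copy of the student
# (e.g., copy.deepcopy) before training starts.
```

Fine-tuning in this setting means continuing such self-supervised training on patches from the target pathology dataset, which is why it needs no additional annotations.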

Results

The findings are striking: after fine-tuning, the foundation model DINOv2 matches or even exceeds the performance of histopathology-specific feature extractors like CTransPath and RetCCL. Notably, the smaller DINOv2 variant (ViT-S) outperformed the larger one (ViT-g) across tasks. Moreover, the fine-tuned models required only a fraction of the computational resources and training time of the domain-specific models: with just two hours of training on a single GPU, DINOv2 achieved results comparable to CTransPath, which demanded 250 hours of training on 48 NVIDIA V100 GPUs (roughly 12,000 GPU-hours versus about two).

Conclusion

This pivotal research underscores a potentially transformative approach for histopathology: fine-tuning foundation models with minimal resources for specific tasks can rival or even outdo heavily resource-dependent, domain-specific feature extraction models. These foundation models, once fine-tuned, have demonstrated a remarkable capacity for adaptation to medical imaging tasks, suggesting that institutions with limited resources might gain access to state-of-the-art AI diagnostic tools. As the research was performed on a limited number of datasets, the team points towards the need for further validation across more varied benchmarks. Nevertheless, the initial results pave the way for broader applications and accessibility of advanced medical imaging analysis techniques.
