
Abstract

To handle the large scale of whole slide images in computational pathology, most approaches first tessellate the images into smaller patches, extract features from these patches, and finally aggregate the feature vectors with weakly-supervised learning. The performance of this workflow strongly depends on the quality of the extracted features. Recently, foundation models in computer vision showed that leveraging huge amounts of data through supervised or self-supervised learning improves feature quality and generalizability for a variety of tasks. In this study, we benchmark the most popular vision foundation models as feature extractors for histopathology data. We evaluate the models in two settings: slide-level classification and patch-level classification. We show that foundation models are a strong baseline. Our experiments demonstrate that by finetuning a foundation model on a single GPU for only two hours or three days, depending on the dataset, we can match or outperform state-of-the-art feature extractors for computational pathology. These findings imply that even with limited resources one can finetune a feature extractor tailored towards a specific downstream task and dataset. This is a considerable shift from the current state, where only a few institutions with large amounts of resources and data are able to train a feature extractor. We publish all code used for training and evaluation as well as the finetuned models.
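To make the workflow described in the abstract concrete, here is a minimal sketch of how precomputed patch features are typically aggregated into a slide-level prediction with attention-based multiple-instance learning. This is an illustrative example assuming PyTorch, not the paper's exact architecture; the class name, layer sizes, and feature dimension are assumptions.

```python
import torch
import torch.nn as nn

class AttentionMIL(nn.Module):
    """Aggregates one bag of patch embeddings into a slide-level prediction."""
    def __init__(self, feat_dim=384, hidden_dim=128, n_classes=2):
        super().__init__()
        # Scores how much each patch contributes to the slide representation.
        self.attention = nn.Sequential(
            nn.Linear(feat_dim, hidden_dim),
            nn.Tanh(),
            nn.Linear(hidden_dim, 1),
        )
        self.classifier = nn.Linear(feat_dim, n_classes)

    def forward(self, patch_feats):
        # patch_feats: (n_patches, feat_dim) precomputed embeddings for one slide
        weights = torch.softmax(self.attention(patch_feats), dim=0)  # (n_patches, 1)
        slide_feat = (weights * patch_feats).sum(dim=0)              # (feat_dim,)
        return self.classifier(slide_feat)                           # slide-level logits

# Example: 1,000 patch embeddings of dimension 384 (e.g., a ViT-S output size).
logits = AttentionMIL()(torch.randn(1000, 384))
```

Because only this small aggregator is trained on slide labels while the feature extractor stays fixed, the quality of the patch embeddings largely determines downstream performance, which is exactly what the benchmark measures.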

Overview

  • The study analyzes the performance of foundation models in histopathological image analysis and their effectiveness after being fine-tuned with low resources.

  • Two classification levels, slide-level and patch-level, were tested using models such as ResNet50, ImageBind, SAM, BEiT and DINOv2 against histopathology-specific models like CTransPath and RetCCL.

  • The research utilizes three colorectal cancer datasets—TCGA, CPTAC, and NCT-CRC—to evaluate model efficiency and efficacy.

  • DINOv2, particularly its smaller ViT-S variant, demonstrated superior performance while requiring significantly fewer computational resources and less training time than domain-specific models.

  • Fine-tuning foundation models with minimal resources showed promise as an accessible route to state-of-the-art diagnostic tools in medical imaging.

Introduction

The power of AI in medical image analysis is rapidly evolving, especially in histopathology, where the assessment of tissue slides can lead to crucial discoveries and diagnoses. This post explores research showing the substantial potential of leveraging foundation models, commonly used in computer vision, for histopathological data analysis. By fine-tuning these models with minimal resources, impressive results can be achieved, matching or surpassing current state-of-the-art methods in the domain.

Experiments & Methods

This research evaluates prominent vision foundation models as feature extractors for histopathological data. In two settings, slide-level and patch-level classification, the study assesses popular models such as ResNet50, ImageBind, SAM, BEiT, and DINOv2 against histopathology-specific models like CTransPath and RetCCL. The models are tested on three colorectal cancer datasets (TCGA and CPTAC for slide-level classification, NCT-CRC for patch-level classification) to compare their efficiency and efficacy.
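As a concrete illustration of the feature-extraction step being benchmarked, the sketch below embeds individual tissue patches with a frozen DINOv2 ViT-S backbone. The torch.hub call is assumed to follow the public facebookresearch/dinov2 release; the preprocessing choices and file name are illustrative, not necessarily the study's exact setup.

```python
import torch
from torchvision import transforms
from PIL import Image

# Load DINOv2 ViT-S/14 from the public release and freeze it for inference.
model = torch.hub.load("facebookresearch/dinov2", "dinov2_vits14")
model.eval()

# ImageNet-style preprocessing; 224 is divisible by the ViT's 14-pixel patches.
preprocess = transforms.Compose([
    transforms.Resize(224),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                         std=[0.229, 0.224, 0.225]),
])

@torch.no_grad()
def embed_patch(path):
    patch = preprocess(Image.open(path).convert("RGB")).unsqueeze(0)
    return model(patch).squeeze(0)  # 384-dimensional embedding for ViT-S

# feature = embed_patch("patch_0001.png")  # hypothetical patch file
```

The resulting embeddings (one per patch) are what the slide-level aggregator consumes, so swapping the backbone is the only change needed to compare different feature extractors.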

An intriguing aspect of this study is its focus on DINOv2, a self-supervised teacher-student model originally trained on a large dataset of natural images, in the realm of medical imaging. Evaluating it after fine-tuning on task-specific pathology datasets demonstrates its viability for medical applications.
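For context on the teacher-student idea, the following simplified sketch shows the core of DINO-style self-distillation: the student is trained to match the teacher's output on a differently augmented view of the same image, while the teacher is updated as an exponential moving average (EMA) of the student. This is a conceptual approximation, not the full DINOv2 recipe, which additionally uses output centering, multi-crop augmentation, and a masked-image (iBOT) objective; all names and hyperparameter values here are illustrative.

```python
import torch
import torch.nn.functional as F

def dino_train_step(student, teacher, optimizer, view1, view2,
                    temp_student=0.1, temp_teacher=0.04, ema=0.996):
    # The teacher sees one augmented view; no gradients flow through it.
    with torch.no_grad():
        targets = F.softmax(teacher(view1) / temp_teacher, dim=-1)

    # The student sees the other view and is trained to match the teacher.
    log_preds = F.log_softmax(student(view2) / temp_student, dim=-1)
    loss = -(targets * log_preds).sum(dim=-1).mean()

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

    # The teacher's weights are an exponential moving average of the student's.
    with torch.no_grad():
        for p_t, p_s in zip(teacher.parameters(), student.parameters()):
            p_t.mul_(ema).add_(p_s, alpha=1 - ema)

    return loss.item()

# The teacher is typically initialized as a frozen copy of the student
# (e.g., copy.deepcopy) before training starts.
```

Fine-tuning in this setting means continuing such self-supervised training on patches from the target pathology dataset, which is why it needs no additional annotations.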

Results

The findings are striking: after fine-tuning, the foundation model DINOv2 matches or even exceeds the performance of histopathology-specific feature extractors like CTransPath and RetCCL. Notably, the smaller DINOv2 variant (ViT-S) outperformed the larger one (ViT-g) across tasks. Moreover, the fine-tuned models required only a fraction of the computational resources and training time of the domain-specific models: with just two hours of training on a single GPU, DINOv2 achieved results comparable to CTransPath, which demanded 250 hours of training on 48 NVIDIA V100 GPUs (roughly 12,000 GPU-hours versus about two).

Conclusion

This pivotal research underscores a potentially transformative approach for histopathology: fine-tuning foundation models with minimal resources for specific tasks can rival or even outdo heavily resource-dependent, domain-specific feature extraction models. These foundation models, once fine-tuned, have demonstrated a remarkable capacity for adaptation to medical imaging tasks, suggesting that institutions with limited resources might gain access to state-of-the-art AI diagnostic tools. As the research was performed on a limited number of datasets, the team points towards the need for further validation across more varied benchmarks. Nevertheless, the initial results pave the way for broader applications and accessibility of advanced medical imaging analysis techniques.
