- The paper presents HIPT, a hierarchical Vision Transformer that adapts to gigapixel whole-slide images through dual-stage self-supervised pretraining.
- It aggregates multi-resolution visual tokens, enabling robust capture of fine and coarse image details crucial for histopathological analysis.
- HIPT achieves superior performance in cancer subtyping and survival prediction, even with reduced training data, signaling new avenues for scalable deep learning in pathology.
Scaling Vision Transformers to Gigapixel Images via Hierarchical Self-Supervised Learning
The paper "Scaling Vision Transformers to Gigapixel Images via Hierarchical Self-Supervised Learning" presents a methodological advancement in the adaptation of Vision Transformers (ViTs) for high-resolution analysis of gigapixel whole-slide images (WSIs) in computational pathology. Traditional usage of ViTs has been confined largely to low-resolution images, but WSIs present a scale and complexity that necessitate innovative strategies to capture meaningful representations.
The authors introduce the Hierarchical Image Pyramid Transformer (HIPT), a hierarchical ViT architecture designed to handle the vast size and intricate structure of WSIs. HIPT exploits the natural hierarchy of a slide, from fine-grained cellular detail to macro-scale tissue organization, by stacking ViTs over three resolution levels: 16×16-pixel visual tokens are aggregated within 256×256 patches, patch embeddings are aggregated within 4096×4096 regions, and region embeddings are aggregated into a slide-level representation. The two lower levels are pretrained with self-supervised learning in two successive stages.
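As a concrete illustration of this three-level design, the PyTorch sketch below aggregates 16×16-pixel tokens into a 256×256 patch embedding, patch embeddings into a 4096×4096 region embedding, and region embeddings into a slide representation. It is a minimal sketch under stated assumptions, not the authors' released code: the class name `HierarchicalWSIEncoder`, the embedding dimension, the depths, and the use of `nn.TransformerEncoder` as a stand-in for the paper's ViT blocks are illustrative, and positional embeddings and pretraining details are omitted.

```python
import torch
import torch.nn as nn

def tiny_vit(dim, depth, heads):
    # Stand-in for a ViT encoder: a stack of standard Transformer blocks.
    layer = nn.TransformerEncoderLayer(
        d_model=dim, nhead=heads, dim_feedforward=4 * dim, batch_first=True
    )
    return nn.TransformerEncoder(layer, num_layers=depth)

class HierarchicalWSIEncoder(nn.Module):
    """Illustrative three-level aggregation: 16x16-px tokens -> 256x256 patch
    embedding -> 4096x4096 region embedding -> slide representation."""
    def __init__(self, dim=192):
        super().__init__()
        self.patch_embed = nn.Conv2d(3, dim, kernel_size=16, stride=16)  # 16x16-px tokens
        self.vit_256 = tiny_vit(dim, depth=4, heads=3)    # aggregates one 256x256 patch
        self.vit_4096 = tiny_vit(dim, depth=2, heads=3)   # aggregates one 4096x4096 region
        self.vit_slide = tiny_vit(dim, depth=2, heads=3)  # aggregates the whole slide
        self.cls_256 = nn.Parameter(torch.zeros(1, 1, dim))
        self.cls_4096 = nn.Parameter(torch.zeros(1, 1, dim))

    def encode_patch(self, patch):             # patch: (B, 3, 256, 256)
        tok = self.patch_embed(patch)          # (B, dim, 16, 16): 256 tokens of 16x16 px
        tok = tok.flatten(2).transpose(1, 2)   # (B, 256, dim)
        tok = torch.cat([self.cls_256.expand(len(tok), -1, -1), tok], dim=1)
        return self.vit_256(tok)[:, 0]         # patch-level [CLS] embedding

    def encode_region(self, region):           # region: (B, 3, 4096, 4096)
        B = region.shape[0]
        patches = region.unfold(2, 256, 256).unfold(3, 256, 256)   # 16x16 grid of patches
        patches = patches.permute(0, 2, 3, 1, 4, 5).reshape(-1, 3, 256, 256)
        patch_emb = self.encode_patch(patches).reshape(B, 256, -1) # (B, 256, dim)
        patch_emb = torch.cat([self.cls_4096.expand(B, -1, -1), patch_emb], dim=1)
        return self.vit_4096(patch_emb)[:, 0]  # region-level [CLS] embedding

    def forward(self, regions):                # regions: (B, M, 3, 4096, 4096) per slide
        # In practice patches and regions are streamed in chunks and the lower
        # levels are precomputed with frozen pretrained weights; this sketch
        # runs everything in one pass for clarity.
        B, M = regions.shape[:2]
        region_emb = self.encode_region(regions.flatten(0, 1)).reshape(B, M, -1)
        return self.vit_slide(region_emb).mean(dim=1)   # slide-level representation
```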
Self-supervised pretraining is a cornerstone of this approach. HIPT is pretrained on 10,678 gigapixel WSIs spanning 33 cancer types, yielding 408,218 4096×4096 regions and roughly 104 million 256×256 patches for pretraining. The pretrained representations are then evaluated on 9 slide-level tasks, where HIPT outperforms existing methods in cancer subtyping and survival prediction.
HIPT's gains are most pronounced in data-efficient settings, for example when only 25% of the training data is available, which suggests value for label-scarce clinical applications. In survival prediction, its ability to model context-aware dependencies across large tissue regions yields clear improvements in the concordance index over prior models.
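The concordance index mentioned above is the standard ranking metric for survival models: the fraction of comparable patient pairs whose predicted risk ordering matches their observed survival ordering. The function below is a plain-NumPy illustration of Harrell's C-index, not code from the paper, and the toy `risk`, `time`, and `event` arrays are made up for demonstration; a real evaluation would use a library implementation.

```python
import numpy as np

def concordance_index(risk, time, event):
    """Harrell's C-index: among comparable patient pairs, the fraction whose
    predicted risks are ordered consistently with observed survival times.
    risk  - higher value means predicted shorter survival
    time  - observed follow-up time
    event - 1 if the event (death) was observed, 0 if censored
    """
    concordant, ties, comparable = 0.0, 0.0, 0
    n = len(time)
    for i in range(n):
        for j in range(n):
            # Pair (i, j) is comparable if i's event occurred before j's follow-up ended.
            if event[i] == 1 and time[i] < time[j]:
                comparable += 1
                if risk[i] > risk[j]:
                    concordant += 1
                elif risk[i] == risk[j]:
                    ties += 1
    return (concordant + 0.5 * ties) / comparable

# Toy example: a model that ranks earlier deaths as higher risk scores 1.0.
risk = np.array([0.9, 0.7, 0.2, 0.1])
time = np.array([2.0, 3.0, 5.0, 7.0])
event = np.array([1, 1, 0, 1])
print(concordance_index(risk, time, event))  # -> 1.0
```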
By reusing ViT blocks for both tokenization and aggregation and pretraining them hierarchically, the paper shows how ViTs can capture multi-scale context and structural complexity in image data much as long-document models treat text: the embedding produced at one level becomes an input token at the next. This makes self-supervised pretraining tractable for regions as large as 4096×4096 pixels, far beyond the image sizes handled in standard natural-image pretraining.
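To make the staged pretraining idea concrete, the sketch below reuses the illustrative `HierarchicalWSIEncoder` defined earlier: the patch-level encoder is assumed already pretrained and is frozen, and the region-level encoder is then trained over the resulting grids of patch embeddings, treating them like word tokens in a document. The augmentation, the cosine-similarity view-invariance loss, and the `region_loader` data are hedged stand-ins for the paper's actual self-supervised objective and data pipeline.

```python
import torch
import torch.nn.functional as F

# Stage 1 is assumed done: the 256-level encoder (patch_embed + vit_256) is pretrained.
# Stage 2 pretrains the 4096-level encoder on sequences of frozen patch embeddings,
# analogous to pretraining a language model over token sequences.
model = HierarchicalWSIEncoder(dim=192)  # illustrative class from the earlier sketch
for p in list(model.patch_embed.parameters()) + list(model.vit_256.parameters()):
    p.requires_grad = False              # freeze the lower level
model.cls_256.requires_grad = False

params = list(model.vit_4096.parameters()) + [model.cls_4096]
optimizer = torch.optim.AdamW(params, lr=5e-4)

def augment(region):
    # Placeholder augmentation; in practice, spatial and color augmentations of the region.
    return region + 0.05 * torch.randn_like(region)

# Toy stand-in for a loader of 4096x4096 tissue regions cropped from WSIs
# (memory-heavy; real pipelines stream crops from disk).
region_loader = [torch.rand(1, 3, 4096, 4096) for _ in range(2)]

for region in region_loader:
    z1 = model.encode_region(augment(region))  # two augmented views of the same region
    z2 = model.encode_region(augment(region))
    # Generic view-invariance loss standing in for the paper's self-supervised objective:
    # pull the two region-level embeddings together.
    loss = 1.0 - F.cosine_similarity(z1, z2).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```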
Future work may extend HIPT to broader datasets and additional imaging modalities. More generally, refining slide-level representations through hierarchical aggregation opens possibilities beyond pathology, in any field that requires fine-grained analysis of very large images.
Overall, HIPT illustrates how hierarchical self-supervised learning exploits structure at multiple scales to address the computational challenges of gigapixel image data, and it points toward deep learning architectures that embrace, rather than sidestep, the complexity of pathology images.