Multi-Granularity Cross-modal Alignment for Generalized Medical Visual Representation Learning

Published 12 Oct 2022 in cs.CV, cs.AI, and cs.CL | (2210.06044v1)

Abstract: Learning medical visual representations directly from paired radiology reports has become an emerging topic in representation learning. However, existing medical image-text joint learning methods are limited by instance or local supervision analysis, ignoring disease-level semantic correspondences. In this paper, we present a novel Multi-Granularity Cross-modal Alignment (MGCA) framework for generalized medical visual representation learning by harnessing the naturally exhibited semantic correspondences between medical image and radiology reports at three different levels, i.e., pathological region-level, instance-level, and disease-level. Specifically, we first incorporate the instance-wise alignment module by maximizing the agreement between image-report pairs. Further, for token-wise alignment, we introduce a bidirectional cross-attention strategy to explicitly learn the matching between fine-grained visual tokens and text tokens, followed by contrastive learning to align them. More important, to leverage the high-level inter-subject relationship semantic (e.g., disease) correspondences, we design a novel cross-modal disease-level alignment paradigm to enforce the cross-modal cluster assignment consistency. Extensive experimental results on seven downstream medical image datasets covering image classification, object detection, and semantic segmentation tasks demonstrate the stable and superior performance of our framework.

Abstract PDF Upgrade to Chat

Authors (5)

Citations (104)

View on Semantic Scholar

Summary

The paper introduces a three-level alignment strategy that aligns images and reports at pathological region, instance, and disease levels.
Experimental results show significant improvements in classification, detection, and segmentation tasks across multiple medical imaging datasets.
The MGCA framework demonstrates data efficiency by learning discriminative visual features even with limited labeled data.

The paper introduces the Multi-Granularity Cross-modal Alignment (MGCA) framework, aiming to improve medical visual representation learning by utilizing semantic correspondences across medical images and radiology reports. It adopts a three-level alignment approach—pathological region-level, instance-level, and disease-level—to enhance the representation capabilities for downstream tasks like image classification, object detection, and semantic segmentation. The framework demonstrates superior performance in these contexts, particularly with limited labeled data.

Introduction

Traditional deep learning and contrastive learning approaches have been effective in medical imaging tasks but struggle to incorporate the small yet crucial pathological areas in medical images and align them adequately with corresponding radiology reports. This paper addresses these limitations by proposing an MGCA framework that leverages multi-granularity semantic correspondences. With instance-wise, token-wise, and disease-level alignments, the MGCA framework aims to facilitate more efficient and generalizable visual learning, especially in environments with limited annotated data.

Figure 1: The multi-granularity (disease-level, instance-level, and pathological region-level) semantic correspondences across medical images and radiology reports.

Method

Multi-granularity Alignment Framework

MGCA employs a hierarchical vision-language semantic alignment across three levels:

Instance-wise Alignment (ITA): Uses InfoNCE losses to differentiate true medical image-report pairs from random pairs based on cosine similarity, focusing primarily on aligning image-report pairs globally.
Token-wise Alignment (CTA): Implements a bidirectional cross-attention mechanism to learn fine-grained token correspondences within image-text pairs, optimizing mutual information at the token level and providing better localized performance.
Disease-level Alignment (CPA): Introduces a cross-modal prototype mechanism that exploits high-level semantic consistency between images and reports, thereby capturing disease-level relationships across modalities.
Figure 2: Illustration of our proposed multi-granularity cross-modal alignment framework. CTA, ITA, and CPA represent token-wise alignment, instance-wise alignment, and prototype (disease)-level alignment respectively. The green arrow represents information flow of visual features, while the purple arrow represents information flow of textural features.

Experiments

Pre-training and Evaluation

MGCA was pre-trained using the large-scale MIMIC-CXR dataset, and its effectiveness was evaluated across seven datasets representing various tasks in medical imaging. The representation learning was fine-tuned on medical image classification, object detection, and semantic segmentation tasks.

Results

MGCA demonstrated substantial performance improvements over comparable frameworks such as GLoRIA and ConVIRT, particularly in environments with limited labeled data. For instance, the classification tasks on the COVIDx dataset showed significant accuracy improvements, asserting the framework's data-efficient capabilities. The object detection and segmentation tasks further exemplified MGCA’s ability to learn discriminative and localized representations.

Figure 3: Visualization of learned token correspondence by our MGCA. Highlighted pixels represent higher activation weights by corresponding word.

Conclusion

MGCA effectively exploits the inherent semantic structures between medical images and radiology reports across multiple levels, offering a robust alternative for generalized medical visual representation learning. Despite its promising application in medical diagnosis with reduced annotation requirements, future work could explore integration of discrimination and generation-based pre-training methods, as well as holistically leveraging the multi-granularity correspondences.

The proposed framework demonstrates strong potential for reducing radiologist workload and enhancing diagnostic efficiency, although caution should be exercised regarding usage with data containing personal information. Overall, MGCA represents a step forward in the application of AI to medical imaging, with significant implications for future research and healthcare improvement strategies.

Markdown Report Issue