- The paper introduces a probing evaluation with adversarial questions to reveal critical vulnerabilities in large multimodal models for medical VQA.
- It constructs the ProbMed dataset with 6,303 images and 57,132 QA pairs to assess performance across modality recognition, organ identification, and positional reasoning.
- Results show that top models like GPT-4V and Gemini Pro perform worse than random guessing on specialized diagnostic questions once adversarial pairs are introduced, emphasizing the need for robust evaluation.
Worse than Random? An Embarrassingly Simple Probing Evaluation of Large Multimodal Models in Medical VQA
Introduction
The paper investigates the performance of Large Multimodal Models (LMMs) in the context of medical Visual Question Answering (Med-VQA). While LMMs have demonstrated high accuracy on existing benchmarks, their reliability under probing evaluations has not been systematically assessed. This paper introduces a dataset, ProbMed, designed to rigorously evaluate LMMs using adversarial questioning, revealing significant vulnerabilities in their performance on specialized medical diagnosis tasks.
Probing Evaluation Methodology
The authors propose a simple probing evaluation technique that pairs each original question with an adversarial question about a hallucinated attribute, i.e., a modality, organ, condition, or position that is not actually present in the image. This pairing challenges the robustness of LMMs by requiring them to distinguish actual from hallucinated medical findings rather than simply answering affirmatively. The significant performance drops that models exhibit on these paired questions highlight their unreliability under adversarial conditions; a minimal sketch of the pairing appears below Figure 1.
Figure 1: Accuracy of four LMMs on two types of specialized questions in medical diagnoses, with and without adversarial pairs. The significant drop in accuracy with adversarial pairs highlights the models' unreliability in handling medical diagnoses.
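To make the pairing concrete, here is a minimal sketch of how an adversarial counterpart can be generated for a ground-truth attribute. The function name, the fixed modality list, and the restriction to the modality attribute are illustrative assumptions, not the authors' implementation; ProbMed applies the same idea to organs, conditions, and positions.

```python
import random

# Illustrative sketch of adversarial question pairing (not the paper's code).
# For each ground-truth attribute we emit the original yes-question plus a
# "hallucinated" no-question built from an attribute absent from the image.

MODALITIES = ["X-ray", "MRI", "CT scan"]  # assumed label set for illustration

def make_probing_pair(true_modality: str) -> list[tuple[str, str]]:
    """Return (question, expected_answer) pairs probing one image's modality."""
    original = (f"Is this image a {true_modality}?", "yes")
    # Sample a distractor modality the image does NOT have.
    hallucinated = random.choice([m for m in MODALITIES if m != true_modality])
    adversarial = (f"Is this image a {hallucinated}?", "no")
    return [original, adversarial]

# A model that answers "yes" to everything gets the original question right
# but fails the adversarial counterpart, so blanket agreement no longer scores.
print(make_probing_pair("X-ray"))
```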
The ProbMed Dataset
ProbMed is constructed from 6,303 images from the MedICaT and ChestX-ray14 datasets, encompassing a range of modalities and organs. With 57,132 question-answer pairs, ProbMed facilitates comprehensive assessments across multiple diagnostic dimensions, including modality recognition, organ identification, and positional reasoning. The dataset is designed to expose model weaknesses through adversarial pairs, providing a more robust evaluation framework.
Figure 2: Flow diagram of the ProbMed data curation process. Two comprehensive biomedical datasets were utilized to collect source data and construct a metadata file, enabling the automatic generation of high-quality question-answer pairs for the ProbMed dataset.
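The curation flow above can be pictured as a small metadata-driven generator. The record fields below (`modality`, `organ`, `findings`) are hypothetical stand-ins for the ProbMed metadata file, assumed purely for illustration; the actual schema may differ.

```python
from dataclasses import dataclass, field

@dataclass
class ImageMetadata:
    """Assumed shape of one metadata entry; not the actual ProbMed schema."""
    image_id: str
    modality: str                                            # e.g. "chest X-ray"
    organ: str                                               # e.g. "chest"
    findings: dict[str, str] = field(default_factory=dict)   # condition -> position

def generate_qa(meta: ImageMetadata) -> list[tuple[str, str]]:
    """Emit QA pairs across the diagnostic dimensions named in the text."""
    qa = [
        (f"Is this a {meta.modality}?", "yes"),                  # modality recognition
        (f"Does this image show the {meta.organ}?", "yes"),      # organ identification
        ("Are there abnormalities in this image?",
         "yes" if meta.findings else "no"),
    ]
    for condition, position in meta.findings.items():
        qa.append((f"Does the image show {condition}?", "yes"))  # condition/finding
        if position:
            qa.append((f"Is the {condition} located in the {position}?", "yes"))
    return qa

example = ImageMetadata("cxr_0001", "chest X-ray", "chest",
                        {"cardiomegaly": "central chest"})
for question, answer in generate_qa(example):
    print(question, "->", answer)
```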
Experimental Results
The evaluation of seven state-of-the-art LMMs on ProbMed revealed that even top-performing models like GPT-4V and Gemini Pro perform worse than random guessing on specialized diagnostic questions. Accuracy on condition identification and positional reasoning dropped drastically once adversarial pairs were introduced, underscoring the necessity of such testing to uncover latent vulnerabilities in Med-VQA.
Figure 3: An example illustrating the potential for misleading accuracy in existing evaluations. While the model correctly identifies the position of an existing finding in the standard evaluation, it fails to differentiate between actual and hallucinated positions when subjected to an adversarial evaluation.
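Assuming the paired protocol sketched earlier, the scoring that produces this "worse than random" effect can be expressed in a few lines. This is a hedged reconstruction of the evaluation logic, not the authors' released code: a pair earns credit only when both the original and the adversarial question are answered correctly.

```python
# Minimal sketch of paired scoring under adversarial evaluation (an assumed
# reconstruction of the protocol described above, not the official evaluator).

def paired_accuracy(results: list[tuple[bool, bool]]) -> float:
    """results: one (original_correct, adversarial_correct) flag per pair."""
    credited = sum(1 for orig_ok, adv_ok in results if orig_ok and adv_ok)
    return credited / len(results) if results else 0.0

# A model answering "yes" everywhere aces every original question but fails
# every adversarial counterpart, scoring 0.0.
always_yes = [(True, False)] * 100
print(paired_accuracy(always_yes))  # 0.0
```

For reference, random guessing on a pair of binary questions answers both correctly 25% of the time, which is the baseline against which the reported model accuracies fall short.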
Transferability of Domain Expertise
The paper also explored the transferability of domain-specific knowledge. CheXagent, trained specifically on chest X-rays, showed improved accuracy in identifying conditions on chest images across different modalities, suggesting potential for expertise transfer within specific domains. However, its capability to generalize this knowledge to other organs was limited.
Figure 4: Accuracy comparison of CheXagent in identifying organs and conditions/findings across different modalities. The model demonstrates significantly higher accuracy in identifying organs on chest images compared to images of other organs for both MRI and CT scans. Additionally, CheXagent shows improved accuracy in identifying conditions/findings on chest images, indicating the transferability of its specialized knowledge from chest X-ray training to other imaging modalities.
Conclusion
The findings reveal critical gaps in the reliability and robustness of LMMs for medical diagnosis. The insights from ProbMed highlight an urgent need to develop more robust evaluation methodologies and emphasize the importance of specialization and adversarial testing in real-world applications. As these models are integrated into sensitive fields like healthcare, ensuring their reliability and accuracy remains imperative.
This paper's conclusions point to the broader need for tailored evaluations and domain-specific training to enhance the applicability of LMMs in critical medical domains.