Worse than Random? An Embarrassingly Simple Probing Evaluation of Large Multimodal Models in Medical VQA (2405.20421v5)

Published 30 May 2024 in cs.AI

Abstract: Large Multimodal Models (LMMs) have shown remarkable progress in medical Visual Question Answering (Med-VQA), achieving high accuracy on existing benchmarks. However, their reliability under robust evaluation is questionable. This study reveals that when subjected to simple probing evaluation, state-of-the-art models perform worse than random guessing on medical diagnosis questions. To address this critical evaluation problem, we introduce the Probing Evaluation for Medical Diagnosis (ProbMed) dataset to rigorously assess LMM performance in medical imaging through probing evaluation and procedural diagnosis. In particular, the probing evaluation pairs original questions with negation questions containing hallucinated attributes, while procedural diagnosis requires reasoning across various diagnostic dimensions for each image, including modality recognition, organ identification, clinical findings, abnormalities, and positional grounding. Our evaluation reveals that top-performing models like GPT-4o, GPT-4V, and Gemini Pro perform worse than random guessing on specialized diagnostic questions, indicating significant limitations in handling fine-grained medical inquiries. Moreover, models like LLaVA-Med struggle even with more general questions, and results from CheXagent demonstrate the transferability of expertise across different modalities of the same organ, showing that specialized domain knowledge is still crucial for improving performance. This study underscores the urgent need for more robust evaluation to ensure the reliability of LMMs in critical fields like medical diagnosis; current LMMs remain far from applicable to those fields.

Summary

  • The paper introduces a probing evaluation with adversarial questions to reveal critical vulnerabilities in large multimodal models for medical VQA.
  • It constructs the ProbMed dataset with 6,303 images and 57,132 QA pairs to assess performance across modality recognition, organ identification, and positional reasoning.
  • Results show that top models like GPT-4V perform worse than random guessing under adversarial conditions, emphasizing the need for robust evaluation.

Worse than Random? An Embarrassingly Simple Probing Evaluation of Large Multimodal Models in Medical VQA

Introduction

The paper investigates the performance of Large Multimodal Models (LMMs) in the context of medical Visual Question Answering (Med-VQA). While LMMs have demonstrated high accuracy on existing benchmarks, their reliability under probing evaluations has not been systematically assessed. This paper introduces a dataset, ProbMed, designed to rigorously evaluate LMMs using adversarial questioning, revealing significant vulnerabilities in their performance on specialized medical diagnosis tasks.

Probing Evaluation Methodology

The authors propose a simple probing evaluation technique that pairs each original question with an adversarial question about a hallucinated attribute. This method challenges the robustness of LMMs by requiring them to differentiate precisely between actual and hallucinated medical findings. Performance drops sharply when models face such questions, exposing their unreliability under adversarial conditions (Figure 1).

Figure 1: Accuracy of four LMMs on two types of specialized questions in medical diagnoses, with and without adversarial pairs. The significant drop in accuracy with adversarial pairs highlights the models' unreliability in handling medical diagnoses.
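To make the pairing concrete, the sketch below builds a yes/no question about an actual finding together with an adversarial counterpart that asks about a hallucinated finding sampled from the same category. The finding pool, field names, and question template are illustrative assumptions, not the paper's released generation code.

```python
import random

# Illustrative pool of findings to sample hallucinated attributes from.
FINDING_POOL = ["cardiomegaly", "pleural effusion", "pneumothorax", "atelectasis"]

def make_probing_pair(image_id: str, true_finding: str) -> list[dict]:
    """Build an original yes/no question plus an adversarial counterpart
    that asks about a hallucinated (absent) finding."""
    hallucinated = random.choice([f for f in FINDING_POOL if f != true_finding])
    return [
        {"image": image_id,
         "question": f"Does the image show {true_finding}?",
         "answer": "yes", "type": "original"},
        {"image": image_id,
         "question": f"Does the image show {hallucinated}?",
         "answer": "no", "type": "adversarial"},
    ]

print(make_probing_pair("cxr_0001", "cardiomegaly"))
```

Because every answer is binary, a model that answers "yes" indiscriminately scores perfectly on the original questions but fails the adversarial half, which is exactly the failure mode the probing evaluation is designed to surface.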

The ProbMed Dataset

ProbMed is constructed from 6,303 images from the MedICaT and ChestX-ray14 datasets, encompassing a range of modalities and organs. With 57,132 question-answer pairs, ProbMed facilitates comprehensive assessments across multiple diagnostic dimensions, including modality recognition, organ identification, and positional reasoning. The dataset is designed to expose model weaknesses through adversarial pairs, providing a more robust evaluation framework (Figure 2).

Figure 2: Flow diagram of the ProbMed data curation process. Two comprehensive biomedical datasets were utilized to collect source data and construct a metadata file, enabling the automatic generation of high-quality question-answer pairs for the ProbMed dataset.
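A minimal sketch of the kind of metadata-driven generation this pipeline implies is shown below: each image's metadata entry yields yes/no questions along the diagnostic dimensions named above (modality, organ, findings, position). The metadata schema and question templates are assumptions for illustration, not the actual ProbMed curation scripts.

```python
def generate_qa(meta: dict) -> list[dict]:
    """Turn one image's metadata entry into yes/no questions across diagnostic dimensions."""
    qa = [
        {"q": f"Is this a {meta['modality']} image?", "a": "yes", "dimension": "modality"},
        {"q": f"Does this image show the {meta['organ']}?", "a": "yes", "dimension": "organ"},
    ]
    for finding in meta.get("findings", []):
        qa.append({"q": f"Is there evidence of {finding['name']}?",
                   "a": "yes", "dimension": "finding"})
        if "location" in finding:
            qa.append({"q": f"Is the {finding['name']} located in the {finding['location']}?",
                       "a": "yes", "dimension": "position"})
    return qa

example_meta = {
    "modality": "chest X-ray",
    "organ": "lung",
    "findings": [{"name": "pleural effusion", "location": "right lower lobe"}],
}
print(generate_qa(example_meta))
```

Adversarial counterparts, as in the previous sketch, would then be attached to each generated question to complete the probing setup.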

Experimental Results

The evaluation of seven state-of-the-art LMMs on ProbMed revealed that even top-performing models like GPT-4V and Gemini Pro perform worse than random guessing on specialized diagnostic questions. Accuracy on identifying conditions/findings and their positions was especially low, highlighting significant model limitations. Introducing adversarial pairs drastically reduced model accuracy, underscoring the necessity of such testing to uncover latent vulnerabilities in Med-VQA (Figure 3).

Figure 3: An example illustrating the potential for misleading accuracy in existing evaluations. While the model correctly identifies the position of an existing finding in the standard evaluation, it fails to differentiate between actual and hallucinated positions when subjected to an adversarial evaluation.
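Figure 3 suggests a natural paired scoring rule: credit a question only when the model answers both the original and its adversarial counterpart correctly. The sketch below implements that rule under the assumption that correctness flags for both halves are available; the field names are illustrative.

```python
def paired_accuracy(results: list[dict]) -> float:
    """results holds one entry per (original, adversarial) question pair,
    each with boolean correctness flags for both halves."""
    if not results:
        return 0.0
    # A pair is credited only if both the original and adversarial answers are correct.
    credited = sum(r["original_correct"] and r["adversarial_correct"] for r in results)
    return credited / len(results)

demo = [
    {"original_correct": True, "adversarial_correct": False},  # fooled by the hallucinated attribute
    {"original_correct": True, "adversarial_correct": True},
]
print(paired_accuracy(demo))  # 0.5
```

Under this rule, a model that answers "yes" indiscriminately scores zero even though single-question accuracy would reward it, illustrating how standard evaluations can overstate reliability.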

Transferability of Domain Expertise

The paper also explored the transferability of domain-specific knowledge. CheXagent, trained specifically on chest X-rays, showed improved accuracy in identifying conditions on chest images across different modalities, suggesting potential for expertise transfer within specific domains. However, its capability to generalize this knowledge to other organs was limited (Figure 4).

Figure 4: Accuracy comparison of CheXagent in identifying organs and conditions/findings across different modalities. The model demonstrates significantly higher accuracy in identifying organs on chest images compared to images of other organs for both MRI and CT scans. Additionally, CheXagent shows improved accuracy in identifying conditions/findings on chest images, indicating the transferability of its specialized knowledge from chest X-ray training to other imaging modalities.
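The comparison in Figure 4 amounts to breaking accuracy down by organ and imaging modality. A minimal sketch of that breakdown is below; the record fields are assumptions for illustration, not the paper's analysis code.

```python
from collections import defaultdict

def accuracy_by_group(records: list[dict]) -> dict[tuple[str, str], float]:
    """Compute accuracy per (organ, modality) group from per-question correctness records."""
    totals, correct = defaultdict(int), defaultdict(int)
    for r in records:
        key = (r["organ"], r["modality"])
        totals[key] += 1
        correct[key] += int(r["correct"])
    return {k: correct[k] / totals[k] for k in totals}

demo = [
    {"organ": "chest", "modality": "CT",  "correct": True},
    {"organ": "chest", "modality": "MRI", "correct": True},
    {"organ": "brain", "modality": "CT",  "correct": False},
]
print(accuracy_by_group(demo))
```

Comparing the chest rows against other organs within the same modality is what reveals whether chest X-ray expertise carries over to CT and MRI images of the chest.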

Conclusion

The findings reveal critical gaps in the reliability and robustness of LMMs for medical diagnosis. The insights from ProbMed highlight an urgent need to develop more robust evaluation methodologies and emphasize the importance of specialization and adversarial testing in real-world applications. As these models are integrated into sensitive fields like healthcare, ensuring their reliability and accuracy remains imperative.

This paper's conclusions point to the broader need for tailored evaluations and domain-specific training to enhance the applicability of LMMs in critical medical domains.
