- The paper introduces a probing evaluation with adversarial questions to reveal critical vulnerabilities in large multimodal models for medical VQA.
- It constructs the ProbMed dataset with 6,303 images and 57,132 QA pairs to assess performance across modality recognition, organ identification, and positional reasoning.
- Results show that top models like GPT-4V and Gemini Pro perform worse than random guessing on specialized diagnostic questions once adversarial pairs are introduced, emphasizing the need for robust evaluation.
Worse than Random? An Embarrassingly Simple Probing Evaluation of Large Multimodal Models in Medical VQA
Introduction
The paper investigates the performance of Large Multimodal Models (LMMs) in the context of medical Visual Question Answering (Med-VQA). While LMMs have demonstrated high accuracy on existing benchmarks, their reliability under probing evaluations has not been systematically assessed. This paper introduces a dataset, ProbMed, designed to rigorously evaluate LMMs using adversarial questioning, revealing significant vulnerabilities in their performance on specialized medical diagnosis tasks.
Probing Evaluation Methodology
The authors propose a simple probing evaluation technique that pairs each original question with an adversarial question about a hallucinated attribute, i.e., a modality, organ, condition, or position that is not actually present in the image. This pairing challenges the robustness of LMMs by requiring them to distinguish actual from hallucinated medical findings rather than simply answering affirmatively. The significant performance drops that models exhibit on these paired questions highlight their unreliability under adversarial conditions; a minimal sketch of the pairing appears below Figure 1.
Figure 1: Accuracy of four LMMs on two types of specialized questions in medical diagnoses, with and without adversarial pairs. The significant drop in accuracy with adversarial pairs highlights the models' unreliability in handling medical diagnoses.
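To make the pairing concrete, here is a minimal sketch of how an adversarial counterpart can be generated for a ground-truth attribute. The function name, the fixed modality list, and the restriction to the modality attribute are illustrative assumptions, not the authors' implementation; ProbMed applies the same idea to organs, conditions, and positions.

```python
import random

# Illustrative sketch of adversarial question pairing (not the paper's code).
# For each ground-truth attribute we emit the original yes-question plus a
# "hallucinated" no-question built from an attribute absent from the image.

MODALITIES = ["X-ray", "MRI", "CT scan"]  # assumed label set for illustration

def make_probing_pair(true_modality: str) -> list[tuple[str, str]]:
    """Return (question, expected_answer) pairs probing one image's modality."""
    original = (f"Is this image a {true_modality}?", "yes")
    # Sample a distractor modality the image does NOT have.
    hallucinated = random.choice([m for m in MODALITIES if m != true_modality])
    adversarial = (f"Is this image a {hallucinated}?", "no")
    return [original, adversarial]

# A model that answers "yes" to everything gets the original question right
# but fails the adversarial counterpart, so blanket agreement no longer scores.
print(make_probing_pair("X-ray"))
```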
The ProbMed Dataset
ProbMed is constructed from 6,303 images from the MedICaT and ChestX-ray14 datasets, encompassing a range of modalities and organs. With 57,132 question-answer pairs, ProbMed facilitates comprehensive assessments across multiple diagnostic dimensions, including modality recognition, organ identification, and positional reasoning. The dataset is designed to expose model weaknesses through adversarial pairs, providing a more robust evaluation framework.
Figure 2: Flow diagram of the ProbMed data curation process. Two comprehensive biomedical datasets were utilized to collect source data and construct a metadata file, enabling the automatic generation of high-quality question-answer pairs for the ProbMed dataset.
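The curation flow above can be pictured as a small metadata-driven generator. The record fields below (`modality`, `organ`, `findings`) are hypothetical stand-ins for the ProbMed metadata file, assumed purely for illustration; the actual schema may differ.

```python
from dataclasses import dataclass, field

@dataclass
class ImageMetadata:
    """Assumed shape of one metadata entry; not the actual ProbMed schema."""
    image_id: str
    modality: str                                            # e.g. "chest X-ray"
    organ: str                                               # e.g. "chest"
    findings: dict[str, str] = field(default_factory=dict)   # condition -> position

def generate_qa(meta: ImageMetadata) -> list[tuple[str, str]]:
    """Emit QA pairs across the diagnostic dimensions named in the text."""
    qa = [
        (f"Is this a {meta.modality}?", "yes"),                  # modality recognition
        (f"Does this image show the {meta.organ}?", "yes"),      # organ identification
        ("Are there abnormalities in this image?",
         "yes" if meta.findings else "no"),
    ]
    for condition, position in meta.findings.items():
        qa.append((f"Does the image show {condition}?", "yes"))  # condition/finding
        if position:
            qa.append((f"Is the {condition} located in the {position}?", "yes"))
    return qa

example = ImageMetadata("cxr_0001", "chest X-ray", "chest",
                        {"cardiomegaly": "central chest"})
for question, answer in generate_qa(example):
    print(question, "->", answer)
```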
Experimental Results
The evaluation of seven state-of-the-art LMMs on ProbMed revealed that even top-performing models like GPT-4V and Gemini Pro perform worse than random guessing on specialized diagnostic questions. Accuracy on condition identification and positional reasoning dropped drastically once adversarial pairs were introduced, underscoring the necessity of such testing to uncover latent vulnerabilities in Med-VQA.
Figure 3: An example illustrating the potential for misleading accuracy in existing evaluations. While the model correctly identifies the position of an existing finding in the standard evaluation, it fails to differentiate between actual and hallucinated positions when subjected to an adversarial evaluation.
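Assuming the paired protocol sketched earlier, the scoring that produces this "worse than random" effect can be expressed in a few lines. This is a hedged reconstruction of the evaluation logic, not the authors' released code: a pair earns credit only when both the original and the adversarial question are answered correctly.

```python
# Minimal sketch of paired scoring under adversarial evaluation (an assumed
# reconstruction of the protocol described above, not the official evaluator).

def paired_accuracy(results: list[tuple[bool, bool]]) -> float:
    """results: one (original_correct, adversarial_correct) flag per pair."""
    credited = sum(1 for orig_ok, adv_ok in results if orig_ok and adv_ok)
    return credited / len(results) if results else 0.0

# A model answering "yes" everywhere aces every original question but fails
# every adversarial counterpart, scoring 0.0.
always_yes = [(True, False)] * 100
print(paired_accuracy(always_yes))  # 0.0
```

For reference, random guessing on a pair of binary questions answers both correctly 25% of the time, which is the baseline against which the reported model accuracies fall short.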
Transferability of Domain Expertise
The paper also explored the transferability of domain-specific knowledge. CheXagent, trained specifically on chest X-rays, showed improved accuracy in identifying conditions on chest images across different modalities, suggesting potential for expertise transfer within specific domains. However, its capability to generalize this knowledge to other organs was limited.
Figure 4: Accuracy comparison of CheXagent in identifying organs and conditions/findings across different modalities. The model demonstrates significantly higher accuracy in identifying organs on chest images compared to images of other organs for both MRI and CT scans. Additionally, CheXagent shows improved accuracy in identifying conditions/findings on chest images, indicating the transferability of its specialized knowledge from chest X-ray training to other imaging modalities.
Conclusion
The findings reveal critical gaps in the reliability and robustness of LMMs for medical diagnosis. The insights from ProbMed highlight an urgent need to develop more robust evaluation methodologies and emphasize the importance of specialization and adversarial testing in real-world applications. As these models are integrated into sensitive fields like healthcare, ensuring their reliability and accuracy remains imperative.
This paper's conclusions point to the broader need for tailored evaluations and domain-specific training to enhance the applicability of LMMs in critical medical domains.