Benchmarking Large Language Models on Answering and Explaining Challenging Medical Questions (2402.18060v5)
Abstract: LLMs have demonstrated impressive performance in answering medical questions, such as achieving passing scores on medical licensing examinations. However, board exams and general clinical questions do not capture the complexity of realistic clinical cases. Moreover, the lack of reference explanations means we cannot easily evaluate the reasoning behind model decisions, a crucial component of supporting doctors in making complex medical decisions. To address these challenges, we construct two new datasets: JAMA Clinical Challenge and Medbullets (datasets and code are available at https://github.com/HanjieChen/ChallengeClinicalQA). JAMA Clinical Challenge consists of questions based on challenging clinical cases, while Medbullets comprises simulated clinical questions. Both datasets are structured as multiple-choice question-answering tasks and are accompanied by expert-written explanations. We evaluate seven LLMs on the two datasets using various prompts. Experiments demonstrate that our datasets are harder than previous benchmarks. In-depth automatic and human evaluations of model-generated explanations provide insights into the promise and deficiencies of LLMs for explainable medical QA.
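As a concrete illustration of the multiple-choice setup described in the abstract, the sketch below shows one plausible way to render a dataset example into a prompt that asks a model for both an answer letter and an explanation. This is a minimal sketch under assumptions: the field names (`question`, `options`, `answer`) and the sample case are hypothetical for illustration and are not taken from the released datasets or the paper's actual code.

```python
# Hypothetical sketch: format a multiple-choice clinical question as a
# plain-text prompt requesting an answer letter plus an explanation.
# The example schema ("question", "options", "answer") is an assumption,
# not the schema of the JAMA Clinical Challenge or Medbullets releases.

def format_mcqa_prompt(example: dict) -> str:
    """Render a multiple-choice clinical question as a plain-text prompt."""
    lines = [example["question"], ""]
    # Sort options by their letter label so they print as A, B, C, D.
    for label, option in sorted(example["options"].items()):
        lines.append(f"{label}. {option}")
    lines.append("")
    lines.append("Answer with the letter of the best option, then explain your reasoning.")
    return "\n".join(lines)

if __name__ == "__main__":
    example = {
        "question": (
            "A 54-year-old man presents with crushing substernal chest pain. "
            "What is the next best step?"
        ),
        "options": {
            "A": "Obtain an ECG",
            "B": "Discharge home",
            "C": "Order a chest CT",
            "D": "Start antibiotics",
        },
        "answer": "A",  # gold label; model output would be compared against this
    }
    print(format_mcqa_prompt(example))
```

In this framing, the model's answer letter can be scored automatically against the gold label, while its free-text explanation can be compared to the expert-written explanation, which matches the paper's combination of automatic and human evaluation of model-generated explanations.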