Abstract

In recent years, LLMs have demonstrated an impressive ability to encode knowledge during pre-training on large text corpora. They can leverage this knowledge for downstream tasks like question answering (QA), even in complex areas involving health topics. Considering their high potential for facilitating clinical work in the future, understanding the quality of encoded medical knowledge and its recall in LLMs is an important step forward. In this study, we examine the capability of LLMs to recall encoded medical knowledge by constructing a novel dataset derived from systematic reviews -- studies synthesizing evidence-based answers to specific medical questions. Through experiments on the new MedREQAL dataset, comprising question-answer pairs extracted from rigorous systematic reviews, we assess six LLMs, including GPT and Mixtral, analyzing their classification and generation performance. Our experimental insights into LLM performance on the novel biomedical QA dataset reveal the still challenging nature of this task.

Figure: Percentage distribution of health areas classified in the MedREQAL dataset.

Overview

  • The paper 'MedREQAL: Examining Medical Knowledge Recall of LLMs via Question Answering' introduces a new dataset derived from systematic reviews to evaluate how well various LLMs recall and generate accurate medical knowledge.

  • The study evaluated six LLMs, spanning general-purpose (e.g., GPT-4) and biomedical-specific models (e.g., PMC-LLaMA 13B), in a zero-shot setting: each model generated answers and classified questions without additional context, with performance measured by accuracy, F1 score, ROUGE, and BERTScore.

  • Results indicated that the Mixtral model outperformed other LLMs in both classification and generation tasks, while the research highlighted the necessity for ongoing refinement of LLMs, especially to improve evidence synthesis and to distinguish between refuted evidence and lack of sufficient information.

MedREQAL: Examining Medical Knowledge Recall of LLMs via Question Answering

"MedREQAL: Examining Medical Knowledge Recall of LLMs via Question Answering" provides a comprehensive investigation into the proficiency of various LLMs in recalling and generating accurate medical knowledge. This paper, authored by Juraj Vladika, Phillip Schneider, and Florian Matthes from the Technical University of Munich, evaluates the performance of LLMs using a novel dataset constructed from systematic reviews, which are recognized for synthesizing high-quality evidence-based conclusions in the field of healthcare.

Introduction

The research was motivated by the impressive yet under-explored capability of LLMs to encode domain-specific knowledge during pre-training on extensive text corpora. Given the growing potential and application of LLMs in healthcare, it becomes imperative to scrutinize the quality and recall of the medical knowledge embedded within these models. Systematic reviews serve as an optimal source for this assessment due to their structured and rigorous approach to addressing clinical questions with synthesized evidence.

MedREQAL Dataset

The MedREQAL dataset was constructed from systematic reviews, specifically those conducted by the Cochrane Collaboration, a prominent organization dedicated to evidence-based healthcare. The dataset consists of 2,786 question-answer pairs, derived by automatically generating questions from the objectives of the systematic reviews and extracting conclusions as answers. Furthermore, a classification label (supported, refuted, or not enough information) was assigned to each question-answer pair to facilitate a structured evaluation.
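
To make the dataset structure concrete, the sketch below shows how one MedREQAL-style record could be represented in Python. The field names, label strings, and the example record are assumptions based on the description above, not the released schema or actual dataset content.

```python
from dataclasses import dataclass
from enum import Enum


class Verdict(str, Enum):
    """The three classification labels described in the paper."""
    SUPPORTED = "supported"
    REFUTED = "refuted"
    NOT_ENOUGH_INFORMATION = "not enough information"


@dataclass
class MedREQALRecord:
    """One question-answer pair derived from a Cochrane systematic review.

    Field names are illustrative; the released dataset may use different columns.
    """
    question: str      # generated automatically from the review's objective
    answer: str        # extracted from the review's conclusion
    label: Verdict     # supported / refuted / not enough information
    health_area: str   # topic area used for the distribution shown in the figure


# Hypothetical example record (not taken from the actual dataset).
example = MedREQALRecord(
    question="Does vitamin C supplementation prevent the common cold?",
    answer="Routine supplementation did not reduce cold incidence in the general population.",
    label=Verdict.REFUTED,
    health_area="Respiratory infections",
)
print(example.label.value)  # "refuted"
```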

Methodology

The evaluation involved six LLMs, including both general-purpose and biomedical-specific models. The selected models were GPT-4, Mistral-7B, Mixtral, PMC-LLaMA 13B, MedAlpaca 7B, and ChatDoctor 7B. Each model was evaluated in a zero-shot setting, generating answers and classifying questions without any additional context. Performance was measured with classification metrics (accuracy and F1 score) and natural language generation (NLG) metrics (ROUGE and BERTScore).
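
The following is a minimal sketch of such a zero-shot evaluation loop. The prompt wording, the label-parsing heuristic, the macro-averaged F1, and the `query_llm` placeholder are all assumptions for illustration; the authors' exact protocol may differ.

```python
"""Sketch of a zero-shot QA evaluation over MedREQAL-style records."""
from rouge_score import rouge_scorer
from sklearn.metrics import accuracy_score, f1_score

LABELS = ["supported", "refuted", "not enough information"]

PROMPT = (
    "Answer the following medical question using only your own knowledge.\n"
    "Question: {question}\n"
    "Give a short evidence-based answer, then state whether the intervention is "
    "'supported', 'refuted', or there is 'not enough information'."
)


def query_llm(prompt: str) -> str:
    """Placeholder: call the LLM under evaluation and return its raw text output."""
    raise NotImplementedError


def parse_label(generated: str) -> str:
    """Naive label extraction: return the first label string found in the output."""
    text = generated.lower()
    for label in LABELS:
        if label in text:
            return label
    return "not enough information"  # fallback when no label is mentioned


def evaluate(records):
    """records: iterable of dicts with 'question', 'answer', and 'label' keys."""
    gold_labels, pred_labels, gold_answers, pred_answers = [], [], [], []
    for rec in records:
        output = query_llm(PROMPT.format(question=rec["question"]))
        pred_answers.append(output)
        gold_answers.append(rec["answer"])
        pred_labels.append(parse_label(output))
        gold_labels.append(rec["label"])

    # ROUGE-L F-measure between generated answers and review conclusions.
    scorer = rouge_scorer.RougeScorer(["rougeL"], use_stemmer=True)
    rouge_l = sum(
        scorer.score(ref, hyp)["rougeL"].fmeasure
        for ref, hyp in zip(gold_answers, pred_answers)
    ) / len(gold_answers)

    return {
        "accuracy": accuracy_score(gold_labels, pred_labels),
        "macro_f1": f1_score(gold_labels, pred_labels, average="macro"),
        "rouge_l": rouge_l,
    }
```

BERTScore could be computed analogously, e.g. with `bert_score.score(pred_answers, gold_answers, lang="en")`.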

Results

The results highlighted distinct capabilities and limitations of the evaluated models. Notably, Mixtral performed best in both classification (accuracy: 62.0, F1: 34.8) and generation (ROUGE-L: 21.1, BERTScore: 85.6), surpassing even the otherwise highly regarded GPT-4. The performance gap was attributed to the tendency of some models to generate affirmative answers even when the evidence is insufficient, as observed with GPT-4 and ChatDoctor. Mistral and Mixtral leaned towards more cautious, evidence-grounded responses, which contributed to their stronger performance.

Interestingly, biomedical models such as PMC-LLaMA were able to refer to specific randomized controlled trials, but struggled with comprehensive evidence synthesis, indicating the need for further refinement. The study also revealed a fundamental challenge for LLMs in distinguishing between the "refuted" and "not enough information" classes: both are often expressed with similar negative phrasing, which leads to classification errors.
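
The small example below illustrates why these two classes are easy to conflate: conclusions for both gold labels tend to be phrased negatively, so a purely lexical cue collapses them into one class. The sentences and the keyword heuristic are invented for illustration; they are not drawn from the dataset or the paper.

```python
# Two conclusion-style statements with similar negative surface cues but
# different gold labels (illustrative examples, not from the dataset).
refuted = "The pooled trials show that the intervention does not reduce mortality."
# gold: refuted -- the evidence actively contradicts the claim
nei = "There is not enough high-quality evidence to determine whether the intervention reduces mortality."
# gold: not enough information -- the evidence is inconclusive


def naive_classifier(conclusion: str) -> str:
    """Deliberately naive lexical heuristic: any negation cue -> 'refuted'."""
    negation_cues = ("does not", "did not", "no evidence", "not")
    return "refuted" if any(cue in conclusion.lower() for cue in negation_cues) else "supported"


print(naive_classifier(refuted))  # "refuted"  -- matches the gold label
print(naive_classifier(nei))      # "refuted"  -- wrong: gold label is "not enough information"
```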

Implications and Future Work

The findings from this study have several practical and theoretical implications. The moderate success in recalling medical evidence points towards the potential integration of LLMs into clinical decision support systems, provided that the models' propensity for giving definitive answers without sufficient evidence is mitigated. This underscores the importance of continuous improvement and frequent updating of the models' knowledge base to reflect the latest scientific evidence.

Furthermore, the MedREQAL dataset itself serves as a valuable resource for advancing research in biomedical question answering, providing a benchmark for developing more sophisticated retrieval-augmented generation methodologies and multi-document summarization techniques. Future research should explore refining the in-context learning capabilities of LLMs and developing mechanisms to ensure the currency of medical knowledge encoded within these models. Addressing the challenge of distinguishing between lack of evidence and evidence refutation remains a critical area for model enhancement.

Conclusion

The MedREQAL paper presents a robust framework for evaluating the medical knowledge recall abilities of LLMs and illuminates both their strengths and limitations. By leveraging a high-quality dataset derived from systematic reviews, this study provides a rigorous and insightful analysis, advancing our understanding of LLM capabilities in the complex domain of healthcare. The results highlight the necessity for ongoing model refinement and pave the way for future innovations in medical AI applications.
