
Assessing The Potential Of Mid-Sized Language Models For Clinical QA

(2404.15894)
Published Apr 24, 2024 in cs.CL and cs.AI

Abstract

Large language models, such as GPT-4 and Med-PaLM, have shown impressive performance on clinical tasks; however, they require access to compute, are closed-source, and cannot be deployed on device. Mid-size models such as BioGPT-large, BioMedLM, LLaMA 2, and Mistral 7B avoid these drawbacks, but their capacity for clinical tasks has been understudied. To help assess their potential for clinical use and help researchers decide which model they should use, we compare their performance on two clinical question-answering (QA) tasks: MedQA and consumer query answering. We find that Mistral 7B is the best performing model, winning on all benchmarks and outperforming models trained specifically for the biomedical domain. While Mistral 7B's MedQA score of 63.0% approaches the original Med-PaLM, and it often can produce plausible responses to consumer health queries, room for improvement still exists. This study provides the first head-to-head assessment of open source mid-sized models on clinical tasks.

Figure: Various stages of the consumer-focused question-answering system developed by the researchers.

Overview

  • The study focuses on evaluating the efficacy of mid-sized language models in healthcare-related QA, comparing their performance against larger, established models like GPT-4 and Med-PaLM.

  • Using two clinical QA datasets, MedQA and MultiMedQA, the paper investigates how well mid-sized models like BioGPT-large, BioMedLM, LLaMA 2, and Mistral 7B perform in generating clinically relevant answers.

  • Despite the promising results of Mistral 7B, the leading performer among mid-sized models, there remains a performance gap compared to larger models, underscoring the need for further refinements and model expansions.

Assessing the Efficacy of Mid-Sized Language Models on Clinical QA Tasks

Introduction

The use of LLMs in the healthcare sector has attracted significant interest due to their promising applications in clinical question-answering (QA) tasks. This space has traditionally been dominated by large-scale models such as GPT-4 and Med-PaLM, which offer impressive capabilities but pose challenges including heavy computational demands, closed-source architectures, and unsuitability for on-device deployment. This paper evaluates four mid-sized models (BioGPT-large, BioMedLM, LLaMA 2, and Mistral 7B) on healthcare QA tasks to gauge their utility in clinical contexts without the constraints imposed by larger models.

Evaluation Setup and Methods

Clinical QA Datasets

The models were evaluated on two primary QA benchmarks:

  • MedQA: Focuses on USMLE-style multiple-choice questions assessing medical knowledge and clinical decision-making.
  • MultiMedQA Long Form Answering: Requires models to generate paragraph-length responses to consumer health queries that simulate real-world questions a patient might ask (a prompt-formatting sketch follows this list).
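
To make the two task formats concrete, the sketch below shows one plausible way to render a MedQA-style multiple-choice item and a consumer health query as prompt/completion text for a causal language model. The prompt templates, field names, and example content are illustrative assumptions, not the paper's actual preprocessing.

```python
# Hypothetical prompt formatting for the two QA tasks; the exact templates used
# in the paper are not specified here, so these are illustrative assumptions.

def format_medqa(question: str, options: dict, answer_key: str) -> tuple:
    """Render a USMLE-style multiple-choice item as a (prompt, completion) pair."""
    option_lines = "\n".join(f"({key}) {text}" for key, text in sorted(options.items()))
    prompt = f"Question: {question}\n{option_lines}\nAnswer:"
    return prompt, f" ({answer_key})"

def format_consumer_query(query: str, reference_answer: str) -> tuple:
    """Render a consumer health query as a (prompt, completion) pair for long-form answering."""
    prompt = f"Patient question: {query}\nAnswer:"
    return prompt, " " + reference_answer

# Illustrative (invented) example item.
prompt, completion = format_medqa(
    question="A 45-year-old man presents with crushing substernal chest pain...",
    options={"A": "Aortic dissection", "B": "Myocardial infarction",
             "C": "Pulmonary embolism", "D": "Pericarditis"},
    answer_key="B",
)
```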

Model Training and Tuning

All models underwent fine-tuning on specific clinical datasets:

  • MedQA Training: Models were tuned to select correct answers from multiple-choice options based on provided clinical prompts. Fine-tuning was uniform across all models to maintain comparability.
  • MultiMedQA Training: Because no dedicated training set exists for this task, a new dataset was curated from online medical resources, converting medical content into a question-response format suitable for model training (a fine-tuning sketch follows this list).
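
Neither training recipe is reproduced in this summary, so the following is only a minimal fine-tuning sketch: it assumes prompt/completion pairs like those formatted above and uses the Hugging Face transformers Trainer with a standard causal-language-modeling objective. The model choice, sequence length, and optimizer settings are assumptions, not the paper's reported configuration.

```python
# Minimal causal-LM fine-tuning sketch (assumed setup, not the authors' exact
# configuration). Requires: torch, transformers.
import torch
from torch.utils.data import Dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          Trainer, TrainingArguments)

MODEL_NAME = "mistralai/Mistral-7B-v0.1"  # any of the compared mid-sized models


class QADataset(Dataset):
    """Wraps (prompt, completion) string pairs as padded token IDs for training."""

    def __init__(self, pairs, tokenizer, max_length=512):
        self.examples = []
        for prompt, completion in pairs:
            enc = tokenizer(prompt + completion, truncation=True,
                            padding="max_length", max_length=max_length,
                            return_tensors="pt")
            input_ids = enc["input_ids"].squeeze(0)
            attention_mask = enc["attention_mask"].squeeze(0)
            labels = input_ids.clone()
            labels[attention_mask == 0] = -100  # ignore padding in the loss
            self.examples.append({"input_ids": input_ids,
                                  "attention_mask": attention_mask,
                                  "labels": labels})

    def __len__(self):
        return len(self.examples)

    def __getitem__(self, idx):
        return self.examples[idx]


tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token  # LLaMA/Mistral tokenizers ship without a pad token

model = AutoModelForCausalLM.from_pretrained(MODEL_NAME, torch_dtype=torch.bfloat16)

# In practice these pairs would be built from the MedQA or curated consumer-query data.
train_pairs = [("Question: ...\nAnswer:", " (B)")]

trainer = Trainer(
    model=model,
    args=TrainingArguments(
        output_dir="clinical-qa-finetune",
        per_device_train_batch_size=1,
        gradient_accumulation_steps=16,
        num_train_epochs=3,
        learning_rate=2e-5,
        bf16=True,
        logging_steps=50,
    ),
    train_dataset=QADataset(train_pairs, tokenizer),
)
trainer.train()
```

Fully fine-tuning a 7B-parameter model this way requires substantial GPU memory; parameter-efficient methods such as LoRA are a common alternative, though whether the authors relied on them is not stated here.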

Results and Performance

MedQA Task Performance

The best-performing model on the MedQA task was Mistral 7B, which reached 63.0% after additional training on an expanded dataset. This compares favorably with dedicated biomedical models trained on domain-specific corpora, yet remains well below the scores reported for the largest models such as GPT-4.
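
As one illustration of how multiple-choice accuracy of this kind can be computed (not necessarily the paper's evaluation protocol), a common recipe is to score each answer option by its token log-likelihood under the model and pick the highest-scoring one:

```python
# Hedged sketch: score each multiple-choice option by its average token
# log-probability under the model and report accuracy. This is one common
# evaluation recipe, not necessarily the one used in the paper.
import torch


@torch.no_grad()
def pick_option(model, tokenizer, prompt: str, options: dict) -> str:
    """Return the key of the option the model assigns the highest average log-prob."""
    device = next(model.parameters()).device
    scores = {}
    for key, text in options.items():
        prompt_len = tokenizer(prompt, return_tensors="pt").input_ids.shape[1]
        full_ids = tokenizer(prompt + " " + text, return_tensors="pt").input_ids.to(device)
        logits = model(full_ids).logits[0]
        # Positions prompt_len-1 .. T-2 predict the answer tokens prompt_len .. T-1.
        # (Assumes the prompt tokenization is a prefix of the full tokenization.)
        log_probs = torch.log_softmax(logits[:-1], dim=-1)
        targets = full_ids[0, prompt_len:]
        answer_log_probs = log_probs[prompt_len - 1:].gather(1, targets.unsqueeze(1))
        scores[key] = answer_log_probs.mean().item()
    return max(scores, key=scores.get)


def mcq_accuracy(model, tokenizer, examples) -> float:
    """examples: list of (prompt, options_dict, gold_key) triples."""
    correct = sum(pick_option(model, tokenizer, p, opts) == gold
                  for p, opts, gold in examples)
    return correct / len(examples)
```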

MultiMedQA Long Form Task Performance

The evaluation involved a detailed clinician review of generated responses across several metrics:

  • Completeness and Medical Accuracy: Mistral 7B demonstrated the highest competence, often generating the most comprehensive and medically appropriate responses.
  • Safety Metrics: Responses were assessed for potential harm and propensity for error; although mid-sized models like Mistral 7B scored well, their results still leave room for improvement before matching leading models such as Med-PaLM 2.

Discussion

Despite Mistral 7B's leading performance among mid-sized models, the results underscore a gap that remains relative to the largest models. They suggest that while mid-sized models offer practical alternatives with better environmental and economic efficiency, they do not yet match the performance of models with tens of billions of parameters or more.

Conclusion and Future Directions

The study establishes an important benchmark for mid-sized models on clinical QA tasks, suggesting that they hold potential but require further advances. Future work could investigate incorporating biomedical domain data during pretraining, applying advanced training strategies such as reinforcement learning, and expanding model scale within computational feasibility.

This analysis leaves the field poised for continued innovation, where the accessibility of open-source, moderately scaled models may democratize advanced AI tools in clinical settings, providing substantial utility while managing cost and computational overhead.
