Abstract

The integration of AI, especially LLMs, into the clinical diagnosis process offers significant potential to improve the efficiency and accessibility of medical care. While LLMs have shown some promise in the medical domain, their application in clinical diagnosis remains underexplored, especially in real-world clinical practice, where highly sophisticated, patient-specific decisions need to be made. Current evaluations of LLMs in this field are often narrow in scope, focusing on specific diseases or specialties and employing simplified diagnostic tasks. To bridge this gap, we introduce CliBench, a novel benchmark developed from the MIMIC IV dataset, offering a comprehensive and realistic assessment of LLMs' capabilities in clinical diagnosis. This benchmark not only covers diagnoses from a diverse range of medical cases across various specialties but also incorporates tasks of clinical significance: treatment procedure identification, lab test ordering, and medication prescription. Supported by structured output ontologies, CliBench enables a precise and multi-granular evaluation, offering an in-depth understanding of LLMs' capabilities on diverse clinical tasks at the desired granularity. We conduct a zero-shot evaluation of leading LLMs to assess their proficiency in clinical decision-making. Our preliminary results shed light on the potential and limitations of current LLMs in clinical settings, providing valuable insights for future advancements in LLM-powered healthcare.

Figure: Diagnosis decision performance of GPT-4o and LLaMA3 70B Instruct by unique diagnosis chapters.

Overview

  • The paper introduces CliBench, a comprehensive benchmark designed to evaluate the application of LLMs in various clinical decision-making tasks such as diagnoses, procedures, lab tests orders, and prescriptions.

  • CliBench utilizes the MIMIC IV dataset and ensures the inclusion of a wide array of real-world clinical cases, offering a more nuanced evaluation of LLMs by employing hierarchical and multi-granular assessment levels.

  • Key findings indicate that current LLMs are challenged by the complexity of clinical tasks: instruction-tuned models outperform their non-tuned counterparts, while domain-specialized models do not consistently surpass general instruction-tuned models, underscoring the need for better domain adaptation strategies.

Overview of CliBench: Multifaceted Evaluation of LLMs in Clinical Decisions

The paper, "CliBench: Multifaceted Evaluation of LLMs in Clinical Decisions on Diagnoses, Procedures, Lab Tests Orders and Prescriptions" presents a comprehensive benchmark designed to evaluate the application of LLMs in clinical settings. This work emerges from the University of California, Los Angeles, and represents a significant methodological advancement in understanding LLM capabilities in realistic, patient-specific clinical decision-making tasks.

Core Contributions

Benchmark Design

Multifaceted Task Coverage:

  • Diagnosis Decisions: Identifying diseases as ICD-10-CM codes based on the patient profile, medical records at admission, and lab and radiology results.
  • Procedure Identification: Determining the initial clinical procedures (ICD-10-PCS codes) performed after admission.
  • Lab Test Orders: Predicting the necessary lab tests to order, coded with LOINC.
  • Medication Prescriptions: Generating initial medication prescriptions, classified by ATC codes.
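
To make the task setup concrete, below is a minimal sketch of how a single diagnosis-decision example might be represented. The field names and clinical values are illustrative assumptions, not the dataset's actual schema.

```python
# Hypothetical structure of one diagnosis-decision example; CliBench's
# actual schema and fields may differ.
from dataclasses import dataclass, field

@dataclass
class DiagnosisExample:
    patient_profile: str                                      # demographics, chief complaint
    admission_notes: str                                      # medical records at admission
    lab_results: list[str] = field(default_factory=list)      # lab findings
    radiology_reports: list[str] = field(default_factory=list)
    gold_icd10cm: list[str] = field(default_factory=list)     # target diagnosis codes

example = DiagnosisExample(
    patient_profile="68-year-old male, shortness of breath on exertion",
    admission_notes="History of hypertension and type 2 diabetes...",
    lab_results=["BNP 1250 pg/mL (high)"],
    gold_icd10cm=["I50.9", "E11.9"],  # heart failure; type 2 diabetes
)
```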

Data Source and Preparation:

  • CliBench utilizes the MIMIC IV dataset, extending beyond previous benchmarks by offering a broader spectrum of real-world clinical cases from multiple specialties. The data extraction process ensures comprehensive inclusion of clinical contexts and structured ontological mappings, addressing the usual narrow scope of existing evaluations.

Hierarchical and Multi-granular Evaluation:

  • The evaluation leverages multiple levels of granularity, reflecting the complexity and specificity required for clinical practice. This includes levels ranging from broad diagnoses to detailed sub-categories, mirroring the ICD-10-CM, ICD-10-PCS, LOINC, and ATC hierarchies.
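
A minimal sketch of this multi-granular scoring follows, assuming set-based F1 over ICD-10-CM codes compared either as full codes or truncated to their 3-character category; the paper's exact matching levels and aggregation may differ.

```python
# Set-based F1 at two ICD-10-CM granularities: full code vs. 3-character
# category. An illustrative sketch, not the benchmark's exact metric.
def truncate(code: str, level: str) -> str:
    """Normalize a code and optionally keep only its 3-character category."""
    code = code.replace(".", "")
    return code[:3] if level == "category" else code

def set_f1(pred: list[str], gold: list[str], level: str = "full") -> float:
    """F1 between predicted and gold code sets at the given granularity."""
    p = {truncate(c, level) for c in pred}
    g = {truncate(c, level) for c in gold}
    tp = len(p & g)
    if tp == 0:
        return 0.0
    precision, recall = tp / len(p), tp / len(g)
    return 2 * precision * recall / (precision + recall)

# A full-code miss can still count as a category-level hit:
pred, gold = ["I50.9", "E11.9"], ["I50.1", "E11.9"]
print(set_f1(pred, gold, "full"))      # 0.5  (only E11.9 matches exactly)
print(set_f1(pred, gold, "category"))  # 1.0  (categories I50 and E11 both match)
```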

Zero-shot and Fine-tuned Evaluations:

  • The authors conducted zero-shot assessments of various LLMs, including models from the Mistral, LLaMA, and GPT families, alongside fine-tuned variants. The results offer insight into both the out-of-the-box and fine-tuned capabilities of these models on clinical tasks.
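
As an illustration, a zero-shot diagnosis query might look like the following sketch, here using the OpenAI chat completions API; the prompt wording, output format, and parsing are assumptions rather than the benchmark's actual protocol.

```python
# Illustrative zero-shot diagnosis prompt; CliBench's actual prompts differ.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def diagnose_zero_shot(patient_record: str) -> list[str]:
    """Ask the model for ICD-10-CM diagnosis codes, one per line."""
    response = client.chat.completions.create(
        model="gpt-4o",
        temperature=0.0,  # deterministic output for evaluation
        messages=[
            {"role": "system",
             "content": ("You are a clinical decision-support assistant. "
                         "Given a patient's admission record, list the most "
                         "likely ICD-10-CM diagnosis codes, one per line.")},
            {"role": "user", "content": patient_record},
        ],
    )
    text = response.choices[0].message.content or ""
    return [line.strip() for line in text.splitlines() if line.strip()]
```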

Key Results

Performance Differentials:

  • The evaluation highlighted that current state-of-the-art LLMs are challenged by the intricate nature of clinical diagnosis tasks, with varying degrees of success across models and task types.
  • GPT-4o showed the strongest performance at the finest diagnosis granularity, reaching 27.58% F1 for full-code matching, compared with open models such as LLaMA3 Instruct 70B at 20.21% F1.

Instruction Tuning Importance:

  • Instruction-tuned models consistently outperformed their non-tuned counterparts, underscoring the necessity of such tuning for clinical applications.

Domain-Specialized Models:

  • The evaluation of domain-specialized models, such as BioMistral DARE and Meditron, revealed that these models did not consistently outperform general instruction-tuned models, suggesting room for improvement in domain adaptation strategies.

Precision and Recall Tradeoffs:

  • A noticeable tradeoff between precision and recall was observed across models. For example, GPT-4 Turbo favored recall, predicting broader sets of candidate codes, while models such as GPT-3.5 Turbo were more conservative, favoring precision.
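
This tradeoff is easy to see with set-based metrics: a model that emits many candidate codes gains recall at the expense of precision, and vice versa. A small illustration with hypothetical code sets:

```python
# Hypothetical predictions illustrating the precision/recall tradeoff;
# these code sets are invented for illustration only.
def precision_recall(pred: set[str], gold: set[str]) -> tuple[float, float]:
    tp = len(pred & gold)
    return (tp / len(pred) if pred else 0.0,
            tp / len(gold) if gold else 0.0)

gold = {"I50.9", "E11.9", "N18.3"}
liberal = {"I50.9", "E11.9", "N18.3", "I10", "J44.9", "Z79.4"}  # many guesses
conservative = {"I50.9"}                                        # one confident guess

print(precision_recall(liberal, gold))       # (0.5, 1.0): high recall, lower precision
print(precision_recall(conservative, gold))  # (1.0, ~0.33): high precision, lower recall
```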

Implications and Future Directions

The introduction of CliBench provides a robust tool for evaluating LLMs in a realistic clinical decision-making environment. This benchmark addresses previous limitations related to narrow scopes and simplified tasks, offering a more granular and patient-specific assessment of LLMs. The insights gained from this study have several implications:

Clinical Integration:

  • Current LLMs, while showing promise, still require significant improvement before they can be relied upon in clinical use. The detailed performance metrics provided by CliBench can guide future model enhancements for practical deployment in healthcare.

Enhanced Model Training:

  • Future developments should focus on improved instruction tuning and domain adaptation methods tailored specifically for clinical knowledge. Preference optimization and supervised fine-tuning could be crucial in advancing LLM capabilities.

Diverse and Comprehensive Datasets:

  • Given that domain-specialized models fell short of expectations, training datasets should be diversified to include a wider range of clinical scenarios, particularly ones that mirror real-world complexity.

CliBench sets a high standard for future research by demanding rigorous, multifaceted evaluations that better reflect the practical challenges of clinical environments. As LLMs continue to evolve, this benchmark will play a pivotal role in ensuring their safe, effective, and reliable application in healthcare.

By establishing stringent evaluation parameters and analyzing the performance of leading LLMs, this paper provides a critical foundation for future research aimed at bridging the gap between AI capabilities and clinical necessities. As the field evolves, leveraging benchmarks like CliBench will be integral to developing robust and clinically viable AI solutions.
