Emergent Mind

Abstract

Background: Evidence-based medicine (EBM) is fundamental to modern clinical practice, requiring clinicians to continually update their knowledge and apply the best clinical evidence in patient care. The practice of EBM faces challenges due to rapid advancements in medical research, leading to information overload for clinicians. The integration of AI, specifically generative LLMs, offers a promising solution for managing this complexity. Methods: This study involved the curation of real-world clinical cases across various specialties, converting them into .json files for analysis. LLMs, including proprietary models like GPT-3.5, GPT-4, and Gemini Pro, and open-source models like LLaMA v2 and Mixtral-8x7B, were employed. These models were equipped with tools to retrieve information from case files and make clinical decisions, similar to how clinicians must operate in the real world. Model performance was evaluated on correctness of the final answer, judicious use of tools, conformity to guidelines, and resistance to hallucinations. Results: GPT-4 was the most capable of autonomous operation in a clinical setting, being generally more effective at ordering relevant investigations and conforming to clinical guidelines. Limitations were observed in the models' ability to handle complex guidelines and diagnostic nuances. Retrieval-Augmented Generation made recommendations more tailored to patients and healthcare systems. Conclusions: LLMs can be made to function as autonomous practitioners of evidence-based medicine. Their ability to utilize tooling can be harnessed to interact with the infrastructure of a real-world healthcare system and perform the tasks of patient management in a guideline-directed manner. Prompt engineering may help to further enhance this potential and transform healthcare for the clinician and the patient.

Overview

  • The study explores using Generative LLMs as autonomous agents in evidence-based clinical practice, evaluating models like GPT-4 and proprietary and open-source counterparts.

  • Clinical cases were structured into .json files for assessment, and LLMs were evaluated on metrics such as answer correctness, judicious tool use, guideline conformity, and hallucination resistance.

  • GPT-4 outperformed other models in multiple specialties, particularly in correctness and tool use, and showed marked improvement with Retrieval-Augmented Generation (RAG) in adhering to clinical guidelines.

Generative LLMs as Autonomous Practitioners in Evidence-Based Medicine

The paper "Generative LLMs are Autonomous Practitioners of Evidence-Based Medicine" investigates the utilization of Generative LLMs as autonomous agents in evidence-based clinical practice. The study leverages the problem-solving and reasoning abilities of LLMs to manage real-world clinical cases autonomously, incorporating prompt engineering, diagnostic tooling, retrieval-augmented generation (RAG), and established clinical guidelines.

Methods

The study curated real-world clinical cases across multiple medical specialties and converted them into structured .json files. These files encompassed clinically relevant information such as patient symptoms, signs, past medical history, and results from lab tests or imaging studies, paired with questions about the best next steps in patient management.
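A case file of this kind might look like the following. This is a hypothetical sketch: the field names and clinical values are illustrative, not the paper's actual schema.

```python
import json

# Hypothetical clinical case file; all field names and values are
# illustrative, not taken from the paper's actual schema.
case = {
    "specialty": "Cardiology",
    "presentation": "62-year-old man with exertional chest pain for 2 weeks",
    "past_medical_history": ["hypertension", "type 2 diabetes"],
    "vitals": {"bp": "148/92", "hr": 88},
    "investigations": {
        "ecg": "ST depression in leads V4-V6",
        "troponin": "0.02 ng/mL (normal)",
    },
    "question": "What is the best next step in management?",
}

# Serialize to the .json format the models read from.
print(json.dumps(case, indent=2))
```

Structuring cases this way lets the model request individual fields through tools rather than seeing the whole record at once, mirroring how a clinician gathers information incrementally.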

The LLMs used in this study included both proprietary models (GPT-3.5, GPT-4, Gemini Pro) and open-source models (LLaMA v2-70B, Mixtral-8x7B). These models were evaluated on four metrics: correctness of the final answer, judicious use of tools, conformity to guidelines, and resistance to hallucinations. Performance was assessed by specialty and by case difficulty.
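The four metrics could be recorded per case with a small data structure along these lines. This is a sketch under stated assumptions: the 0-to-1 scale and the unweighted aggregate are illustrative choices, not the paper's actual rubric, which reports each metric separately.

```python
from dataclasses import dataclass

@dataclass
class CaseScore:
    """Per-case scores on a 0-1 scale; the scale is an illustrative assumption."""
    correctness: float               # final answer matches the reference
    tool_use: float                  # investigations ordered were relevant
    guideline_conformity: float      # recommendations follow clinical guidelines
    hallucination_resistance: float  # absence of fabricated tests or findings

    def overall(self) -> float:
        # Simple unweighted mean; the paper reports metrics separately.
        return (self.correctness + self.tool_use
                + self.guideline_conformity + self.hallucination_resistance) / 4

score = CaseScore(1.0, 0.8, 0.8, 1.0)
print(score.overall())  # 0.9
```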

Results

Correctness of Final Answer

GPT-4 demonstrated superior performance compared to other models, excelling in Cardiology (80% correctness), Genetics (100%), and Critical Care (100%). With more complex cases, all models showed a decline in performance, with proprietary models generally outperforming open-source ones.

Judicious Use of Tools

GPT-4 also excelled in the judicious use of diagnostic tools, maintaining logical and directed use across most specialties. The model's precision in selecting relevant investigations stood out, outperforming the others particularly in Cardiology and Genetics. Identity reshaping via prompt engineering had notable effects: models behaved differently when assigned the role of a "Clinical Geneticist".

Conformity to Guidelines

With RAG enabled, GPT-4 showed a marked improvement in guideline adherence, averaging roughly 10% better performance than the other models. RAG significantly enhanced the model's ability to tailor recommendations based on specifically retrieved guidelines, although conforming to complex guidelines remained a challenge.

Resistance to Hallucinations

All models exhibited minimal hallucinations, with GPT-3.5 performing best overall. Errors were predominantly related to incorrect naming of laboratory tests. The open-source models displayed more hallucinatory tendencies in Emergency Medicine cases, particularly LLaMA v2-70B, which scored poorly in this regard.

Discussion

This research underscores that LLMs have vast potential beyond their role as medical databases. They can reason and autonomously navigate clinical scenarios, akin to a clinician practicing evidence-based medicine. Their ability to perform next-word prediction extends to making informed clinical decisions by iteratively building up patient context via tool utilization.

The implications of this study are significant for clinical decision support systems, especially in resource-constrained settings. LLMs can serve as triage specialists or be the first point of patient contact, synthesizing patient history and clinical findings to inform subsequent care. Additionally, the models can alleviate clinician workload by summarizing patient records, thereby mitigating information overload.

The use of RAG highlights a critical element in medical applications of LLMs—ongoing updates and accurate contextual fetching are essential due to the evolving nature of medical knowledge. Models equipped with RAG can integrate up-to-date information dynamically, enhancing their clinical relevance.
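A minimal sketch of the RAG step might look as follows: retrieve the guideline snippet most relevant to the query, then prepend it to the prompt so the model's answer is grounded in current guidance. The guideline texts and the word-overlap retriever are illustrative assumptions; production systems typically use embedding-based similarity search instead.

```python
# Minimal RAG sketch: pick the guideline snippet with the greatest word
# overlap with the query, then prepend it to the prompt.
# Guideline texts and the scoring function are illustrative, not the paper's.

GUIDELINES = [
    "Stable angina: first-line stress testing before invasive angiography.",
    "STEMI: immediate reperfusion with primary PCI within 90 minutes.",
]

def retrieve(query, docs):
    """Return the document sharing the most words with the query."""
    q = set(query.lower().split())
    return max(docs, key=lambda d: len(q & set(d.lower().split())))

def build_prompt(query):
    snippet = retrieve(query, GUIDELINES)
    return f"Guideline: {snippet}\nQuestion: {query}"

print(build_prompt("best test for stable angina"))
```

Because the guideline store can be updated independently of the model, this pattern lets recommendations track evolving medical knowledge without retraining.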

Future work will focus on integrating larger, more sophisticated multi-modal models capable of handling text, images, and videos. These advancements will further improve the accuracy and reliability of LLMs in real-world clinical settings. Additionally, work on reducing the propensity for hallucinations through refined prompt engineering and systematic updates will be essential.

In conclusion, the study demonstrates that LLMs, including advanced models like GPT-4, can act as promising autonomous practitioners in evidence-based medicine. By seamlessly integrating with healthcare infrastructures through tools and enhanced capabilities such as RAG, these models provide a transformative approach to clinical practice, ultimately benefiting both clinicians and patients.
