Abstract

Medical errors in clinical text pose significant risks to patient safety. The MEDIQA-CORR 2024 shared task focuses on detecting and correcting these errors across three subtasks: identifying the presence of an error, extracting the erroneous sentence, and generating a corrected sentence. In this paper, we present our approach, which achieved top performance in all three subtasks. For the MS dataset, which contains subtle errors, we developed a retrieval-based system leveraging external medical question-answering datasets. For the UW dataset, which reflects more realistic clinical notes, we created a pipeline of modules to detect, localize, and correct errors. Both approaches utilized the DSPy framework for optimizing prompts and few-shot examples in LLM-based programs. Our results demonstrate the effectiveness of LLM-based programs for medical error correction. However, our approach has limitations in addressing the full diversity of potential errors in medical documentation. We discuss the implications of our work and highlight future research directions to advance the robustness and applicability of medical error detection and correction systems.

Figure: Overview of the UW dataset pipeline stages (error detection, localization, and correction), built with DSPy and the MIPRO teleprompter.

Overview

  • The paper discusses the use of AI, especially LLMs, to detect and correct medical errors in clinical documentation, showing high effectiveness in this task.

  • Different strategies and technologies, including the DSPy framework and the use of the MS and UW datasets, were employed to address the distinct challenges of error detection and correction.

  • The research suggests significant potential for LLMs in improving patient safety and documentation accuracy, with specific methodologies showing promising results in error precision and correction quality.

Effective Approaches to Detecting and Correcting Medical Errors in Clinical Texts Using LLMs

Introduction

The importance of enhancing patient safety through accurate detection and correction of medical errors in clinical documentation has been increasingly recognized, and AI, especially LLMs, has shown promising capabilities for this task. This paper elaborates on the methodologies employed to address the MEDIQA-CORR 2024 shared task, whose objective was to detect and correct errors in clinical texts. The task was structured around three subtasks: error detection, error sentence extraction, and error correction, with our approach achieving top performance in all three.

Task Description

The MEDIQA-CORR 2024 shared task provided two distinct datasets, MS and UW, for evaluating systems on error detection and correction in clinical notes. The datasets differ in character: the MS dataset contains subtle errors, often unnoticeable without deep analysis, while the UW dataset reflects more apparent and realistic clinical errors, so each necessitated a distinct approach. Evaluation metrics varied by subtask, assessing systems on their accuracy in detecting errors and on the quality of corrections using ROUGE, BERTScore, and BLEURT, among other metrics.
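To make the correction metrics concrete, the sketch below implements a ROUGE-1-style unigram F1 between a candidate correction and a reference sentence. This is a simplified stand-in, not the task's official scorer, and the function name `rouge1_f1` is our own; the shared task combined several metrics (including BERTScore and BLEURT), which require model-based scoring omitted here.

```python
from collections import Counter

def rouge1_f1(candidate: str, reference: str) -> float:
    """ROUGE-1 F1: unigram overlap between a candidate correction and a reference.
    A toy approximation of one component of the task's aggregate score."""
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    overlap = sum((cand & ref).values())  # matched unigram count
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)
```

An exact match scores 1.0, and scores degrade smoothly as the candidate drifts from the reference, which is why surface metrics like this are usually paired with semantic ones such as BERTScore.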

Approach

Our methodology comprised two tailored approaches for the MS and UW datasets respectively:

  • For the MS dataset, a retrieval-based system was used. This system leveraged external medical question-answering datasets, retrieving related questions and their correct answers to ascertain and rectify subtle errors in clinical texts.
  • For the UW dataset, a more direct approach was used, involving a series of modules to detect, localize, and correct errors. This method proved effective given the more overt error types within these realistic clinical notes.
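The UW pipeline's three stages can be sketched as a single function that threads a note through detection, localization, and correction. This is a minimal illustration under our own assumptions: `llm` stands in for a model call (e.g., a GPT-4-backed DSPy module), and the prompts and the function name `correct_note` are illustrative, not the paper's exact code.

```python
from typing import Callable, List, Optional

def correct_note(sentences: List[str], llm: Callable[[str], str]) -> Optional[str]:
    """Sketch of a detect -> localize -> correct pipeline over a clinical note,
    given as a list of sentences. Returns the corrected sentence, or None if
    the note is judged error-free."""
    # Stage 1: error detection over the whole note.
    verdict = llm("Does this clinical note contain a medical error? " + " ".join(sentences))
    if verdict.strip().lower().startswith("no"):
        return None
    # Stage 2: error localization, returning the index of the faulty sentence.
    idx = int(llm("Which sentence (0-based index) is erroneous? " + " | ".join(sentences)))
    # Stage 3: error correction, rewriting only the flagged sentence.
    return llm("Rewrite this sentence so it is medically correct: " + sentences[idx])
```

Decomposing the task this way keeps each prompt small and lets each stage be optimized (and evaluated) independently, which matches the subtask structure of the shared task.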

Both strategies incorporated the DSPy framework, which facilitated prompt optimization and the selection of few-shot examples for LLMs such as GPT-4.

Results and Discussion

The strategies employed demonstrated high efficacy across all subtasks. Our approach for the MS dataset leveraged related medical questions from external databases to identify and correct subtle errors. For the UW dataset, the sequential processing of detection, localization, and then correction allowed for systematic handling of errors.
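The retrieval step for the MS dataset can be caricatured as matching a sentence under review against a bank of question-answer pairs. The helper name `retrieve_best_match` and the token-overlap scoring are our simplifications; the actual system drew on external medical QA datasets with a more capable retriever.

```python
from typing import List, Tuple

def retrieve_best_match(sentence: str, qa_bank: List[Tuple[str, str]]) -> Tuple[str, str]:
    """Toy retriever: return the (question, answer) pair whose question shares
    the most tokens with the sentence under review. The retrieved answer can
    then be compared against the sentence to spot a contradicted fact."""
    tokens = set(sentence.lower().split())
    return max(qa_bank, key=lambda qa: len(tokens & set(qa[0].lower().split())))
```

Once a closely related question is retrieved, its known-correct answer gives the system an external ground truth to check the clinical statement against, which is what makes subtle factual errors detectable at all.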

In detail, performance metrics revealed an accuracy of 86.5% in detecting the presence of errors and 83.6% in pinpointing the erroneous sentence. On the error correction subtask, our system achieved a strong aggregate score across multiple evaluation metrics, demonstrating its ability to craft context-appropriate corrections.

Implications and Future Research

The implications of these successes are twofold. Theoretically, this research underlines the potential of LLMs to enhance documentation accuracy and patient safety by automating the detection and correction of errors in clinical notes. Practically, tailoring methodologies to datasets of varying complexity could guide the design of nuanced AI tools that adapt to the specifics of a given medical documentation challenge.

Given the limitations in terms of the variety and complexity of medical errors that could be handled, future research could explore broader datasets encompassing a range of realistic errors. Advancements might also include refining LLM frameworks or integrating more domain-specific knowledge bases to further enhance the accuracy and relevance of error corrections.

Conclusion

Overall, the research presents a significant advance in employing AI, particularly LLMs, to detect and correct errors in clinical texts. While our methodologies have established a robust foundation on the current tasks, continued research and development will push the boundaries of what AI can achieve in supporting clinical documentation integrity and, thereby, patient care standards.
