RadioRAG: Factual Large Language Models for Enhanced Diagnostics in Radiology Using Dynamic Retrieval Augmented Generation (2407.15621v1)

Published 22 Jul 2024 in cs.CL, cs.AI, and cs.LG

Abstract: LLMs have advanced the field of AI in medicine. However, LLMs often generate outdated or inaccurate information based on static training datasets. Retrieval augmented generation (RAG) mitigates this by integrating outside data sources. While previous RAG systems used pre-assembled, fixed databases with limited flexibility, we have developed Radiology RAG (RadioRAG) as an end-to-end framework that retrieves data from authoritative radiologic online sources in real time. RadioRAG is evaluated using a dedicated radiologic question-and-answer dataset (RadioQA). We evaluate the diagnostic accuracy of various LLMs when answering radiology-specific questions with and without access to additional online information via RAG. Using 80 questions from the RSNA Case Collection across radiologic subspecialties and 24 additional expert-curated questions, for which the correct gold-standard answers were available, LLMs (GPT-3.5-turbo, GPT-4, Mistral-7B, Mixtral-8x7B, and Llama3 [8B and 70B]) were prompted with and without RadioRAG. RadioRAG retrieved context-specific information from www.radiopaedia.org in real time and incorporated it into its reply. RadioRAG consistently improved diagnostic accuracy across all LLMs, with relative improvements ranging from 2% to 54%. It matched or exceeded question answering without RAG across radiologic subspecialties, particularly in breast imaging and emergency radiology. However, the degree of improvement varied among models; GPT-3.5-turbo and Mixtral-8x7B-instruct-v0.1 saw notable gains, while Mistral-7B-instruct-v0.2 showed no improvement, highlighting variability in its effectiveness. LLMs benefit when provided access to domain-specific data beyond their training data. For radiology, RadioRAG establishes a robust framework that substantially improves diagnostic accuracy and factuality in radiological question answering.

Summary

The paper introduces a dynamic Retrieval Augmented Generation approach that boosts diagnostic accuracy by integrating real-time, authoritative radiological data.
It leverages key-phrase extraction and vector embeddings to retrieve precise context from sources like radiopaedia.org for informed LLM responses.
Results show relative accuracy improvements ranging from 2% to 54% across models, supporting its potential for cost-effective, real-time clinical decision support.


Introduction

The paper "RadioRAG: Factual LLMs for Enhanced Diagnostics in Radiology Using Dynamic Retrieval Augmented Generation" investigates an implementation of Retrieval Augmented Generation (RAG) tailored to radiology-specific questions. The approach targets two persistent problems with LLM output in the medical domain: factual inaccuracy and outdated information.

Motivation and Background

LLMs like GPT-4 and Llama3 have shown potential in several parts of the clinical workflow, from automated machine learning for clinical data interpretation to structured data extraction from free-text reports. A persistent challenge, however, is their reliance on static and potentially outdated training data, which can lead to inaccurate or biased output. Conventional strategies such as human feedback mechanisms and prompt engineering do not fully mitigate this. RAG addresses the problem by letting the model draw on external data sources when it answers, rather than relying on its training data alone.

RadioRAG Framework

RadioRAG is an end-to-end framework that uses RAG to improve diagnostic accuracy in radiology. Unlike earlier RAG systems that rely on pre-compiled static databases, RadioRAG retrieves and integrates information from authoritative radiological sources such as www.radiopaedia.org in real time. The framework is assessed using two novel datasets: RSNA-RadioQA, derived from the Radiological Society of North America (RSNA) Case Collection (80 questions, per the abstract), and RadioQA, an expert-curated dataset (24 additional questions) designed to minimize data contamination from training sets.

Methodology

The framework consists of multiple components:

Key-phrase Extraction: The system employs GPT-3.5-turbo to extract up to five key-phrases from user queries, enhancing the specificity and relevance of the subsequent retrieval process.
Online Context Retrieval: Using these key-phrases, the system searches relevant articles from radiopaedia.org, which are transformed into vector embeddings and stored in a dynamically created vector database.
Contextual Retriever: The user query is converted into a vector and compared with the stored vectors to retrieve the top three most similar contexts.
LLM Response Generation: The LLM is then prompted to provide answers leveraging the retrieved context, which increases the factuality and relevance of the response.
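The retrieval and prompting steps above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the function names (`embed`, `retrieve_top_k`, `build_prompt`) are hypothetical, a toy bag-of-words vector stands in for the embedding model and vector database (neither is specified in this summary), and the article chunks are assumed to have been fetched already.

```python
import math
import re
from collections import Counter


def embed(text):
    """Toy bag-of-words vector; a stand-in for a real embedding model."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def retrieve_top_k(query, chunks, k=3):
    """Return the k chunks most similar to the query (k=3 as in the text)."""
    q = embed(query)
    scored = [(cosine(q, embed(c)), i, c) for i, c in enumerate(chunks)]
    # Highest similarity first; ties broken by original position.
    scored.sort(key=lambda t: (-t[0], t[1]))
    return [c for _, _, c in scored[:k]]


def build_prompt(query, contexts):
    """Assemble a prompt that places the retrieved contexts before the question."""
    joined = "\n\n".join(f"[Context {i + 1}] {c}" for i, c in enumerate(contexts))
    return (
        "Use the context below to answer the radiology question.\n\n"
        f"{joined}\n\nQuestion: {query}\nAnswer:"
    )
```

In a real system the prompt returned by `build_prompt` would be sent to the LLM, and `embed` would call a dense embedding model; the ranking logic stays the same.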

Evaluation

RadioRAG was evaluated on questions spanning multiple radiological subspecialties, including breast imaging, musculoskeletal, neuroradiology, and oncologic imaging.

Model Performance:

RadioRAG matched or improved diagnostic accuracy across the tested LLMs; the abstract notes that Mistral-7B-instruct-v0.2 showed no improvement.
GPT-4 and GPT-3.5-turbo saw increases in diagnostic accuracy with improvements ranging from 2% to 11%.
Open-source models like Mixtral-8x7B-instruct-v0.1 and Llama3-8B demonstrated significant accuracy gains up to 47% and 33%, respectively, making them competitive with more complex models like GPT-4 in radiological contexts.
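The abstract describes the gains as relative improvements, so it helps to be explicit about how a relative gain differs from an absolute one in percentage points. The helper below is a small sketch; the function name and the example accuracies are made up for illustration and are not figures from the paper.

```python
def relative_improvement(acc_without_rag, acc_with_rag):
    """Relative change in accuracy: (new - old) / old."""
    if acc_without_rag <= 0:
        raise ValueError("baseline accuracy must be positive")
    return (acc_with_rag - acc_without_rag) / acc_without_rag


# Hypothetical example: accuracy rising from 50% to 60% is a gain of
# 10 percentage points but a 20% relative improvement.
print(f"{relative_improvement(0.50, 0.60):.0%}")  # prints 20%
```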

Statistical Analysis:

The use of bootstrapping with 10,000 redraws and adjusted p-values confirmed the statistical significance of the results.
RadioRAG's improvement in diagnostic accuracy, especially among open-source models, underlines its potential for cost-effective application in medical diagnostics without necessitating extensive retraining.
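A paired bootstrap over questions is one common way to run this kind of analysis. The sketch below is written under assumptions: the summary states only that 10,000 redraws and adjusted p-values were used, so the paired resampling, the percentile interval, the two-sided p-value, and the function name are illustrative choices rather than the paper's exact procedure, and no multiple-comparison adjustment is applied.

```python
import random


def bootstrap_accuracy_gain(correct_without, correct_with, n_redraws=10_000, seed=0):
    """Paired bootstrap of the accuracy difference (with RAG minus without).

    Inputs are per-question 0/1 correctness lists in the same question order.
    Returns (mean difference, 95% percentile interval, two-sided p-value).
    """
    if len(correct_without) != len(correct_with):
        raise ValueError("both lists must cover the same questions")
    rng = random.Random(seed)
    n = len(correct_without)
    diffs = []
    for _ in range(n_redraws):
        # Resample question indices with replacement, keeping pairs together.
        idx = [rng.randrange(n) for _ in range(n)]
        acc_without = sum(correct_without[i] for i in idx) / n
        acc_with = sum(correct_with[i] for i in idx) / n
        diffs.append(acc_with - acc_without)
    diffs.sort()
    lower = diffs[int(0.025 * n_redraws)]
    upper = diffs[int(0.975 * n_redraws) - 1]
    mean_diff = sum(diffs) / n_redraws
    # Share of redraws on each side of zero, doubled for a two-sided test.
    frac_le_zero = sum(d <= 0 for d in diffs) / n_redraws
    frac_ge_zero = sum(d >= 0 for d in diffs) / n_redraws
    p_value = min(1.0, 2 * min(frac_le_zero, frac_ge_zero))
    return mean_diff, (lower, upper), p_value
```

When several models are compared, the resulting p-values would still need a correction for multiple testing, which is what "adjusted" refers to in the text.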

Implications and Future Work

Practically, the framework offers a scalable way to supply LLMs with real-time, authoritative data and so improve the factual accuracy of their diagnostic answers. Theoretically, RadioRAG illustrates how LLMs can serve as dynamic reasoning engines rather than static repositories of pre-encoded knowledge. Future research directions include refining embedding functions and improving retrieval methods to further reduce inaccuracies. Optimizing real-time context retrieval and mitigating the load it places on source websites will also be critical for clinical implementation.

Conclusion

RadioRAG uses dynamic RAG to bridge the gap between static training data and current, factually accurate medical information. The framework improves the diagnostic accuracy of most tested LLMs on radiology questions and provides a basis for further work on AI-supported diagnostics. The publicly available datasets, RSNA-RadioQA and RadioQA, further contribute to the transparency and reproducibility of research in this domain.
