Benchmarking Retrieval-Augmented Generation for Medicine (2402.13178v2)
Abstract: While large language models (LLMs) have achieved state-of-the-art performance on a wide range of medical question answering (QA) tasks, they still struggle with hallucinations and outdated knowledge. Retrieval-augmented generation (RAG) is a promising solution and has been widely adopted. However, a RAG system can involve multiple flexible components, and there is a lack of best practices regarding the optimal RAG settings for various medical purposes. To systematically evaluate such systems, we propose the Medical Information Retrieval-Augmented Generation Evaluation (MIRAGE), a first-of-its-kind benchmark comprising 7,663 questions from five medical QA datasets. Using MIRAGE, we conducted large-scale experiments with over 1.8 trillion prompt tokens on 41 combinations of corpora, retrievers, and backbone LLMs through the MedRAG toolkit introduced in this work. Overall, MedRAG improves the accuracy of six different LLMs by up to 18% over chain-of-thought prompting, elevating GPT-3.5 and Mixtral to GPT-4-level performance. Our results show that combining multiple medical corpora and retrievers achieves the best performance. In addition, we observed a log-linear scaling property and the "lost-in-the-middle" effect in medical RAG. We believe our comprehensive evaluations can serve as practical guidelines for implementing RAG systems for medicine.
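The finding that combining multiple retrievers and corpora works best can be illustrated with reciprocal rank fusion (RRF), a standard technique for merging ranked lists from heterogeneous retrievers. The sketch below is illustrative only: the retriever names and document IDs are hypothetical, and this is not the paper's exact fusion procedure.

```python
# Minimal sketch of fusing rankings from multiple retrievers/corpora
# via reciprocal rank fusion (RRF). Retriever names and doc IDs below
# are hypothetical placeholders, not from the paper.

def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of document IDs into one ranking.

    rankings: list of ranked lists (best first) of document IDs
    k: smoothing constant from the original RRF formula (commonly 60)
    """
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            # Each retriever contributes 1 / (k + rank) to a doc's score.
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Higher fused score = better; sort descending.
    return sorted(scores, key=scores.get, reverse=True)

# Example: a lexical retriever (e.g., BM25 over textbooks) and a dense
# retriever (e.g., over PubMed abstracts) rank snippets differently.
bm25_hits = ["doc_a", "doc_b", "doc_c"]
dense_hits = ["doc_b", "doc_d", "doc_a"]
fused = reciprocal_rank_fusion([bm25_hits, dense_hits])
# doc_b wins: it places highly in both rankings.
```

Documents favored by several retrievers accumulate score from each list, which is one intuition for why multi-retriever, multi-corpus setups outperform any single component.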