Emergent Mind

Benchmarking Retrieval-Augmented Generation for Medicine

Published Feb 20, 2024 in cs.CL and cs.AI


While LLMs have achieved state-of-the-art performance on a wide range of medical question answering (QA) tasks, they still face challenges with hallucinations and outdated knowledge. Retrieval-augmented generation (RAG) is a promising solution and has been widely adopted. However, a RAG system can involve multiple flexible components, and there is a lack of best practices regarding the optimal RAG setting for various medical purposes. To systematically evaluate such systems, we propose the Medical Information Retrieval-Augmented Generation Evaluation (MIRAGE), a first-of-its-kind benchmark including 7,663 questions from five medical QA datasets. Using MIRAGE, we conducted large-scale experiments with over 1.8 trillion prompt tokens on 41 combinations of different corpora, retrievers, and backbone LLMs through the MedRAG toolkit introduced in this work. Overall, MedRAG improves the accuracy of six different LLMs by up to 18% over chain-of-thought prompting, elevating the performance of GPT-3.5 and Mixtral to GPT-4-level. Our results show that the combination of various medical corpora and retrievers achieves the best performance. In addition, we discovered a log-linear scaling property and the "lost-in-the-middle" effects in medical RAG. We believe our comprehensive evaluations can serve as practical guidelines for implementing RAG systems for medicine.

Figure: Overview of the MedRAG toolkit's components.


  • The paper introduces the MIRAGE benchmark and the MedRAG toolkit for evaluating Retrieval-Augmented Generation (RAG) systems in medical question answering. Because a RAG system is modular, its components need to be evaluated systematically and in combination.

  • It reports that RAG improves LLM accuracy by up to 18% over chain-of-thought prompting, and that the gain depends on choosing the right corpus and retriever.

  • Key findings include the effectiveness of comprehensive and domain-specific corpora and retrievers in medical contexts, the relevance of retrieval depth, and the impact of snippet positioning on answer accuracy.

  • The paper concludes with recommendations for future RAG system development in medicine, including corpus and retriever selections and the exploration of new architectures to improve medical QA systems.

Benchmarking Retrieval-Augmented Generation for Medical Question Answering

Introduction to Retrieval-Augmented Generation (RAG) in Medicine

Recent advances in LLMs have substantially improved medical question answering (QA) systems. However, two problems persist: the generation of inaccurate information ("hallucinations") and reliance on outdated knowledge. Both are especially concerning in high-stakes fields like healthcare. Retrieval-Augmented Generation (RAG) has emerged as a promising way to mitigate these issues by grounding LLM responses in relevant documents retrieved from trustworthy sources. A RAG system is modular, comprising a retriever, a corpus, and a backbone LLM, and each component can be swapped independently. This flexibility calls for a comprehensive evaluation to establish best practices for medical use.

The MIRAGE Benchmark and MedRAG Toolkit

To address this need for systematic evaluation, the paper introduces the Medical Information Retrieval-Augmented Generation Evaluation (MIRAGE) benchmark. Comprising 7,663 questions from five medical QA datasets, MIRAGE tests the zero-shot capabilities of RAG systems across various medical question types. Alongside MIRAGE, the authors propose the MedRAG toolkit, which makes it easy to configure and test combinations of RAG components drawn from five corpora, four retrievers, and six LLMs. The toolkit supports both practical deployment of RAG systems in medicine and large-scale analyses of how system configuration relates to benchmark performance.

Insights from the Evaluation

The evaluation of RAG systems on MIRAGE surfaced several key findings:

  • RAG improved LLM accuracy by up to 18% over chain-of-thought prompting. With certain configurations, GPT-3.5 and Mixtral reached GPT-4-level performance.
  • Preference for retrieval corpora varied with the task, highlighting the importance of corpus selection in RAG system configuration. The comprehensive MedCorp corpus, amalgamating multiple sources, emerged as a robust option across tasks, suggesting the value in cross-source retrieval.
  • Among retrievers, domain-specific options like MedCPT showed superior performance in medical contexts. The implementation of fusion methods, such as Reciprocal Rank Fusion, further improved retrieval outcomes by aggregating results from multiple retrievers.
  • The study found a log-linear relationship between model performance and the number of retrieved snippets. It also identified a "lost-in-the-middle" effect, in which answer accuracy depends on where a relevant snippet sits within the prompt context.
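The fusion step mentioned above can be illustrated with a minimal sketch of Reciprocal Rank Fusion. This is not the MedRAG implementation: the function name, the document IDs, and the two example rankings are hypothetical, and the smoothing constant k = 60 is the value commonly used for RRF, assumed here rather than taken from the paper.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of document IDs into one ranking.

    Each document scores sum(1 / (k + rank)) over the lists that contain it,
    with rank starting at 1. k=60 is an assumed, commonly used default.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Sort by descending score; break ties by ID so the output is deterministic.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


# Hypothetical top-3 results from a lexical and a dense retriever.
lexical_hits = ["d1", "d2", "d3"]
dense_hits = ["d3", "d1", "d4"]

fused = reciprocal_rank_fusion([lexical_hits, dense_hits])
print(fused)  # ['d1', 'd3', 'd2', 'd4']
```

Documents that appear near the top of several lists (here d1 and d3) outrank documents that only one retriever returned, which is why fusion can help when retrievers have complementary strengths.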

Future Directions and Recommendations

The analysis enabled by the MIRAGE benchmark and the MedRAG toolkit lays the groundwork for future research on medical RAG systems. Based on the results, the authors offer several practical recommendations, including using a comprehensive corpus such as MedCorp and using domain-specific retrievers, especially for tasks that depend heavily on the relevant literature.

Moreover, the observed performance scaling and snippet positioning effects invite further work on optimizing retrieval depth and snippet order. Incorporating newer RAG architectures and other potentially beneficial resources into MedRAG is another promising avenue for improving the toolkit's utility and reliability in medical QA.
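One way to explore retrieval depth is to fit the log-linear relationship directly, modeling accuracy as a + b · ln(k) for k retrieved snippets. The sketch below does this with ordinary least squares. The function name and the accuracy values are hypothetical placeholders, not results from the paper.

```python
import math


def fit_log_linear(snippet_counts, accuracies):
    """Least-squares fit of accuracy = a + b * ln(k); returns (a, b)."""
    xs = [math.log(k) for k in snippet_counts]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(accuracies) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, accuracies))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


# Hypothetical accuracies at increasing retrieval depths (illustration only).
depths = [1, 2, 4, 8, 16, 32]
observed = [0.60, 0.62, 0.645, 0.66, 0.685, 0.70]

a, b = fit_log_linear(depths, observed)
print(f"accuracy ~ {a:.3f} + {b:.3f} * ln(k)")
# Under this model, each doubling of k adds roughly b * ln(2) accuracy.
print(f"gain per doubling ~ {b * math.log(2):.3f}")
```

Under a log-linear model, each doubling of the snippet count buys about the same absolute gain, so returns diminish as retrieval depth and prompt length grow.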


In conclusion, MIRAGE and MedRAG represent a significant step toward optimizing RAG systems for medical question answering. Through systematic benchmarking, the work shows how RAG configurations can be tailored to maximize accuracy and reliability in medical QA.
