Benchmarking Retrieval-Augmented Generation for Medicine (2402.13178v2)
Abstract: While large language models (LLMs) have achieved state-of-the-art performance on a wide range of medical question answering (QA) tasks, they still struggle with hallucinations and outdated knowledge. Retrieval-augmented generation (RAG) is a promising solution and has been widely adopted. However, a RAG system can involve multiple flexible components, and there is a lack of best practices regarding the optimal RAG settings for various medical purposes. To systematically evaluate such systems, we propose the Medical Information Retrieval-Augmented Generation Evaluation (MIRAGE), a first-of-its-kind benchmark comprising 7,663 questions from five medical QA datasets. Using MIRAGE, we conducted large-scale experiments with over 1.8 trillion prompt tokens on 41 combinations of corpora, retrievers, and backbone LLMs through the MedRAG toolkit introduced in this work. Overall, MedRAG improves the accuracy of six different LLMs by up to 18% over chain-of-thought prompting, elevating GPT-3.5 and Mixtral to GPT-4-level performance. Our results show that combining multiple medical corpora and retrievers achieves the best performance. In addition, we observed a log-linear scaling property and the "lost-in-the-middle" effect in medical RAG. We believe our comprehensive evaluations can serve as practical guidelines for implementing RAG systems for medicine.
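The finding that combining multiple retrievers and corpora works best can be illustrated with reciprocal rank fusion (RRF), a standard technique for merging ranked lists from heterogeneous retrievers. The sketch below is illustrative only: the retriever names and document IDs are hypothetical, and this is not the paper's exact fusion procedure.

```python
# Minimal sketch of fusing rankings from multiple retrievers/corpora
# via reciprocal rank fusion (RRF). Retriever names and doc IDs below
# are hypothetical placeholders, not from the paper.

def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of document IDs into one ranking.

    rankings: list of ranked lists (best first) of document IDs
    k: smoothing constant from the original RRF formula (commonly 60)
    """
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            # Each retriever contributes 1 / (k + rank) to a doc's score.
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Higher fused score = better; sort descending.
    return sorted(scores, key=scores.get, reverse=True)

# Example: a lexical retriever (e.g., BM25 over textbooks) and a dense
# retriever (e.g., over PubMed abstracts) rank snippets differently.
bm25_hits = ["doc_a", "doc_b", "doc_c"]
dense_hits = ["doc_b", "doc_d", "doc_a"]
fused = reciprocal_rank_fusion([bm25_hits, dense_hits])
# doc_b wins: it places highly in both rankings.
```

Documents favored by several retrievers accumulate score from each list, which is one intuition for why multi-retriever, multi-corpus setups outperform any single component.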