Improving Medical Reasoning through Retrieval and Self-Reflection with Retrieval-Augmented Large Language Models
Abstract: Recent proprietary LLMs, such as GPT-4, have achieved a milestone in tackling diverse challenges in the biomedical domain, ranging from multiple-choice questions to long-form generation. To address challenges that cannot be handled with the knowledge encoded in LLMs, various retrieval-augmented generation (RAG) methods have been developed that search documents from a knowledge corpus and append them, unconditionally or selectively, to the LLM input for generation. However, when existing methods are applied to domain-specific problems, they generalize poorly, fetching incorrect documents or making inaccurate judgments. In this paper, we introduce Self-BioRAG, a reliable framework for biomedical text that specializes in generating explanations, retrieving domain-specific documents, and self-reflecting on its generated responses. We use 84k filtered biomedical instruction sets to train Self-BioRAG, which can assess its generated explanations with customized reflective tokens. Our work shows that domain-specific components, such as a retriever, a domain-related document corpus, and instruction sets, are necessary for adhering to domain-related instructions. On three major medical question-answering benchmarks, Self-BioRAG achieves significant performance gains, with a 7.2% absolute improvement on average over the state-of-the-art open-foundation model with 7B or fewer parameters. Overall, our analysis shows that Self-BioRAG finds clues in the question, retrieves relevant documents if needed, and understands how to answer using information from the retrieved documents and its encoded knowledge, as a medical expert does. We release the data and code for training our framework components, as well as model weights (7B and 13B), to enhance capabilities in the biomedical and clinical domains.
- Asai, A. et al. (2023). Self-rag: Learning to retrieve, generate, and critique through self-reflection. arXiv preprint arXiv:2310.11511.
- Bajaj, P. et al. (2016). Ms marco: A human generated machine reading comprehension dataset. arXiv preprint arXiv:1611.09268.
- Cao, M. et al. (2022). Hallucinated but factual! inspecting the factuality of hallucinations in abstractive summarization. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). Association for Computational Linguistics.
- Chen, Z. et al. (2023). Meditron-70b: Scaling medical pretraining for large language models. arXiv preprint arXiv:2311.16079.
- Christiano, P. F. et al. (2017). Deep reinforcement learning from human preferences. Advances in neural information processing systems.
- Chung, H. W. et al. (2022). Scaling instruction-finetuned language models. arXiv preprint arXiv:2210.11416.
- Dao, T. et al. (2022). Flashattention: Fast and memory-efficient exact attention with io-awareness. Advances in Neural Information Processing Systems.
- Fang, Y. et al. (2023). Mol-instructions: A large-scale biomolecular instruction dataset for large language models. arXiv preprint arXiv:2306.08018.
- Guo, G. et al. (2003). Knn model-based approach in classification. In On The Move to Meaningful Internet Systems 2003: CoopIS, DOA, and ODBASE: OTM Confederated International Conferences, CoopIS, DOA, and ODBASE 2003, Catania, Sicily, Italy, November 3-7, 2003. Proceedings.
- Guu, K. et al. (2020). Retrieval augmented language model pre-training. In International conference on machine learning.
- Hendrycks, D. et al. (2020). Measuring massive multitask language understanding. arXiv preprint arXiv:2009.03300.
- Izacard, G. et al. (2022a). Few-shot learning with retrieval augmented language models. arXiv preprint arXiv:2208.03299.
- Izacard, G. et al. (2022b). Unsupervised dense information retrieval with contrastive learning. Transactions on Machine Learning Research.
- Ji, Z. et al. (2023). Survey of hallucination in natural language generation. ACM Computing Surveys.
- Jiang, Z. et al. (2023). Active retrieval augmented generation. arXiv preprint arXiv:2305.06983.
- Jin, D. et al. (2021). What disease does this patient have? a large-scale open domain question answering dataset from medical exams. Applied Sciences.
- Jin, Q. et al. (2019). Pubmedqa: A dataset for biomedical research question answering. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP).
- Jin, Q. et al. (2023). Medcpt: Contrastive pre-trained transformers with large-scale pubmed search logs for zero-shot biomedical information retrieval. Bioinformatics (Oxford, England).
- Kang, M. et al. (2023). Knowledge-augmented reasoning distillation for small language models in knowledge-intensive tasks. arXiv preprint arXiv:2305.18395.
- Karpukhin, V. et al. (2020). Dense passage retrieval for open-domain question answering. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP).
- Kitaev, N. and Klein, D. (2018). Constituency parsing with a self-attentive encoder. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers).
- Kitaev, N. et al. (2019). Multilingual constituency parsing with self-attention and pre-training. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics. Association for Computational Linguistics.
- Kwon, W. et al. (2023). Efficient memory management for large language model serving with pagedattention. In Proceedings of the 29th Symposium on Operating Systems Principles.
- Lewis, P. et al. (2020). Retrieval-augmented generation for knowledge-intensive nlp tasks. Advances in Neural Information Processing Systems.
- Li, B. et al. (2023). Meddm: Llm-executable clinical guidance tree for clinical decision-making. arXiv preprint arXiv:2312.02441.
- Mao, Y. et al. (2021). Generation-augmented retrieval for open-domain question answering. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers).
- Nori, H. et al. (2023). Capabilities of gpt-4 on medical challenge problems. arXiv preprint arXiv:2303.13375.
- OpenAI (2023a). Chatgpt.
- OpenAI (2023b). Openai gpt-4 technical report.
- Ouyang, L. et al. (2022). Training language models to follow instructions with human feedback. Advances in Neural Information Processing Systems.
- Pal, A. et al. (2022). Medmcqa: A large-scale multi-subject multi-choice dataset for medical domain question answering. In Conference on Health, Inference, and Learning.
- Paszke, A. et al. (2019). Pytorch: An imperative style, high-performance deep learning library. Advances in neural information processing systems.
- Rajbhandari, S. et al. (2020). Zero: Memory optimizations toward training trillion parameter models. In SC20: International Conference for High Performance Computing, Networking, Storage and Analysis. IEEE.
- Schulman, J. et al. (2017). Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347.
- Shao, Z. et al. (2023). Enhancing retrieval-augmented large language models with iterative retrieval-generation synergy. arXiv preprint arXiv:2305.15294.
- Singhal, K. et al. (2022). Large language models encode clinical knowledge. arXiv preprint arXiv:2212.13138.
- Taori, R. et al. (2023). Alpaca: A strong, replicable instruction-following model. Stanford Center for Research on Foundation Models. https://crfm.stanford.edu/2023/03/13/alpaca.html.
- Taylor, R. et al. (2022). Galactica: A large language model for science. arXiv preprint arXiv:2211.09085.
- Google (2023). Bard.
- Touvron, H. et al. (2023). Llama 2: Open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288.
- Wang, Y. et al. (2022). Self-instruct: Aligning language model with self generated instructions. arXiv preprint arXiv:2212.10560.
- Wang, Y. et al. (2023). Augmenting black-box llms with medical textbooks for clinical question answering. arXiv preprint arXiv:2309.02233.
- Wang, Z. et al. (2019). Multi-passage bert: A globally normalized bert model for open-domain question answering. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP).
- Wei, J. et al. (2022). Chain-of-thought prompting elicits reasoning in large language models. Advances in Neural Information Processing Systems.
- Wolf, T. et al. (2019). Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771.
- Wu, Z. et al. (2023). Fine-grained human feedback gives better rewards for language model training. arXiv preprint arXiv:2306.01693.
- Zhang, X. et al. (2023). Alpacare: Instruction-tuned large language models for medical application. arXiv preprint arXiv:2310.14558.
Knowledge Gaps, Limitations, and Open Questions
Below is a concise list of what remains missing, uncertain, or unexplored in the paper, formulated to guide future research.
- Lack of clinical/medical expert human evaluation: No clinician ratings of factuality, safety, usefulness, and potential harm, especially for long-form answers where automatic metrics are known to be insufficient.
- Inadequate evaluation of hallucination and faithfulness: No direct measurement of hallucination rates, attribution correctness, or evidence–answer alignment beyond reflective tokens; absence of claim-level verification or groundedness metrics.
- Limited long-form QA assessment: Reliance on ROUGE and BERTScore (acknowledged as insufficient); no task-specific factuality, coverage, calibration, or discourse-coherence evaluations.
- No analysis of explanation faithfulness: The model produces rationales, but their faithfulness to evidence (vs. post-hoc plausible explanations) is not tested.
- Missing human validation of reflective-token labels: GPT-4–generated supervision for reflective tokens is not auditor-validated; label quality, bias, and consistency are unknown.
- Reflective token design underexplored: Only four generic tokens (RET/REL/SUP/USE) are used; no investigation of domain-specific reflective tokens tailored to biomedical reasoning or clinical safety.
- Sensitivity to decoding hyperparameters untested: The weights for reflective tokens (w_G) and the retrieval threshold δ are borrowed from prior work; no task-specific tuning or sensitivity analysis is provided.
- Limited retrieval strategy: Retrieval appears to be single-step, selecting a single "best" evidence passage; multi-hop retrieval, aggregation of multiple passages, and iterative retrieve-then-read strategies are not explored.
- Reranker details and impact unclear: The reranking module architecture, training data, and its ablation impact are not described, leaving its contribution and robustness uncertain.
- Fixed chunking and k without justification: Chunk size (128 words/32 overlap) and top-k (=10) are fixed; no study of how chunking granularity or k affects recall, precision, and downstream performance.
- Source-specific contributions not quantified: Although retrieval ratios per source are reported, there is no causal analysis of how each corpus (PubMed, PMC, CPG, textbooks) individually impacts accuracy or error profiles.
- Domain coverage and out-of-distribution robustness: Generalization to underrepresented biomedical subdomains, evolving guidelines, newly approved drugs, and rare diseases is not assessed.
- Temporal robustness and retraction handling: No mechanism or evaluation for knowledge freshness, retraction detection, or update cadence of the index in a rapidly changing biomedical literature.
- Real-world clinical text and multilingual settings: The approach is not evaluated on clinical notes/EHRs or non-English corpora/questions; privacy, de-identification, and multilingual retrieval are unaddressed.
- Safety, bias, and fairness analyses absent: No assessment of demographic biases, differential performance across subpopulations, or safeguards against harmful recommendations.
- Data contamination safeguards unspecified: The retrieval corpus includes medical textbooks that may overlap with benchmark sources (e.g., MedQA/USMLE-style content); no contamination checks or lineage audits are reported.
- Computational cost and latency unreported: Index size (>560 GB of embeddings), retrieval latency, memory footprint, and end-to-end serving costs are not quantified; feasibility in resource-constrained settings is unclear.
- Comparative baselines could be stronger: No comparison with state-of-the-art biomedical RAG stacks (e.g., BM25+cross-encoder rerankers, advanced dense retrievers, or hybrid ensembles) or larger open models with retrieval.
- Limited task breadth in evaluation: Although instruction sets include IE, summarization, and classification, evaluations are restricted to MCQ and long-form QA; transfer to other biomedical NLP tasks is untested.
- No uncertainty estimation or calibration: The system does not report confidence or abstain under uncertainty; how reflective tokens correlate with calibrated correctness is unknown.
- Failure mode and error taxonomy missing: No qualitative or quantitative error analysis by question type, reasoning step, or retrieval failure, limiting actionable insight for improvement.
- Effects of instruction filtering unclear: The critic-filtered 84k dataset might introduce selection bias (e.g., discarding harder or ambiguous cases); impact on generalization is not analyzed.
- Dependence on proprietary GPT-4 for supervision: The reliance on GPT-4 to produce initial reflective-token labels raises reproducibility concerns and potential implicit leakage from closed-source training data.
- Joint training of critic and generator unexplored: The pipeline trains critic then generator; end-to-end or co-training strategies, or using learned rewards for preference optimization, are not investigated.
- Evidence conflict handling unaddressed: The model’s behavior when retrieved documents disagree (e.g., conflicting guidelines) is not studied; no mechanism for consensus or source prioritization.
- Input length constraints and multi-evidence fusion: While RAG baselines are limited by context length, the proposed method’s own constraints, memory trade-offs, and multi-passage fusion strategies are not examined.
- Ethical, regulatory, and deployment considerations: Pathways for clinical validation, alignment with guidelines (e.g., FDA, MHRA), and human-in-the-loop workflows are not discussed.
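To make the decoding-hyperparameter concern above concrete, the following is a minimal sketch of Self-RAG-style inference (Asai et al., 2023), which the w_G weights and retrieval threshold δ refer to: retrieval is triggered when the probability of a [Retrieve] reflective token exceeds δ, and candidate continuations are ranked by generation likelihood plus weighted reflective-token scores. The function names, token-type labels, and the exact combination rule here are illustrative assumptions, not the paper's released code.

```python
def should_retrieve(p_retrieve: float, delta: float = 0.2) -> bool:
    """Trigger retrieval when P([Retrieve]) exceeds the threshold delta.
    The default delta value here is a hypothetical placeholder."""
    return p_retrieve > delta

def segment_score(gen_log_prob: float,
                  critique_probs: dict,
                  weights: dict) -> float:
    """Rank a (passage, continuation) candidate by its generation
    log-likelihood plus weighted reflective-token scores.

    critique_probs[t] is the normalized probability of the most desirable
    critique token of type t (e.g. 'REL', 'SUP', 'USE'); weights holds the
    per-type w_G values."""
    return gen_log_prob + sum(w * critique_probs.get(t, 0.0)
                              for t, w in weights.items())

# Choose the better of two retrieved-passage candidates (values invented).
weights = {"REL": 1.0, "SUP": 1.0, "USE": 0.5}  # hypothetical w_G settings
candidates = [
    (-12.3, {"REL": 0.9, "SUP": 0.8, "USE": 0.7}),  # well-grounded answer
    (-11.8, {"REL": 0.2, "SUP": 0.1, "USE": 0.6}),  # fluent but ungrounded
]
best = max(candidates, key=lambda c: segment_score(c[0], c[1], weights))
```

Under this scoring, the well-grounded candidate wins despite its slightly lower generation likelihood, which is exactly why the choice of w_G and δ deserves the sensitivity analysis the list above calls for.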
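The fixed chunking scheme questioned above (128-word chunks with 32-word overlap) can be sketched in a few lines; this is an illustrative reconstruction of sliding-window word chunking under those stated parameters, not the authors' implementation.

```python
def chunk_document(text: str, chunk_size: int = 128, overlap: int = 32) -> list:
    """Split a document into overlapping word-level chunks.

    With the paper's reported settings (128-word chunks, 32-word overlap),
    each chunk advances by 96 new words over the previous one."""
    words = text.split()
    stride = chunk_size - overlap  # 96 fresh words per step
    chunks = []
    for start in range(0, len(words), stride):
        chunk = words[start:start + chunk_size]
        if not chunk:
            break
        chunks.append(" ".join(chunk))
        if start + chunk_size >= len(words):
            break  # final chunk already covers the end of the document
    return chunks
```

A sensitivity study would sweep `chunk_size`, `overlap`, and top-k over the index and measure retrieval recall and downstream accuracy, which is the gap the bullet identifies.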