JMLR: Joint Medical LLM and Retrieval Training for Enhancing Reasoning and Professional Question Answering Capability (2402.17887v4)

Published 27 Feb 2024 in cs.CL and cs.IR

Abstract: LLMs have demonstrated remarkable potential in medical knowledge acquisition and question answering. However, LLMs can hallucinate and yield factually incorrect outcomes, even with domain-specific pretraining. Previously, retrieval-augmented generation (RAG) had only limited success in addressing hallucinations. Unlike previous RAG methods, in which the retrieval model was trained separately from the LLM, we introduce JMLR (for Jointly trains LLM and information Retrieval) during the fine-tuning phase. The synchronized training mechanism enhances JMLR's ability to retrieve clinical guidelines and leverage medical knowledge to reason and answer questions, and it reduces the demand for computational resources. We evaluated JMLR on important medical question-answering applications. Our experimental results demonstrate that JMLR-13B (70.5%) outperforms the previous state-of-the-art open-source model trained with conventional pre-training and fine-tuning, Meditron-70B (68.9%), as well as Llama2-13B with RAG (67.7%), on a medical question-answering dataset. Comprehensive evaluations reveal that JMLR-13B enhances reasoning quality and reduces hallucinations better than Claude3-Opus. Additionally, JMLR-13B trains much faster than Meditron-70B (148 vs. 42,630 GPU hours). Through this work, we provide a new and efficient knowledge-enhancement method for healthcare, demonstrating the potential of integrating retrieval and LLM training for medical question-answering systems.


Summary

  • The paper presents JMLR, a method that synchronizes LLM and retriever training to enhance reasoning and reduce hallucinations in medical question-answering.
  • It employs a dual-parameter optimization with a unique LLM-Rank loss to improve document relevance and answer quality.
  • Experimental results across multiple medical datasets show JMLR outperforming larger models with far lower computational costs.

Joint Medical LLM and Retrieval Training for Enhancing Reasoning and Professional Question Answering Capability

This paper introduces JMLR, a novel approach to improving the accuracy and reasoning capabilities of medical LLMs by jointly training LLMs and information retrieval systems. The paper demonstrates JMLR's effectiveness in enhancing medical question-answering systems by reducing the hallucinations common in LLM outputs and optimizing computational efficiency.

Introduction to JMLR

JMLR stands out by integrating the training of the retriever with the LLM, deviating from traditional methods where these components are trained separately. Conventional approaches train a retriever to fetch relevant documents and then fine-tune an LLM on the retrieved data (Figure 1). JMLR instead synchronizes the training of the retriever and the LLM, aligning their operations to jointly improve performance on medical question-answering tasks.

Figure 1: Comparison between different domain adaptation methods: traditional domain pretraining (left), RAG (middle), and JMLR (right). JMLR retrieves documents to reduce hallucination.
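
To make the contrast concrete, here is a minimal, illustrative sketch of the conventional retrieve-then-generate setup that JMLR departs from. The modules, dimensions, and conditioning scheme below are toy stand-ins assumed for the example, not the paper's code or models; the point is that retrieval is a fixed preprocessing step and only the LLM receives gradients.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Toy stand-ins for a separately trained retriever and an LLM.
retriever = nn.Linear(128, 128)
llm = nn.Linear(128, 32000)
for p in retriever.parameters():
    p.requires_grad_(False)                    # retriever stays frozen

optimizer = torch.optim.AdamW(llm.parameters(), lr=2e-5)

query = torch.randn(1, 128)                    # encoded question
docs = torch.randn(8, 128)                     # candidate guideline passages
target_token = torch.tensor([123])             # gold answer token (toy)

# Retrieval is a fixed preprocessing step: no gradient ever reaches it.
with torch.no_grad():
    scores = retriever(query) @ docs.T         # (1, 8)
    top_doc = docs[scores.argmax(dim=-1)]      # (1, 128)

# Only the LLM is fine-tuned on question + retrieved text (toy conditioning).
logits = llm(query + top_doc)
loss = F.cross_entropy(logits, target_token)

optimizer.zero_grad()
loss.backward()
optimizer.step()                               # updates the LLM only
```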

Implementation Strategy

The paper details the architectural setup for JMLR, emphasizing its dual-parameter optimization process: retriever and LLM parameters are updated simultaneously by gradient descent on a combined loss function, so the retriever learns to surface contextually helpful documents that improve the LLM's responses.
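
As a rough illustration of what such a joint update could look like, the sketch below combines an answer-generation loss with a document-ranking term and backpropagates once through both parameter sets. The stand-in modules, the 0.5 loss weight, and the additive conditioning are assumptions made for the example, not details taken from the paper.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)

# Toy stand-ins for a dense retriever encoder and an LLM head.
retriever = nn.Linear(128, 128)
llm = nn.Linear(128, 32000)

# One optimizer, two parameter groups: both components are updated together.
optimizer = torch.optim.AdamW([
    {"params": retriever.parameters(), "lr": 1e-5},
    {"params": llm.parameters(), "lr": 2e-5},
])

query = torch.randn(1, 128)          # encoded question
docs = torch.randn(4, 128)           # 4 candidate guideline passages
target_token = torch.tensor([123])   # gold answer token (toy)

# Retriever relevance scores for each candidate document.
doc_scores = (retriever(query) @ docs.T).squeeze(0)        # (4,)

# LLM answer loss when each document is used as context
# (toy conditioning: add the document embedding to the query embedding).
per_doc_logits = llm(query + docs)                         # (4, 32000)
per_doc_loss = F.cross_entropy(per_doc_logits,
                               target_token.repeat(4),
                               reduction="none")           # (4,)

# Combined objective: learn to answer, and learn to rank documents by how
# much they help the answer (the ranking term is expanded in the next sketch).
answer_loss = per_doc_loss.mean()
rank_loss = F.kl_div(F.log_softmax(doc_scores, dim=-1),
                     F.softmax(-per_doc_loss.detach(), dim=-1),
                     reduction="batchmean")
loss = answer_loss + 0.5 * rank_loss                       # 0.5 is an assumed weight

optimizer.zero_grad()
loss.backward()                                            # gradients reach both models
optimizer.step()
```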

The implementation relies on an LLM-Rank loss, a mechanism for assessing the impact of each retrieved document on the quality of the LLM's answer. This lets the retriever prioritize documents by their utility in improving LLM performance.
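
The summary does not spell out the loss, but ranking objectives of this kind are commonly implemented (as in REPLUG-style training) as a KL divergence between the retriever's score distribution and a target distribution derived from per-document LLM answer losses. The function below sketches that assumed form; the name `llm_rank_loss` and the `temperature` parameter are illustrative, and the paper's exact definition may differ.

```python
import torch
import torch.nn.functional as F

def llm_rank_loss(doc_scores: torch.Tensor,
                  per_doc_answer_loss: torch.Tensor,
                  temperature: float = 1.0) -> torch.Tensor:
    """doc_scores: (k,) retriever relevance scores for k candidate documents.
    per_doc_answer_loss: (k,) LLM loss on the gold answer when each document
    is provided as context; lower loss means a more useful document."""
    retriever_logprobs = F.log_softmax(doc_scores / temperature, dim=-1)
    # Documents that lower the LLM's answer loss get higher target probability.
    target_probs = F.softmax(-per_doc_answer_loss.detach() / temperature, dim=-1)
    return F.kl_div(retriever_logprobs, target_probs, reduction="batchmean")

# Toy usage: the third document (index 2) helps the LLM most, so minimizing
# this loss pushes the retriever to score it highest.
scores = torch.tensor([0.2, 0.1, 0.4, -0.3], requires_grad=True)
answer_losses = torch.tensor([2.1, 2.4, 0.9, 3.0])
loss = llm_rank_loss(scores, answer_losses)
loss.backward()
print(round(loss.item(), 4), scores.grad)
```

In this sketch, detaching the per-document answer losses means the ranking term trains only the retriever, while the LLM continues to learn from the standard answer-generation loss.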

Experimental Setup and Results

The authors conducted extensive experiments across multiple datasets, including MMLU-Medical, MedMCQA, MedQA, and Amboss, demonstrating JMLR's superior performance on medical question-answering tasks compared to existing state-of-the-art models like Meditron and ChatGPT (Figure 2).

Figure 2: JMLR achieved the highest average accuracy across the MMLU-Medical, MedMCQA, MedQA, and Amboss datasets while using only 148 GPU hours.

The results indicate that JMLR models with up to 13 billion parameters consistently outperform larger models such as Meditron-70B in both accuracy and computational efficiency. JMLR cuts training time dramatically, requiring only 148 GPU hours compared to Meditron's 42,630 GPU hours.

Discussion and Implications

The success of JMLR underscores the importance of integrating document retrieval directly into LLM training processes, especially in domains where accuracy is paramount, like healthcare. By reducing hallucination and enhancing retrieval effectiveness, JMLR offers a computationally efficient solution for deploying medical question-answering systems.

The implications of this approach are broad: it could transform how AI supports clinical decision-making by providing reliable, contextually grounded answers and making crucial medical knowledge more accessible.

Conclusion

JMLR represents a significant advance in combining retrieval mechanisms with LLM training, showcasing improvements in accuracy, reasoning capability, and efficiency. This research opens avenues for further exploration of synchronized training mechanisms in LLMs, with potential applications extending beyond the medical domain wherever robust information retrieval can help combat hallucination.
