
ClinicalGPT: Large Language Models Finetuned with Diverse Medical Data and Comprehensive Evaluation (2306.09968v1)

Published 16 Jun 2023 in cs.CL

Abstract: LLMs have exhibited exceptional performance on various NLP tasks, leveraging techniques such as pre-training and instruction fine-tuning. Despite these advances, their effectiveness in medical applications is limited due to challenges such as factual inaccuracies, weak reasoning abilities, and a lack of grounding in real-world experience. In this study, we present ClinicalGPT, an LLM explicitly designed and optimized for clinical scenarios. By incorporating extensive and diverse real-world data, such as medical records, domain-specific knowledge, and multi-round dialogue consultations in the training process, ClinicalGPT is better prepared to handle multiple clinical tasks. Furthermore, we introduce a comprehensive evaluation framework that includes medical knowledge question-answering, medical exams, patient consultations, and diagnostic analysis of medical records. Our results demonstrate that ClinicalGPT significantly outperforms other models on these tasks, highlighting the effectiveness of our approach in adapting LLMs to the critical domain of healthcare.


Summary

  • The paper presents ClinicalGPT's fine-tuning of BLOOM-7B using supervised and reinforcement learning to enhance clinical accuracy.
  • It leverages diverse datasets such as cMedQA2, MedDialog, and MD-EHR to optimize medical dialogues and examination performance.
  • It achieves high BLEU and ROUGE scores and an 80.9% diagnostic accuracy, underscoring robust performance on critical medical tasks.

ClinicalGPT: Fine-tuning LLMs for Medical Applications

LLMs have shown impressive capabilities across various NLP tasks. However, their application in the medical domain faces significant challenges due to the need for high accuracy, reasoning abilities, and real-world grounding. This essay discusses "ClinicalGPT: LLMs Finetuned with Diverse Medical Data and Comprehensive Evaluation" (2306.09968), which presents ClinicalGPT, an LLM specifically optimized for clinical tasks through extensive fine-tuning using diverse medical datasets.

Methodology

Dataset Composition

ClinicalGPT was trained using a wide range of datasets such as cMedQA2, cMedQA-KG, MD-EHR, MEDQA-MCMLE, and MedDialog. These datasets contributed various forms of medical data:

  • cMedQA2: A Chinese medical question-answer dataset.
  • cMedQA-KG: Developed from knowledge graphs focused on disease, medication, and symptom entities.
  • MEDQA-MCMLE: Encompasses multiple-choice medical exam questions.
  • MedDialog: Includes multi-turn dialogues resembling real doctor-patient interactions.
  • MD-EHR: Electronic Health Records dataset covering various disease groups.
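The paper does not publish its preprocessing code, but folding such heterogeneous sources into one instruction-tuning corpus typically reduces to mapping each record onto an (instruction, input, output) triple. A minimal sketch, assuming illustrative field names (the dataset names are from the paper; the schemas here are hypothetical):

```python
def to_example(source, record):
    """Map one raw record to an (instruction, input, output) triple.
    Field names like "question" and "ehr_text" are illustrative only."""
    if source == "cMedQA2":          # single-turn medical QA
        return ("Answer the patient's medical question.",
                record["question"], record["answer"])
    if source == "MEDQA-MCMLE":      # multiple-choice exam questions
        options = "\n".join(f"{k}. {v}" for k, v in record["options"].items())
        return ("Choose the correct answer to the exam question.",
                record["question"] + "\n" + options, record["answer_key"])
    if source == "MD-EHR":           # diagnosis from an EHR summary
        return ("Give the most likely diagnosis for this medical record.",
                record["ehr_text"], record["diagnosis"])
    raise ValueError(f"unknown source: {source}")

example = to_example("MD-EHR", {"ehr_text": "55-year-old male, epigastric pain",
                                "diagnosis": "Gastric ulcer"})
```

Multi-turn MedDialog conversations would need an extra flattening step (one training example per assistant turn), which is omitted here for brevity.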

Fine-tuning Process

The training strategy entailed a robust instruction-tuning approach with Supervised Fine Tuning (SFT) and Reinforcement Learning (RL). Using the BLOOM-7B LLM as the base, ClinicalGPT was fine-tuned to generate accurate and coherent medical texts. This approach involved:

  • Supervised Fine Tuning: Optimizing the likelihood generation of responses based on input prompts.
  • Reinforcement Learning: Employing a reward model to enhance output quality, optimized with Proximal Policy Optimization (PPO) under a penalty that limits divergence from the initial SFT policy.

    Figure 1: The overview of ClinicalGPT.
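In RLHF-style PPO pipelines of this kind, the optimized signal is commonly the reward-model score minus a KL penalty against the SFT reference policy. A sketch of that shaping term, not the authors' exact implementation (the coefficient `beta` is a hypothetical hyperparameter):

```python
def kl_shaped_reward(rm_score, policy_logprobs, ref_logprobs, beta=0.1):
    """Sequence-level reward for KL-regularized PPO: the reward-model
    score minus a penalty proportional to the (sampled) KL divergence
    between the current policy and the frozen SFT reference."""
    # Monte Carlo estimate of KL(pi || pi_ref) over the sampled tokens:
    # sum of (log pi(token) - log pi_ref(token)).
    kl = sum(p - r for p, r in zip(policy_logprobs, ref_logprobs))
    return rm_score - beta * kl

# If the policy assigns its sampled tokens higher log-probability than the
# reference does, the KL term reduces the effective reward.
r = kl_shaped_reward(rm_score=1.0,
                     policy_logprobs=[-0.5, -0.4],
                     ref_logprobs=[-1.0, -0.9],
                     beta=0.1)
```

The penalty is what "manages divergence from the initial state": without it, PPO can drift into reward-model exploits that no longer read like coherent clinical text.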

Experimental Results

Medical Conversations

ClinicalGPT demonstrated superior performance in generating relevant and coherent responses in medical dialogues, as indicated by metrics such as BLEU and ROUGE. Specifically, it achieved the highest BLEU-1 and ROUGE scores, highlighting its proficiency in context preservation and informativeness.
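BLEU-1, the headline dialogue metric here, is clipped unigram precision scaled by a brevity penalty. A self-contained sketch (whitespace tokenization for simplicity; real evaluations use a proper tokenizer):

```python
import math
from collections import Counter

def bleu1(candidate, reference):
    """BLEU-1: clipped unigram precision times a brevity penalty."""
    cand, ref = candidate.split(), reference.split()
    overlap = Counter(cand) & Counter(ref)      # clip counts to the reference
    precision = sum(overlap.values()) / len(cand)
    # Brevity penalty: punish candidates shorter than the reference.
    bp = 1.0 if len(cand) > len(ref) else math.exp(1 - len(ref) / len(cand))
    return bp * precision

score = bleu1("the patient has a fever", "the patient has a mild fever")
# All five candidate unigrams match, but the candidate is one token short,
# so the brevity penalty lowers the score below 1.0.
```

ROUGE is the recall-oriented counterpart (overlap divided by the reference length), so the two together probe both precision and coverage of the generated response.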

Medical Examinations

When tested on the MEDQA-MCMLE dataset, ClinicalGPT outperformed other models across eight medical categories, particularly excelling in Rheumatic immune conditions. This indicates its strong clinical reasoning and knowledge integration capabilities for medical exams.

Diagnosis

In diagnosing conditions within the MD-EHR dataset, ClinicalGPT exceeded other models with an average accuracy of 80.9%. It displayed notable competence in the Digestive and Urinary categories, although there is room for improvement in Gynecology and Hematology.
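The reported 80.9% is an average over disease groups; per-category diagnostic accuracy reduces to a grouped exact-match rate, which can be sketched as follows (category and diagnosis labels are illustrative, not from the dataset):

```python
from collections import defaultdict

def accuracy_by_category(records):
    """records: iterable of (category, predicted_dx, true_dx) triples.
    Returns exact-match accuracy per disease category."""
    hits, totals = defaultdict(int), defaultdict(int)
    for category, pred, true in records:
        totals[category] += 1
        hits[category] += int(pred == true)
    return {c: hits[c] / totals[c] for c in totals}

acc = accuracy_by_category([
    ("Digestive", "Gastric ulcer", "Gastric ulcer"),
    ("Digestive", "Appendicitis", "Appendicitis"),
    ("Gynecology", "Endometriosis", "Ovarian cyst"),
])
# acc["Digestive"] == 1.0, acc["Gynecology"] == 0.0
```

Breaking accuracy out this way is what surfaces the uneven profile the paper reports: strong Digestive and Urinary results alongside weaker Gynecology and Hematology.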

Medical Question Answering

Using an evaluation benchmark with GPT-4, ClinicalGPT consistently surpassed BLOOM-7B, LLAMA-7B, and ChatGLM-6B in accuracy, helpfulness, and safety of responses. This confirms ClinicalGPT’s ability to generate reliable medical advice and information.
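LLM-as-judge evaluations of this kind usually present the judge with a question and two candidate answers and ask for a verdict along the stated axes. The paper's exact rubric is not public; the prompt wording below is purely illustrative:

```python
def build_judge_prompt(question, answer_a, answer_b):
    """Pairwise-comparison prompt for an LLM judge (e.g. GPT-4).
    The evaluation axes mirror those named in the paper; the phrasing
    is a hypothetical reconstruction."""
    return (
        "You are evaluating two answers to a medical question on "
        "accuracy, helpfulness, and safety.\n"
        f"Question: {question}\n"
        f"Answer A: {answer_a}\n"
        f"Answer B: {answer_b}\n"
        "Reply with 'A', 'B', or 'Tie'."
    )

prompt = build_judge_prompt("What causes iron-deficiency anemia?",
                            "Chronic blood loss or poor iron intake.",
                            "It is always genetic.")
```

In practice such setups also randomize the A/B order across trials to cancel the judge's position bias.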

Conclusion

The development of ClinicalGPT highlights the potential of tailoring LLMs specifically for medical applications through comprehensive fine-tuning with clinically relevant data. The model demonstrates enhanced performance across various medical tasks, establishing a robust framework for deploying LLMs in clinical settings. Although ClinicalGPT shows promise in transforming medical AI applications, ongoing fine-tuning and adaptation are necessary to address specific domains and datasets, augmenting its capabilities as medical AI continues to evolve.
