
ClinicalGPT: Large Language Models Finetuned with Diverse Medical Data and Comprehensive Evaluation (2306.09968v1)

Published 16 Jun 2023 in cs.CL

Abstract: LLMs have exhibited exceptional performance on various NLP tasks, leveraging techniques such as pre-training and instruction fine-tuning. Despite these advances, their effectiveness in medical applications is limited due to challenges such as factual inaccuracies, weak reasoning abilities, and a lack of grounding in real-world experience. In this study, we present ClinicalGPT, an LLM explicitly designed and optimized for clinical scenarios. By incorporating extensive and diverse real-world data, such as medical records, domain-specific knowledge, and multi-round dialogue consultations in the training process, ClinicalGPT is better prepared to handle multiple clinical tasks. Furthermore, we introduce a comprehensive evaluation framework that includes medical knowledge question-answering, medical exams, patient consultations, and diagnostic analysis of medical records. Our results demonstrate that ClinicalGPT significantly outperforms other models on these tasks, highlighting the effectiveness of our approach in adapting LLMs to the critical domain of healthcare.


Summary

  • The paper presents ClinicalGPT's fine-tuning of BLOOM-7B using supervised and reinforcement learning to enhance clinical accuracy.
  • It leverages diverse datasets such as cMedQA2, MedDialog, and MD-EHR to optimize medical dialogues and examination performance.
  • It achieves high BLEU and ROUGE scores and an 80.9% diagnostic accuracy, underscoring robust performance on critical medical tasks.

ClinicalGPT: Fine-tuning LLMs for Medical Applications

LLMs have shown impressive capabilities across various NLP tasks. However, their application in the medical domain faces significant challenges due to the need for high accuracy, reasoning abilities, and real-world grounding. This essay discusses "ClinicalGPT: LLMs Finetuned with Diverse Medical Data and Comprehensive Evaluation" (2306.09968), which presents ClinicalGPT, an LLM specifically optimized for clinical tasks through extensive fine-tuning using diverse medical datasets.

Methodology

Dataset Composition

ClinicalGPT was trained using a wide range of datasets such as cMedQA2, cMedQA-KG, MD-EHR, MEDQA-MCMLE, and MedDialog. These datasets contributed various forms of medical data:

  • cMedQA2: A Chinese medical question-answer dataset.
  • cMedQA-KG: Developed from knowledge graphs focused on disease, medication, and symptom entities.
  • MEDQA-MCMLE: Encompasses multiple-choice medical exam questions.
  • MedDialog: Includes multi-turn dialogues resembling real doctor-patient interactions.
  • MD-EHR: Electronic Health Records dataset covering various disease groups.
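The paper does not publish its preprocessing code, but folding such heterogeneous sources into one instruction-tuning corpus typically reduces to mapping each record onto an (instruction, input, output) triple. A minimal sketch, assuming illustrative field names (the dataset names are from the paper; the schemas here are hypothetical):

```python
def to_example(source, record):
    """Map one raw record to an (instruction, input, output) triple.
    Field names like "question" and "ehr_text" are illustrative only."""
    if source == "cMedQA2":          # single-turn medical QA
        return ("Answer the patient's medical question.",
                record["question"], record["answer"])
    if source == "MEDQA-MCMLE":      # multiple-choice exam questions
        options = "\n".join(f"{k}. {v}" for k, v in record["options"].items())
        return ("Choose the correct answer to the exam question.",
                record["question"] + "\n" + options, record["answer_key"])
    if source == "MD-EHR":           # diagnosis from an EHR summary
        return ("Give the most likely diagnosis for this medical record.",
                record["ehr_text"], record["diagnosis"])
    raise ValueError(f"unknown source: {source}")

example = to_example("MD-EHR", {"ehr_text": "55-year-old male, epigastric pain",
                                "diagnosis": "Gastric ulcer"})
```

Multi-turn MedDialog conversations would need an extra flattening step (one training example per assistant turn), which is omitted here for brevity.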

Fine-tuning Process

The training strategy entailed a robust instruction-tuning approach with Supervised Fine Tuning (SFT) and Reinforcement Learning (RL). Using the BLOOM-7B LLM as the base, ClinicalGPT was fine-tuned to generate accurate and coherent medical texts. This approach involved:

  • Supervised Fine Tuning: Optimizing the likelihood generation of responses based on input prompts.
  • Reinforcement Learning: Employing a reward model to enhance output quality, optimized with Proximal Policy Optimization (PPO) under a penalty that limits divergence from the initial SFT policy.

    Figure 1: The overview of ClinicalGPT.
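In RLHF-style PPO pipelines of this kind, the optimized signal is commonly the reward-model score minus a KL penalty against the SFT reference policy. A sketch of that shaping term, not the authors' exact implementation (the coefficient `beta` is a hypothetical hyperparameter):

```python
def kl_shaped_reward(rm_score, policy_logprobs, ref_logprobs, beta=0.1):
    """Sequence-level reward for KL-regularized PPO: the reward-model
    score minus a penalty proportional to the (sampled) KL divergence
    between the current policy and the frozen SFT reference."""
    # Monte Carlo estimate of KL(pi || pi_ref) over the sampled tokens:
    # sum of (log pi(token) - log pi_ref(token)).
    kl = sum(p - r for p, r in zip(policy_logprobs, ref_logprobs))
    return rm_score - beta * kl

# If the policy assigns its sampled tokens higher log-probability than the
# reference does, the KL term reduces the effective reward.
r = kl_shaped_reward(rm_score=1.0,
                     policy_logprobs=[-0.5, -0.4],
                     ref_logprobs=[-1.0, -0.9],
                     beta=0.1)
```

The penalty is what "manages divergence from the initial state": without it, PPO can drift into reward-model exploits that no longer read like coherent clinical text.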

Experimental Results

Medical Conversations

ClinicalGPT demonstrated superior performance in generating relevant and coherent responses in medical dialogues, as indicated by metrics such as BLEU and ROUGE. Specifically, it achieved the highest BLEU-1 and ROUGE scores, highlighting its proficiency in context preservation and informativeness.
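BLEU-1, the headline dialogue metric here, is clipped unigram precision scaled by a brevity penalty. A self-contained sketch (whitespace tokenization for simplicity; real evaluations use a proper tokenizer):

```python
import math
from collections import Counter

def bleu1(candidate, reference):
    """BLEU-1: clipped unigram precision times a brevity penalty."""
    cand, ref = candidate.split(), reference.split()
    overlap = Counter(cand) & Counter(ref)      # clip counts to the reference
    precision = sum(overlap.values()) / len(cand)
    # Brevity penalty: punish candidates shorter than the reference.
    bp = 1.0 if len(cand) > len(ref) else math.exp(1 - len(ref) / len(cand))
    return bp * precision

score = bleu1("the patient has a fever", "the patient has a mild fever")
# All five candidate unigrams match, but the candidate is one token short,
# so the brevity penalty lowers the score below 1.0.
```

ROUGE is the recall-oriented counterpart (overlap divided by the reference length), so the two together probe both precision and coverage of the generated response.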

Medical Examinations

When tested on the MEDQA-MCMLE dataset, ClinicalGPT outperformed other models across eight medical categories, particularly excelling in Rheumatic immune conditions. This indicates its strong clinical reasoning and knowledge integration capabilities for medical exams.

Diagnosis

In diagnosing conditions within the MD-EHR dataset, ClinicalGPT exceeded other models with an average accuracy of 80.9%. It displayed notable competence in the Digestive and Urinary categories, although there is room for improvement in Gynecology and Hematology.
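The reported 80.9% is an average over disease groups; per-category diagnostic accuracy reduces to a grouped exact-match rate, which can be sketched as follows (category and diagnosis labels are illustrative, not from the dataset):

```python
from collections import defaultdict

def accuracy_by_category(records):
    """records: iterable of (category, predicted_dx, true_dx) triples.
    Returns exact-match accuracy per disease category."""
    hits, totals = defaultdict(int), defaultdict(int)
    for category, pred, true in records:
        totals[category] += 1
        hits[category] += int(pred == true)
    return {c: hits[c] / totals[c] for c in totals}

acc = accuracy_by_category([
    ("Digestive", "Gastric ulcer", "Gastric ulcer"),
    ("Digestive", "Appendicitis", "Appendicitis"),
    ("Gynecology", "Endometriosis", "Ovarian cyst"),
])
# acc["Digestive"] == 1.0, acc["Gynecology"] == 0.0
```

Breaking accuracy out this way is what surfaces the uneven profile the paper reports: strong Digestive and Urinary results alongside weaker Gynecology and Hematology.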

Medical Question Answering

Using an evaluation benchmark with GPT-4, ClinicalGPT consistently surpassed BLOOM-7B, LLAMA-7B, and ChatGLM-6B in accuracy, helpfulness, and safety of responses. This confirms ClinicalGPT’s ability to generate reliable medical advice and information.
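LLM-as-judge evaluations of this kind usually present the judge with a question and two candidate answers and ask for a verdict along the stated axes. The paper's exact rubric is not public; the prompt wording below is purely illustrative:

```python
def build_judge_prompt(question, answer_a, answer_b):
    """Pairwise-comparison prompt for an LLM judge (e.g. GPT-4).
    The evaluation axes mirror those named in the paper; the phrasing
    is a hypothetical reconstruction."""
    return (
        "You are evaluating two answers to a medical question on "
        "accuracy, helpfulness, and safety.\n"
        f"Question: {question}\n"
        f"Answer A: {answer_a}\n"
        f"Answer B: {answer_b}\n"
        "Reply with 'A', 'B', or 'Tie'."
    )

prompt = build_judge_prompt("What causes iron-deficiency anemia?",
                            "Chronic blood loss or poor iron intake.",
                            "It is always genetic.")
```

In practice such setups also randomize the A/B order across trials to cancel the judge's position bias.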

Conclusion

The development of ClinicalGPT highlights the potential of tailoring LLMs specifically for medical applications through comprehensive fine-tuning with clinically relevant data. The model demonstrates enhanced performance across various medical tasks, establishing a robust framework for deploying LLMs in clinical settings. Although ClinicalGPT shows promise in transforming medical AI applications, ongoing fine-tuning and adaptation are necessary to address specific domains and datasets, augmenting its capabilities as medical AI continues to evolve.
