From Beginner to Expert: Modeling Medical Knowledge into General LLMs (2312.01040v3)
Abstract: Recently, LLM-based AI systems have demonstrated remarkable capabilities in natural language understanding and generation. However, these models face a significant challenge in sensitive applications, such as reasoning over medical knowledge and answering medical questions in a physician-like manner. Prior studies attempted to overcome this challenge by increasing the model size (>100B) to learn more general medical knowledge, while there is still room for improvement in LLMs with smaller model sizes (<100B). In this work, we start from a pre-trained general LLM (AntGLM-10B) and fine-tune it from a medical beginner towards a medical expert (called AntGLM-Med-10B) through a 3-stage optimization procedure: general medical knowledge injection, medical domain instruction tuning, and specific medical task adaptation. Our contributions are threefold: (1) We specifically investigate how to adapt a pre-trained general LLM to the medical domain, especially to a specific medical task. (2) We collect and construct large-scale medical datasets for each stage of the optimization process. These datasets encompass various data types and tasks, such as question answering, medical reasoning, multiple-choice questions, and medical conversations. (3) For multiple-choice questions in the medical domain, we propose a novel Verification-of-Choice approach to prompt engineering, which significantly enhances the reasoning ability of LLMs. Remarkably, by combining the above approaches, our AntGLM-Med-10B model outperforms most LLMs on PubMedQA, including both general and medical LLMs, even when those LLMs have larger model sizes.
Explain it Like I'm 14
What this paper is about (big picture)
This paper shows how to teach a general-purpose large language model (LLM) to become much better at medical questions. The team starts with a regular LLM called AntGLM-10B and turns it into a medical specialist called AntGLM-Med-10B. They do this in three steps that are a lot like school: first the model “reads” medical knowledge, then it practices following medical instructions, and finally it trains hard on a specific kind of medical test. The goal is to make a smaller model (10 billion parameters) perform like a medical expert, even compared to larger models.
What the researchers wanted to find out
Here are the main questions they asked, in simple terms:
- How can we turn a general AI into a medical expert without making it huge?
- What training steps help most for medical skills: general reading, instruction practice, or focused test prep?
- Can smart prompting (like checking each multiple-choice option carefully) make the AI reason better on medical exams?
- Can a smaller, well-trained model beat bigger ones on tough medical benchmarks (like PubMedQA)?
How they trained the model (the approach, explained simply)
Think of the model as a student going from beginner to expert. The team used a three-stage plan:
- Stage 1: General Medical Knowledge Injection
- Analogy: “Reading the textbook.”
- The model keeps pre-training on lots of medical text to learn the basics. This included:
- Medical books (clear, structured knowledge)
- Knowledge graphs (like a big map of medical facts turned into sentences)
- Medical Q&A pairs (real consultations)
- Exam questions (converted into readable facts)
- Medical articles (like PubMed abstracts and science articles)
- Goal: Build a strong medical foundation.
- Stage 2: Medical Domain Instruction Tuning
- Analogy: “Practicing with a teacher’s instructions.”
- The model learns to follow medical task instructions across many formats:
- Answering questions, reasoning, multiple-choice, and conversations
- This helps it understand how to respond properly to different medical prompts.
- Stage 3: Specific Medical Task Adaptation
- Analogy: “Focused test prep for a particular exam.”
- They target multiple-choice medical questions, especially the PubMedQA test (questions about medical research papers that have answers like yes/no/maybe).
- They introduce a new prompting method called Verification-of-Choice (VoC):
- CoT (Chain-of-Thought) is like “show your work” for one answer.
- CoVe (Chain-of-Verification) is like “double-check your answer.”
- VoC goes further for multiple-choice: the model writes its reasoning for each choice (A, B, C), compares them, spots mistakes or mismatches, and then picks the best one. It’s like a student explaining every option before selecting the answer (a rough prompting sketch appears a little further below).
- They also use:
- LoRA/adapters: think “plug-in skills” that are small add-ons instead of retraining the whole brain, making training faster and cheaper.
- C-Poly (multi-task adapters): like having shared skill modules plus task-specific modules, so the model can learn common tricks across tasks and still specialize (a toy sketch of this shared-plus-private idea follows this list).
- Perplexity ranking: the model scores each candidate answer by how “surprised” it would be to read it, then picks the least surprising one (see the ranking sketch below).
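To make the adapter idea concrete, here is a toy PyTorch sketch, not the paper's code: a frozen base layer wrapped with low-rank “skill” modules in the LoRA style, extended with a shared skill pool and a per-task private skill in the spirit of C-Poly. Every class, argument, and shape here is illustrative.

```python
# Toy illustration only: a frozen base layer wrapped with low-rank adapters.
# Shared "skills" are mixed per task by a small router; each task also gets a
# private skill. Names and shapes are hypothetical, not the paper's C-Poly code.
import torch
import torch.nn as nn

class SharedPlusPrivateAdapter(nn.Module):
    def __init__(self, d_model: int, rank: int, n_shared_skills: int, n_tasks: int):
        super().__init__()
        self.base = nn.Linear(d_model, d_model)   # stands in for a pretrained weight
        self.base.weight.requires_grad_(False)    # the "brain" stays frozen
        self.base.bias.requires_grad_(False)
        # Pool of shared low-rank skills (down/up projections), usable by every task.
        self.shared_down = nn.Parameter(0.01 * torch.randn(n_shared_skills, d_model, rank))
        self.shared_up = nn.Parameter(torch.zeros(n_shared_skills, rank, d_model))  # zero init: no effect at start
        # One private low-rank skill per task.
        self.private_down = nn.Parameter(0.01 * torch.randn(n_tasks, d_model, rank))
        self.private_up = nn.Parameter(torch.zeros(n_tasks, rank, d_model))
        # Learned per-task mixing weights over the shared pool (a tiny "router").
        self.router = nn.Parameter(torch.zeros(n_tasks, n_shared_skills))

    def forward(self, x: torch.Tensor, task_id: int) -> torch.Tensor:
        out = self.base(x)
        # Blend the shared skills according to this task's routing weights.
        mix = torch.softmax(self.router[task_id], dim=-1)        # (n_shared_skills,)
        down = torch.einsum("s,sdr->dr", mix, self.shared_down)  # (d_model, rank)
        up = torch.einsum("s,srd->rd", mix, self.shared_up)      # (rank, d_model)
        out = out + x @ down @ up
        # Add this task's private skill on top.
        out = out + x @ self.private_down[task_id] @ self.private_up[task_id]
        return out

layer = SharedPlusPrivateAdapter(d_model=16, rank=4, n_shared_skills=3, n_tasks=2)
y = layer(torch.randn(8, 16), task_id=0)   # same frozen base, task-conditioned adaptation
```

Only the small adapter and router tensors receive gradients, which is what makes this style of tuning cheap relative to full fine-tuning.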
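And here is a minimal sketch of perplexity ranking for a yes/no/maybe question, assuming a HuggingFace causal LM. GPT-2 stands in for AntGLM-Med-10B, which is not publicly available, and the prompt wording is made up.

```python
# Minimal sketch of perplexity ranking: score each candidate answer by how
# "surprised" the model is when reading it after the question, and pick the
# least surprising one. GPT-2 is a stand-in model; the prompt is illustrative.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

def answer_perplexity(question: str, answer: str) -> float:
    """Perplexity of `answer` given `question` as context (answer tokens only)."""
    prompt_ids = tokenizer(question, return_tensors="pt").input_ids
    full_ids = tokenizer(question + " " + answer, return_tensors="pt").input_ids
    labels = full_ids.clone()
    # Mask the question tokens so only the answer tokens contribute to the loss.
    # (Assumes the question tokenization is a prefix of the full tokenization,
    # which holds for typical BPE tokenizers when the answer starts with a space.)
    labels[:, : prompt_ids.shape[1]] = -100
    with torch.no_grad():
        loss = model(full_ids, labels=labels).loss   # mean NLL over answer tokens
    return torch.exp(loss).item()

question = "Does regular exercise reduce the risk of type 2 diabetes? Answer:"
choices = ["yes", "no", "maybe"]
print(min(choices, key=lambda c: answer_perplexity(question, c)))
```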
They also used VoC to label the unlabeled training data (PQA-U) by having the model generate its own answers, like making “practice tests” with answer keys to learn from.
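Here is a rough sketch of what a Verification-of-Choice style prompt and that pseudo-labeling loop could look like. The template wording, function names, and data layout are illustrative, not the paper's actual prompt; `ask_model` is whatever text-in, text-out LLM call you have available.

```python
# Illustrative VoC-style prompting: reason about every option, cross-check the
# rationales, then commit to one answer. Also shows a simple pseudo-labeling
# loop over unlabeled items. None of this is the paper's exact template.
from typing import Callable, Dict, List

def build_voc_prompt(question: str, context: str, choices: List[str]) -> str:
    lines = [
        "Context:", context, "",
        "Question:", question, "",
        "For EACH option below, write a short argument for why it could be correct,",
        "citing the context. Then cross-check the arguments against each other,",
        "point out any that contradict the context, and finally output",
        "'Final answer: <option>' on its own line.",
        "",
    ]
    for label, choice in zip("ABCDE", choices):
        lines.append(f"Option {label}: {choice}")
    return "\n".join(lines)

def parse_final_answer(response: str, choices: List[str]) -> str:
    """Look for a 'Final answer:' line and map it back to one of the choices."""
    for line in reversed(response.splitlines()):
        if line.lower().startswith("final answer"):
            tail = line.split(":", 1)[-1].strip().lower()
            for label, choice in zip("abcde", choices):
                if tail.startswith(label) or choice.lower() in tail:
                    return choice
    return ""

def pseudo_label(unlabeled: List[Dict], ask_model: Callable[[str], str]) -> List[Dict]:
    """Attach VoC-generated answers to unlabeled question/context/choices items."""
    labeled = []
    for item in unlabeled:
        prompt = build_voc_prompt(item["question"], item["context"], item["choices"])
        answer = parse_final_answer(ask_model(prompt), item["choices"])
        if answer:   # keep only items where a clear final answer was extracted
            labeled.append({**item, "label": answer})
    return labeled
```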
What they found and why it matters
Here are the standout results:
- Big accuracy jump across stages:
- Before any medical training, the model scored about 57.2% on PubMedQA.
- After the full three-stage process, it reached 80.6%.
- This shows each step added real value, especially the final, focused adaptation.
- Competitive performance with a smaller model:
- AntGLM-Med-10B (10B parameters) scored 80.6% on PubMedQA.
- That’s close to the very best models (like Med-PaLM 2 at 81.8%) and better than many larger models.
- Lesson: Smart training can beat raw size.
- Verification-of-Choice helps:
- Using VoC to label the unlabeled data improved results further.
- In short, explaining and checking each option before choosing boosts reasoning accuracy.
- Multi-task training with adapters (C-Poly) works well:
- Training on several related datasets together helped more than training on one at a time.
- Shared “common skills” plus “task-specific skills” made learning more efficient.
Why this is important: It shows you don’t need the biggest model to get expert-level performance if you train cleverly. This can make high-quality medical AI more accessible and efficient.
What this could mean for the future
- Better medical AI with fewer resources: Hospitals, clinics, and researchers could run strong models without always needing giant, expensive systems.
- Safer, more reliable reasoning: Techniques like Verification-of-Choice encourage the model to think carefully, not just guess—useful in sensitive fields like medicine.
- A roadmap for other expert fields: The same three-stage approach (read widely, practice instructions, then specialize) could help build expert AIs in law, finance, or engineering.
- Still, use with care: Even with strong results, medical AI should support—not replace—health professionals. Testing, supervision, and ethics remain crucial.
In short, this paper presents a smart training plan that turns a general AI into a capable medical assistant, showing that better “study habits” can rival sheer size.
Knowledge Gaps
Knowledge gaps, limitations, and open questions
Below is a concise list of what remains missing, uncertain, or unexplored in the paper, written to be concrete and actionable for future research.
- Evaluation scope is narrow:
- Results focus primarily on PubMedQA; comprehensive benchmarks (e.g., MedQA-USMLE, MedMCQA, MMLU-Clinical, MultiMedQA) are not reported with standard splits, hindering claims about broad medical competence.
- No real-world, clinician-authored, or workflow-oriented evaluations (e.g., case vignettes, differential diagnosis, triage, guideline adherence).
- Reproducibility and evaluation protocol clarity:
- Inconsistent PubMedQA results (80.6% “main result” vs. 88%+ in fine-tuning table) with unclear splits, seeds, and whether the official test set and evaluation server were used.
- Several ablations use 500-sample subsets without specifying sampling protocols, seeds, or whether they match standard dev/test splits.
- Lack of full details on prompt templates, hyperparameters per stage, early stopping criteria, and selection of best checkpoints.
- Potential data contamination and leakage:
- PubMed abstracts are used for pretraining; only PQA-L abstracts are excluded, but possible overlaps with other PubMedQA subsets or evaluation-like data are not systematically deduplicated or reported.
- PQA-U pseudo-labels are generated using the dataset’s “long answers,” which may encode answer cues; deduplication and leakage audits (hashing, n-gram overlap, semantic similarity) are not provided.
- Quality and ethics of training data:
- Synthetic rewrites (knowledge graphs, exam questions) lack quality control studies (factual accuracy, consistency, bias).
- “Purchased professional medical articles” introduce licensing, usage rights, and provenance questions; no licensing or ethical review details provided.
- Real-world QA pairs may contain sensitive information; privacy protection, de-identification, and IRB/ethics approvals are not discussed.
- Cross-lingual generalization is untested:
- Training leverages extensive Chinese and English data, but performance on Chinese medical benchmarks (e.g., CMB, C-Eval medical tracks, CMExam, Huatuo-related test suites) is not reported, leaving bilingual transfer unknown.
- Generalization beyond multiple-choice is unproven:
- The approach and VoC prompting are tailored to multiple-choice; effectiveness on open-ended, long-form, or evidence-grounded medical QA (e.g., justification with citations) remains untested.
- No evaluation on tasks like summarization of clinical notes, guideline recommendation synthesis, or SOAP note generation.
- Verification-of-Choice (VoC) analysis is limited:
- No direct comparison with strong reasoning baselines (self-consistency, majority vote over CoT samples, debate, verifiers/critics, PoT/program-of-thought, tool-augmented verification).
- Computational overhead (latency, token budget) of VoC is not quantified; cost–benefit trade-offs are unknown.
- VoC’s robustness with >3 options, longer contexts, and adversarial distractors is untested.
- No analysis of VoC failure modes (e.g., confirmation bias, inconsistent rationales).
- Hallucination and faithfulness not rigorously measured:
- Claims about reduced hallucination via verification are not supported by standardized metrics (e.g., fact-score, attribution fidelity) or third-party fact-checking benchmarks.
- No tests of rationale faithfulness (e.g., causal scrubbing, input perturbation studies) for CoT/VoC explanations.
- Uncertainty and calibration remain open:
- Perplexity-based selection is used ad hoc without calibration evaluation (ECE, Brier score), selective prediction, or abstention strategies for safety-critical settings (a minimal ECE sketch appears after this list).
- Safety, harm, and bias:
- No safety alignment (RLHF with clinical constraints, constitutional health policies), bias/disparity audits, or toxicity/unsafe advice evaluations.
- No adverse event simulation or harm-reduction protocols; no mechanism for refusal/triage when uncertain.
- Catastrophic forgetting and retained general abilities:
- The impact of medical specialization on general-domain capabilities is unmeasured (pre/post comparisons on general benchmarks like MMLU, HELM, BIG-Bench, or GLUE/SuperGLUE).
- Multi-task adapter method (C-Poly) limitations:
- Authors note that C-Poly’s router “cannot index and effectively predict untrained unknown tasks”; generalization and out-of-distribution task routing remain unresolved.
- No comparison with alternative MTL/PEFT methods (e.g., mixture-of-adapters, prompt tuning, (IA)³, LoRA variants with routing, sparsely gated MoE) under equal compute.
- Scaling and data ablations are missing:
- No systematic scaling study across model sizes (e.g., 7B/13B/34B) or data sizes per corpus; contributions of each corpus (books, KGs, exams, QA, articles) to final performance remain opaque.
- No study of continual pretraining duration vs. gains, or of instruction-tuning mixture composition and sampling strategies.
- Training efficiency and environmental impact:
- Wall-clock time, total tokens seen, compute budget, and energy usage are not reported; implications for reproducibility and sustainability are unclear.
- Error analysis is absent:
- No breakdown by question type (causal, statistical, trial design), linguistic phenomena (negation, numerical reasoning), clinical domain (cardio, oncology), or class (yes/no/maybe) to guide targeted improvements.
- Robustness and security:
- No robustness evaluation against noisy abstracts, contradicting evidence, prompt injection, or distribution shifts (e.g., newer PubMed years, non-PubMed sources).
- Retrieval and tool augmentation:
- The approach is purely parametric; benefits of retrieval-augmented generation (e.g., PubMed search with grounding and citation), tool use (calculators, guidelines), or external verifiers are unexplored.
- Deployment constraints:
- Inference latency, memory footprint, and throughput—especially with VoC—are not quantified; suitability for clinical settings with time constraints is unknown.
- Transparency and release:
- It is unclear whether the model, prompts, and curated datasets (with licenses) will be released; lack of artifacts limits independent verification and broader impact.
- Pseudo-labeling of PQA-U remains uncertain:
- No human auditing of pseudo-label accuracy or estimated noise rate; no confidence-thresholding, self-training iterations, or co-teaching strategies to mitigate label noise.
- Outdated or incomplete baselines:
- Comparisons exclude several recent open-source medical LLMs (e.g., LLaMA-2/3-based med models, BioGPT-XL, ClinicalCamel, PMC-LLaMA, MedAlpaca variants), limiting the strength of claims about competitiveness.
- Clinical integration and governance:
- No discussion of how the system would integrate with clinical workflows, EHRs, audit trails, or regulatory requirements (e.g., ISO/IEC, FDA/EMA guidance for clinical AI).
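As one concrete example of the kind of audit this list calls for, below is a minimal expected calibration error (ECE) sketch over made-up yes/no/maybe predictions; a real audit would plug in the model's own confidence scores and correctness flags.

```python
# Minimal ECE sketch with synthetic inputs: bin predictions by confidence and
# average the |accuracy - confidence| gap, weighted by bin size.
import numpy as np

def expected_calibration_error(confidences, correct, n_bins: int = 10) -> float:
    confidences = np.asarray(confidences, dtype=float)
    correct = np.asarray(correct, dtype=float)
    bin_edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(bin_edges[:-1], bin_edges[1:]):
        in_bin = (confidences > lo) & (confidences <= hi)
        if in_bin.any():
            gap = abs(correct[in_bin].mean() - confidences[in_bin].mean())
            ece += in_bin.mean() * gap
    return ece

# Toy example: five predictions, their confidences, and whether each was right.
print(expected_calibration_error([0.9, 0.8, 0.55, 0.95, 0.6], [1, 1, 0, 1, 1]))
```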