Models such as GPT-4 and Med-PaLM 2 have demonstrated impressive performance on a wide variety of biomedical NLP tasks. However, these models have hundreds of billions of parameters, are computationally expensive to run, require users to send their input data over the internet, and are trained on unknown data sources. Can smaller, more targeted models compete? To address this question, we build and release BioMedLM, a 2.7 billion parameter GPT-style autoregressive model trained exclusively on PubMed abstracts and full articles. When fine-tuned, BioMedLM can produce strong multiple-choice biomedical question-answering results competitive with much larger models, such as achieving a score of 57.3% on MedMCQA (dev) and 69.0% on the MMLU Medical Genetics exam. BioMedLM can also be fine-tuned to produce useful answers to patient questions on medical topics. This demonstrates that smaller models can potentially serve as transparent, privacy-preserving, economical and environmentally friendly foundations for particular NLP applications, such as in biomedicine. The model is available on the Hugging Face Hub: https://huggingface.co/stanford-crfm/BioMedLM.
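Since the checkpoint is public, it can be loaded with the standard Hugging Face transformers API. The sketch below assumes default loading options and uses an illustrative prompt and generation settings; it is not the paper's evaluation setup.

```python
# Minimal sketch: loading the released BioMedLM checkpoint from the
# Hugging Face Hub. Prompt and generation settings are illustrative only.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "stanford-crfm/BioMedLM"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

prompt = "Metformin is a first-line treatment for"
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=40, do_sample=False)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```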
BioMedLM is a 2.7 billion parameter model trained exclusively on PubMed abstracts and full articles, and it achieves competitive performance on biomedical NLP tasks.
Designed as a GPT-style autoregressive model trained solely on PubMed data, BioMedLM prioritizes efficiency and domain specialization; while pretraining used a large GPU cluster, the resulting model can be fine-tuned and run on modest hardware.
It achieves strong results on biomedical question-answering benchmarks, outperforming or closely rivaling larger, general-purpose models on specific tasks.
BioMedLM's approach addresses key concerns in healthcare NLP applications, including data privacy, cost, and environmental impact.
In recent years, language models such as GPT-4 and Med-PaLM 2 have significantly advanced NLP across various domains, including biomedicine. However, their vast size, proprietary nature, and resource-intensive demands pose serious practical limitations, especially for applications requiring data privacy, cost-effectiveness, and environmental sustainability. Addressing these challenges, the study introduces BioMedLM, a 2.7 billion parameter model trained exclusively on PubMed abstracts and full articles. BioMedLM demonstrates competitive performance against significantly larger counterparts on biomedical NLP tasks such as multiple-choice question answering and generating useful answers to patient medical questions.
BioMedLM is architected as a GPT-style autoregressive model with a domain-specific tokenizer trained to handle biomedical terminology efficiently. Unlike large-scale general models, BioMedLM's training exclusively leverages PubMed data, aiming at improved efficiency in biomedical contexts without the computational and financial overhead associated with larger models. Pretraining was executed on 128 40GB Nvidia A100 GPUs; at 2.7 billion parameters, however, the resulting model can be fine-tuned and run on far more modest hardware than models with hundreds of billions of parameters.
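To illustrate the role of the domain-specific tokenizer, here is a minimal sketch of training a byte-level BPE tokenizer on biomedical text with the Hugging Face tokenizers library. The corpus file name and vocabulary size are hypothetical placeholders, not the paper's actual configuration.

```python
# Illustrative sketch: training a byte-level BPE tokenizer on biomedical
# text, in the spirit of BioMedLM's domain-specific tokenizer.
from tokenizers import ByteLevelBPETokenizer

tokenizer = ByteLevelBPETokenizer()
tokenizer.train(
    files=["pubmed_abstracts.txt"],  # hypothetical PubMed text dump
    vocab_size=32000,                # illustrative size, not the paper's
    min_frequency=2,
)

# A domain tokenizer splits biomedical terms into fewer pieces than a
# general-purpose one, shortening sequences and preserving word identity.
print(tokenizer.encode("chromatography").tokens)
```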
BioMedLM's performance was evaluated across a suite of biomedical question-answering tasks including MedMCQA, MedQA, MMLU, PubMedQA, and BioASQ. Notably, BioMedLM achieved a score of 57.3% on MedMCQA (dev) and 69.0% on the MMLU Medical Genetics exam, outperforming or closely rivaling models like GPT-Neo 2.7B and even some larger models on specific tasks. This shows that a domain-specific focus during training can yield models with competitive task performance that are also more accessible and practical for specialized applications.
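One standard recipe for evaluating a causal language model on multiple-choice questions is to score each answer option by its log-likelihood under the model and pick the highest-scoring one. The sketch below shows that recipe; it is not necessarily the exact fine-tuning or evaluation procedure behind the reported BioMedLM scores, and the example question is illustrative.

```python
# Sketch: multiple-choice scoring by option log-likelihood. One common
# evaluation recipe, not the paper's exact procedure.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "stanford-crfm/BioMedLM"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)
model.eval()

def option_logprob(question: str, option: str) -> float:
    """Sum of token log-probs of `option` conditioned on `question`."""
    prompt_ids = tokenizer(question, return_tensors="pt").input_ids
    option_ids = tokenizer(" " + option, return_tensors="pt").input_ids
    input_ids = torch.cat([prompt_ids, option_ids], dim=1)
    with torch.no_grad():
        logits = model(input_ids).logits
    # logits[:, i] predicts token i+1, so shift by one position.
    log_probs = torch.log_softmax(logits[0, :-1], dim=-1)
    start = prompt_ids.shape[1] - 1
    targets = input_ids[0, prompt_ids.shape[1]:]
    return log_probs[start : start + targets.shape[0]].gather(
        1, targets.unsqueeze(1)
    ).sum().item()

question = "Which vitamin deficiency causes scurvy? Answer:"
options = ["Vitamin A", "Vitamin B12", "Vitamin C", "Vitamin D"]
print(max(options, key=lambda o: option_logprob(question, o)))
```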
The study underscores the capability of smaller, domain-focused models to meet or exceed the performance of larger, generalist models on specific tasks. BioMedLM's approach addresses several critical concerns in deploying NLP technologies in sensitive areas like healthcare: data privacy, since a 2.7 billion parameter model can run locally without sending patient data over the internet; transparency, since its training data (PubMed) is known; cost, since fine-tuning and serving a medium-sized model is far cheaper; and environmental impact, since smaller models require substantially less compute.
Looking ahead, this work opens several avenues for future research, including the exploration of training techniques that further optimize performance and efficiency for domain-specific models. Additionally, extending the methodology to other specialized fields could yield similarly effective models across a broader range of disciplines.
BioMedLM exemplifies the potential of medium-sized, domain-focused models to achieve high performance on specialized tasks, challenging the prevailing assumption that larger models always perform better. By balancing efficiency with capability, BioMedLM represents a significant step forward in making advanced NLP technology more accessible, transparent, and sustainable, particularly in critical fields such as biomedicine.