BioMedLM: A 2.7B Parameter Language Model Trained On Biomedical Text

(arXiv:2403.18421)
Published Mar 27, 2024 in cs.CL and cs.AI

Abstract

Models such as GPT-4 and Med-PaLM 2 have demonstrated impressive performance on a wide variety of biomedical NLP tasks. However, these models have hundreds of billions of parameters, are computationally expensive to run, require users to send their input data over the internet, and are trained on unknown data sources. Can smaller, more targeted models compete? To address this question, we build and release BioMedLM, a 2.7 billion parameter GPT-style autoregressive model trained exclusively on PubMed abstracts and full articles. When fine-tuned, BioMedLM can produce strong multiple-choice biomedical question-answering results competitive with much larger models, such as achieving a score of 57.3% on MedMCQA (dev) and 69.0% on the MMLU Medical Genetics exam. BioMedLM can also be fine-tuned to produce useful answers to patient questions on medical topics. This demonstrates that smaller models can potentially serve as transparent, privacy-preserving, economical and environmentally friendly foundations for particular NLP applications, such as in biomedicine. The model is available on the Hugging Face Hub: https://huggingface.co/stanford-crfm/BioMedLM.

Figure: Comparison of BioMedLM's and GPT-Neo's performance on biomedical language processing tasks.

Overview

  • BioMedLM is a 2.7 billion parameter model trained exclusively on PubMed abstracts and full articles for biomedical NLP tasks, achieving competitive performance.

  • Designed as a GPT-style autoregressive model and trained solely on PubMed data, BioMedLM prioritizes efficiency and domain specialization, and can be fine-tuned and run on modest hardware.

  • It achieves strong results on biomedical question-answering benchmarks, outperforming or closely rivaling larger, generalist models on specific tasks.

  • BioMedLM's approach addresses key issues in healthcare NLP applications, including data privacy, cost, and environmental impact.

BioMedLM: A Specialized Language Model for Biomedical NLP Tasks

Introduction

In recent years, language models such as GPT-4 and Med-PaLM 2 have significantly advanced NLP across various domains, including biomedicine. However, their vast size, proprietary nature, and resource-intensive demands pose serious practical limitations, especially for applications requiring data privacy, cost-effectiveness, and environmental sustainability. Addressing these challenges, the study introduces BioMedLM, a 2.7 billion parameter model trained specifically on PubMed abstracts and full articles. BioMedLM demonstrates performance competitive with significantly larger models on biomedical NLP tasks such as multiple-choice question answering and answering patient medical questions.

Model Design and Training

BioMedLM is a GPT-style autoregressive model with a domain-specific tokenizer trained to handle biomedical terminology efficiently. Unlike large-scale general models, BioMedLM is trained exclusively on PubMed data, targeting strong performance in biomedical contexts without the computational and financial overhead of larger models. Pretraining ran on 128 40GB NVIDIA A100 GPUs; while that initial investment is substantial, the resulting 2.7 billion parameter model can be fine-tuned and run on far more modest hardware.
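
The released checkpoint is on the Hugging Face Hub, so a quick way to get a feel for the model is to load it with the standard transformers Auto classes. The sketch below assumes transformers and torch are installed and that the checkpoint loads without custom code; the prompt and generation settings are illustrative, not the paper's evaluation setup.

```python
# Minimal sketch: loading BioMedLM from the Hugging Face Hub.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "stanford-crfm/BioMedLM"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

# The domain-specific tokenizer is meant to keep biomedical terms in fewer
# sub-word pieces than a general-purpose tokenizer would produce.
print(tokenizer.tokenize("myocardial infarction"))

# Greedy continuation of a biomedical prompt (illustrative only).
prompt = "Aspirin is commonly used to treat"
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=30, do_sample=False)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```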

Evaluation on Biomedical Tasks

BioMedLM's performance was evaluated across a suite of biomedical question-answering tasks including MedMCQA, MedQA, MMLU, PubMedQA, and BioASQ. Notably, the fine-tuned model achieved a score of 57.3% on MedMCQA (dev) and 69.0% on the MMLU Medical Genetics exam, outperforming or closely rivaling models such as GPT-Neo 2.7B and even some larger models on specific tasks. This shows that a domain-specific training focus can yield competitive task performance while keeping the model accessible and practical for specialized applications.
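
The paper's headline numbers come from fine-tuned models, but a common generic recipe for evaluating an autoregressive LM on multiple-choice QA is to score each candidate answer by the average log-probability its tokens receive when appended to the question, then pick the highest-scoring option. The sketch below illustrates that recipe under those assumptions; the example question is hypothetical and this is not the paper's exact evaluation pipeline.

```python
# Hedged sketch: multiple-choice scoring with an autoregressive LM by
# length-normalized log-likelihood of each answer option.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("stanford-crfm/BioMedLM")
model = AutoModelForCausalLM.from_pretrained("stanford-crfm/BioMedLM")
model.eval()

def option_logprob(question: str, option: str) -> float:
    """Average log-probability of the option tokens given the question."""
    prompt_ids = tokenizer(question, return_tensors="pt").input_ids
    full_ids = tokenizer(question + " " + option, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(full_ids).logits
    # Log-probs of each token given its prefix; keep only the option tokens.
    log_probs = torch.log_softmax(logits[0, :-1], dim=-1)
    targets = full_ids[0, 1:]
    token_lp = log_probs.gather(1, targets.unsqueeze(1)).squeeze(1)
    option_lp = token_lp[prompt_ids.shape[1] - 1:]
    return option_lp.mean().item()

# Hypothetical MedMCQA-style item, for illustration only.
question = "Which vitamin deficiency causes scurvy?"
options = ["Vitamin A", "Vitamin B12", "Vitamin C", "Vitamin D"]
scores = [option_logprob(question, o) for o in options]
print(options[max(range(len(options)), key=scores.__getitem__)])
```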

Practical Implications and Future Directions

The study underscores the ability of smaller, domain-focused models to meet or exceed the performance of larger, generalist models on specific tasks. BioMedLM's approach addresses several critical concerns in deploying NLP technologies in sensitive areas like healthcare:

  • Privacy and Security: Trained entirely on publicly available PubMed data and able to run on local hardware, BioMedLM offers a transparent and secure alternative to proprietary models that require sending data over the internet.
  • Cost and Accessibility: The training and inference efficiency of BioMedLM make it a feasible option for organizations with limited budgets, democratizing access to advanced NLP capabilities.
  • Environmental Impact: By demonstrating strong performance with significantly fewer parameters, BioMedLM is a more environmentally friendly option than training and operating larger models.

Looking ahead, this work opens several avenues for future research, including the exploration of training techniques that further optimize performance and efficiency for domain-specific models. Additionally, extending the methodology to other specialized fields could yield similarly effective models across a broader range of disciplines.

Conclusion

BioMedLM exemplifies the potential of medium-sized, domain-focused models to achieve high performance on specialized tasks, challenging the prevailing assumption that larger models always perform better. By balancing efficiency with capability, BioMedLM represents a significant step forward in making advanced NLP technology more accessible, transparent, and sustainable, particularly in critical fields such as biomedicine.
