BioMedLM: A 2.7B Parameter Language Model Trained On Biomedical Text (2403.18421v1)
Abstract: Models such as GPT-4 and Med-PaLM 2 have demonstrated impressive performance on a wide variety of biomedical NLP tasks. However, these models have hundreds of billions of parameters, are computationally expensive to run, require users to send their input data over the internet, and are trained on unknown data sources. Can smaller, more targeted models compete? To address this question, we build and release BioMedLM, a 2.7 billion parameter GPT-style autoregressive model trained exclusively on PubMed abstracts and full articles. When fine-tuned, BioMedLM produces strong multiple-choice biomedical question-answering results competitive with much larger models, achieving 57.3% on MedMCQA (dev) and 69.0% on the MMLU Medical Genetics exam. BioMedLM can also be fine-tuned to produce useful answers to patient questions on medical topics. This demonstrates that smaller models can potentially serve as transparent, privacy-preserving, economical, and environmentally friendly foundations for particular NLP applications, such as in biomedicine. The model is available on the Hugging Face Hub: https://huggingface.co/stanford-crfm/BioMedLM.
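Since the checkpoint is public, the quickest way to inspect its behavior is through the transformers library. The snippet below is a minimal sketch, assuming the stanford-crfm/BioMedLM checkpoint loads with the standard causal-LM auto classes; the prompt and decoding settings are illustrative, not taken from the paper.

```python
# Minimal sketch: load BioMedLM from the Hugging Face Hub and sample a
# continuation. Assumes the checkpoint works with the standard causal-LM
# auto classes; the prompt and decoding settings are illustrative only.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("stanford-crfm/BioMedLM")
model = AutoModelForCausalLM.from_pretrained(
    "stanford-crfm/BioMedLM",
    torch_dtype=torch.float16,  # at 2.7B parameters, fp16 fits on one ~16 GB GPU
    device_map="auto",          # requires the accelerate package
)

prompt = "Metformin lowers blood glucose by"  # hypothetical prompt
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=50, do_sample=False)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

Note that the raw pretrained model only continues text; the question-answering scores quoted in the abstract come from supervised fine-tuning on each benchmark.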
- The promise of large language models in health care. The Lancet, 401(10377):641, 2023. doi: 10.1016/s0140-6736(23)00216-7.
- SciBERT: A pretrained language model for scientific text, 2019. URL https://arxiv.org/abs/1903.10676.
- On the summarization of consumer health questions. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 2228–2234, Florence, Italy, July 2019. Association for Computational Linguistics. doi: 10.18653/v1/P19-1215. URL https://aclanthology.org/P19-1215.
- GPT-Neo: Large Scale Autoregressive Language Modeling with Mesh-Tensorflow, March 2021. URL https://doi.org/10.5281/zenodo.5297715.
- GPT-NeoX-20B: An open-source autoregressive language model, 2022. URL https://arxiv.org/abs/2204.06745.
- Language models are few-shot learners, 2020. URL https://arxiv.org/abs/2005.14165.
- MedBLIP: Bootstrapping language-image pre-training from 3D medical images and texts, 2023. URL https://arxiv.org/abs/2305.10799.
- PaLM: Scaling language modeling with pathways, 2022. URL https://arxiv.org/abs/2204.02311.
- Understanding accountability in algorithmic supply chains. In Proceedings of the 2023 ACM Conference on Fairness, Accountability, and Transparency, FAccT ’23, page 1186–1197, New York, NY, USA, 2023. Association for Computing Machinery. ISBN 9798400701924. doi: 10.1145/3593013.3594073. URL https://doi.org/10.1145/3593013.3594073.
- FlashAttention: Fast and memory-efficient exact attention with IO-awareness, 2022. URL https://arxiv.org/abs/2205.14135.
- Harm De Vries. Go smol or go home, 2023. URL https://www.harmdevries.com/post/model-size-vs-compute-overhead/.
- Informed named entity recognition decoding for generative language models, 2023. URL https://arxiv.org/abs/2308.07791.
- Summarization of clinical information: A conceptual model. Journal of Biomedical Informatics, 44(4):688–699, 2011. doi: 10.1016/j.jbi.2011.03.008.
- The Pile: An 800GB dataset of diverse text for language modeling, 2020. URL https://arxiv.org/abs/2101.00027.
- News summarization and evaluation in the era of GPT-3, 2023. URL https://arxiv.org/abs/2209.12356.
- OLMo: Accelerating the science of language models, 2024. URL https://arxiv.org/abs/2402.00838.
- Domain-specific language model pretraining for biomedical natural language processing. ACM Transactions on Computing for Healthcare, 3(1):1–23, 2021. doi: 10.1145/3458754.
- Measuring massive multitask language understanding, 2021. URL https://arxiv.org/abs/2009.03300.
- Hugging Face. huggingface/tokenizers: Fast state-of-the-art tokenizers optimized for research and production, 2019. URL https://github.com/huggingface/tokenizers.
- What disease does this patient have? A large-scale open domain question answering dataset from medical exams. Applied Sciences, 11(14):6421, Jul 2021. ISSN 2076-3417. doi: 10.3390/app11146421. URL https://doi.org/10.3390/app11146421.
- PubMedQA: A dataset for biomedical research question answering, 2019. URL https://arxiv.org/abs/1909.06146.
- GeneGPT: Augmenting large language models with domain tools for improved access to biomedical information, 2023. URL https://arxiv.org/abs/2304.09667.
- On the societal impact of open foundation models, 2024. URL https://crfm.stanford.edu/open-fms/paper.pdf.
- Mistral — a journey towards reproducible language model training, 2021. URL https://crfm.stanford.edu/2021/08/26/mistral.html.
- Leveraging pre-trained language models for mining microbiome-disease relationships. BMC Bioinformatics, 24(290), 2023. doi: 10.1186/s12859-023-05411-z.
- Dense passage retrieval for open-domain question answering, 2020. URL https://arxiv.org/abs/2004.04906.
- Performance of ChatGPT on USMLE: Potential for AI-assisted medical education using large language models. PLOS Digital Health, 2(2):e0000198, 2023. doi: 10.1371/journal.pdig.0000198.
- BioBERT: a pre-trained biomedical language representation model for biomedical text mining. Bioinformatics, 36(4):1234–1240, Sep 2019. doi: 10.1093/bioinformatics/btz682. URL https://doi.org/10.1093/bioinformatics/btz682.
- Summary of ChatGPT/GPT-4 research and perspective towards the future of large language models, 2023. URL https://arxiv.org/abs/2304.01852.
- Decoupled weight decay regularization, 2019. URL https://arxiv.org/abs/1711.05101.
- Analyzing leakage of personally identifiable information in language models, 2023. URL https://arxiv.org/abs/2302.00539.
- BioGPT: Generative pre-trained transformer for biomedical text generation and mining. Briefings in Bioinformatics, 23(6), 2022. doi: 10.1093/bib/bbac409.
- AI chatbots, health privacy, and challenges to HIPAA compliance. JAMA, 330(4):309, 2023. doi: 10.1001/jama.2023.9458.
- The imperative for regulatory oversight of large language models (or generative AI) in healthcare. npj Digital Medicine, 6(1), 2023. doi: 10.1038/s41746-023-00873-0.
- MosaicML. Composer, 2021. URL https://github.com/mosaicml/composer/.
- MedKnowts: Unified documentation and information retrieval for electronic health records. In The 34th Annual ACM Symposium on User Interface Software and Technology. ACM, October 2021. doi: 10.1145/3472749.3474814. URL https://doi.org/10.1145/3472749.3474814.
- Capabilities of GPT-4 on medical challenge problems, 2023a. URL https://arxiv.org/abs/2303.13375.
- Can generalist foundation models outcompete special-purpose tuning? Case study in medicine, 2023b. URL https://arxiv.org/abs/2311.16452.
- Training language models to follow instructions with human feedback, 2022. URL https://arxiv.org/abs/2203.02155.
- MedMCQA: A large-scale multi-subject multi-choice dataset for medical domain question answering, 2022. URL https://arxiv.org/abs/2203.14371.
- Comparative performance evaluation of large language models for extracting molecular interactions and pathway knowledge, 2023. URL https://arxiv.org/abs/2307.08813.
- PyTorch: An imperative style, high-performance deep learning library, 2019. URL https://arxiv.org/abs/1912.01703.
- Carbon emissions and large neural network training, 2021. URL https://arxiv.org/abs/2104.10350.
- Language models are unsupervised multitask learners, 2019. URL https://api.semanticscholar.org/CorpusID:160025533.
- Efficient domain adaptation of language models via adaptive tokenization, 2021. URL https://arxiv.org/abs/2109.07460.
- Neural machine translation of rare words with subword units. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1715–1725, Berlin, Germany, August 2016. Association for Computational Linguistics. doi: 10.18653/v1/P16-1162. URL https://aclanthology.org/P16-1162.
- Compute trends across three eras of machine learning, 2022. URL https://arxiv.org/abs/2202.05924.
- Creation and adoption of large language models in medicine. JAMA, 330(9):866, 2023. doi: 10.1001/jama.2023.14217.
- The cost of training NLP models: A concise overview, 2020. URL https://arxiv.org/abs/2004.08900.
- Large language models encode clinical knowledge. Nature, 620(7972):172–180, 2023a. doi: 10.1038/s41586-023-06291-2.
- Towards expert-level medical question answering with large language models, 2023b. URL https://arxiv.org/abs/2305.09617.
- Dolma: An open corpus of three trillion tokens for language model pretraining research, 2024. URL https://arxiv.org/abs/2402.00159.
- Galactica: A large language model for science, 2022. URL https://arxiv.org/abs/2211.09085.
- Large language models in medicine. Nature Medicine, 29(8):1930–1940, 2023. doi: 10.1038/s41591-023-02448-8.
- Opportunities and challenges for ChatGPT and large language models in biomedicine and health, 2023. URL https://arxiv.org/abs/2306.10070.
- Together. Releasing 3B and 7B RedPajama-INCITE family of models including base, instruction-tuned & chat models, May 2023a. URL https://www.together.ai/blog/redpajama-models-v1.
- Together. RedPajama: An open dataset for training large language models, October 2023b. URL https://github.com/togethercomputer/RedPajama-Data.
- LLaMA: Open and efficient foundation language models, 2023. URL https://arxiv.org/abs/2302.13971.
- An overview of the BioASQ large-scale biomedical semantic indexing and question answering competition. BMC Bioinformatics, 16(1), 2015. doi: 10.1186/s12859-015-0564-6.
- Open-ended medical visual question answering through prefix tuning of language models, 2023. URL https://arxiv.org/abs/2303.05977.
- Attention is all you need, 2017. URL https://arxiv.org/abs/1706.03762.
- GPT-J-6B: A 6 Billion Parameter Autoregressive Language Model, May 2021. URL https://github.com/kingoflolz/mesh-transformer-jax.
- A systematic review of automatic text summarization for biomedical literature and EHRs. Journal of the American Medical Informatics Association, 28(10):2287–2297, 2021. doi: 10.1093/jamia/ocab143.
- BFloat16: The secret to high performance on Cloud TPUs, 2019. URL https://cloud.google.com/blog/products/ai-machine-learning/bfloat16-the-secret-to-high-performance-on-cloud-tpus.
- Zuoxi Yang. Biomedical information retrieval incorporating knowledge graph for explainable precision medicine. In Proceedings of the 43rd International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR ’20, page 2486, New York, NY, USA, 2020. Association for Computing Machinery. ISBN 9781450380164. doi: 10.1145/3397271.3401458. URL https://doi.org/10.1145/3397271.3401458.
- Deep bidirectional language-knowledge graph pretraining, 2022a. URL https://arxiv.org/abs/2210.09338.
- LinkBERT: Pretraining language models with document links, 2022b. URL https://arxiv.org/abs/2203.15827.
- Appraising the potential uses and harms of LLMs for medical systematic reviews, 2023. URL https://arxiv.org/abs/2305.11828.
- Benchmarking large language models for news summarization, 2023. URL https://arxiv.org/abs/2301.13848.
- Learning to summarize radiology findings, 2018. URL https://arxiv.org/abs/1809.04698.
- A survey of large language models, 2023. URL https://arxiv.org/abs/2303.18223.
- When does pretraining help? In Proceedings of the Eighteenth International Conference on Artificial Intelligence and Law, 2021. doi: 10.1145/3462757.3466088.
- Improving the transferability of clinical note section classification models with BERT and large language model ensembles. In Proceedings of the 5th Clinical Natural Language Processing Workshop, pages 125–130, Toronto, Canada, July 2023. Association for Computational Linguistics. doi: 10.18653/v1/2023.clinicalnlp-1.16. URL https://aclanthology.org/2023.clinicalnlp-1.16.
- Fine-tuning language models from human preferences, 2020. URL https://arxiv.org/abs/1909.08593.