BioMistral: A Collection of Open-Source Pretrained Large Language Models for Medical Domains (2402.10373v3)
Abstract: Large language models (LLMs) have demonstrated remarkable versatility in recent years, offering potential applications across specialized domains such as healthcare and medicine. Despite the availability of various open-source LLMs tailored for health contexts, adapting general-purpose LLMs to the medical domain presents significant challenges. In this paper, we introduce BioMistral, an open-source LLM tailored for the biomedical domain, utilizing Mistral as its foundation model and further pre-trained on PubMed Central. We conduct a comprehensive evaluation of BioMistral on a benchmark comprising 10 established medical question-answering (QA) tasks in English. We also explore lightweight models obtained through quantization and model merging approaches. Our results demonstrate BioMistral's superior performance compared to existing open-source medical models and its competitive edge against proprietary counterparts. Finally, to address the limited availability of data beyond English and to assess the multilingual generalization of medical LLMs, we automatically translated this benchmark into 7 other languages and evaluated the models on it. This marks the first large-scale multilingual evaluation of LLMs in the medical domain. Datasets, multilingual evaluation benchmarks, scripts, and all the models obtained during our experiments are freely released.
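Since the abstract notes that the models are released openly and that quantized variants were evaluated, below is a minimal sketch of how one of the released checkpoints might be loaded and queried on a medical multiple-choice question with the Hugging Face transformers library. The repository id `BioMistral/BioMistral-7B`, the prompt format, and the example question are illustrative assumptions, not details taken from the abstract.

```python
# Minimal sketch: load a released BioMistral checkpoint and ask a medical
# multiple-choice question. The repo id and prompt format are assumptions
# for illustration, not taken from the paper itself.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_ID = "BioMistral/BioMistral-7B"  # assumed Hugging Face repository id

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID,
    torch_dtype=torch.float16,  # half precision; 4-/8-bit quantization (e.g. bitsandbytes) is another option
    device_map="auto",          # requires the accelerate package
)

prompt = (
    "Question: Which vitamin deficiency causes scurvy?\n"
    "Options: (A) Vitamin A (B) Vitamin B12 (C) Vitamin C (D) Vitamin D\n"
    "Answer:"
)
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=8, do_sample=False)

# Print only the newly generated tokens, i.e. the model's answer.
print(tokenizer.decode(output[0][inputs["input_ids"].shape[-1]:], skip_special_tokens=True))
```

A similar setup, with the benchmark prompts translated into the target languages, would reproduce the kind of multilingual QA evaluation the abstract describes; the exact prompting and scoring protocol is defined in the paper, not here.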