Medical mT5: An Open-Source Multilingual Text-to-Text LLM for the Medical Domain (2404.07613v1)
Abstract: Research on language technology for the development of medical applications is currently a hot topic in Natural Language Understanding and Generation. As a result, a number of LLMs have recently been adapted to the medical domain, so that they can be used as tools for mediating human-AI interaction. While these LLMs display competitive performance on automated medical text benchmarks, they have been pre-trained and evaluated with a focus on a single language (mostly English). This is particularly true of text-to-text models, which typically require large amounts of domain-specific pre-training data that is often not easily accessible for many languages. In this paper, we address these shortcomings by compiling, to the best of our knowledge, the largest multilingual corpus for the medical domain in four languages: English, French, Italian and Spanish. This new corpus has been used to train Medical mT5, the first open-source text-to-text multilingual model for the medical domain. Additionally, we present two new evaluation benchmarks for all four languages with the aim of facilitating multilingual research in this domain. A comprehensive evaluation shows that Medical mT5 outperforms both encoders and similarly sized text-to-text models on the Spanish, French, and Italian benchmarks, while remaining competitive with current state-of-the-art LLMs in English.
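Since Medical mT5 is released as an open-source text-to-text checkpoint, a natural way to try it is through the Hugging Face transformers library. The sketch below is a minimal illustration, not the authors' evaluation pipeline; the checkpoint ID "HiTZ/Medical-mT5-large" is an assumption about where the weights are hosted, and as the base model is pre-trained rather than instruction-tuned, downstream tasks such as sequence labeling generally require fine-tuning first.

```python
# Minimal sketch: loading a text-to-text (mT5-style) checkpoint and generating
# output with Hugging Face transformers. The model ID below is an assumption;
# replace it with the actual released checkpoint.
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

model_id = "HiTZ/Medical-mT5-large"  # assumed hub location of the released weights

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForSeq2SeqLM.from_pretrained(model_id)

# In the text-to-text framing, the task is encoded in the input string and the
# answer is decoded as plain text (e.g., labeled spans for NER after fine-tuning).
text = "El paciente presenta fiebre alta y cefalea intensa."
inputs = tokenizer(text, return_tensors="pt")
output_ids = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```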
Authors: Iker García-Ferrero, Rodrigo Agerri, Aitziber Atutxa Salazar, Elena Cabrio, Iker de la Iglesia, Alberto Lavelli, Bernardo Magnini, Benjamin Molinet, Johana Ramirez-Romero, German Rigau, Jose Maria Villa-Gonzalez, Serena Villata, Andrea Zaninello