Medical mT5: An Open-Source Multilingual Text-to-Text LLM for the Medical Domain (2404.07613v1)
Abstract: Research on language technology for the development of medical applications is currently a hot topic in Natural Language Understanding and Generation. As a result, a number of LLMs have recently been adapted to the medical domain, so that they can be used as tools for mediating human-AI interaction. While these LLMs display competitive performance on automated medical text benchmarks, they have been pre-trained and evaluated with a focus on a single language (mostly English). This is particularly true of text-to-text models, which typically require large amounts of domain-specific pre-training data that is often not easily accessible for many languages. In this paper, we address these shortcomings by compiling, to the best of our knowledge, the largest multilingual corpus for the medical domain in four languages: English, French, Italian and Spanish. This new corpus has been used to train Medical mT5, the first open-source text-to-text multilingual model for the medical domain. Additionally, we present two new evaluation benchmarks for all four languages with the aim of facilitating multilingual research in this domain. A comprehensive evaluation shows that Medical mT5 outperforms both encoders and similarly sized text-to-text models on the Spanish, French, and Italian benchmarks, while remaining competitive with current state-of-the-art LLMs in English.
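Since Medical mT5 is released as an open-source text-to-text checkpoint, a natural way to try it is through the Hugging Face transformers library. The sketch below is a minimal illustration, not the authors' evaluation pipeline; the checkpoint ID "HiTZ/Medical-mT5-large" is an assumption about where the weights are hosted, and as the base model is pre-trained rather than instruction-tuned, downstream tasks such as sequence labeling generally require fine-tuning first.

```python
# Minimal sketch: loading a text-to-text (mT5-style) checkpoint and generating
# output with Hugging Face transformers. The model ID below is an assumption;
# replace it with the actual released checkpoint.
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

model_id = "HiTZ/Medical-mT5-large"  # assumed hub location of the released weights

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForSeq2SeqLM.from_pretrained(model_id)

# In the text-to-text framing, the task is encoded in the input string and the
# answer is decoded as plain text (e.g., labeled spans for NER after fine-tuning).
text = "El paciente presenta fiebre alta y cefalea intensa."
inputs = tokenizer(text, return_tensors="pt")
output_ids = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```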
Authors: Iker García-Ferrero, Rodrigo Agerri, Aitziber Atutxa Salazar, Elena Cabrio, Iker de la Iglesia, Alberto Lavelli, Bernardo Magnini, Benjamin Molinet, Johana Ramirez-Romero, German Rigau, Jose Maria Villa-Gonzalez, Serena Villata, Andrea Zaninello