Towards Building Multilingual Language Model for Medicine (2402.13963v4)

Published 21 Feb 2024 in cs.CL

Abstract: The development of open-source, multilingual medical LLMs can benefit a wide, linguistically diverse audience from different regions. To promote this domain, we present contributions from the following: First, we construct a multilingual medical corpus, containing approximately 25.5B tokens encompassing 6 main languages, termed as MMedC, enabling auto-regressive domain adaptation for general LLMs; Second, to monitor the development of multilingual medical LLMs, we propose a multilingual medical multi-choice question-answering benchmark with rationale, termed as MMedBench; Third, we have assessed a number of open-source LLMs on our benchmark, along with those further auto-regressive trained on MMedC. Our final model, MMed-Llama 3, with only 8B parameters, achieves superior performance compared to all other open-source models on both MMedBench and English benchmarks, even rivaling GPT-4. In conclusion, in this work, we present a large-scale corpus, a benchmark and a series of models to support the development of multilingual medical LLMs.

Summary

The paper introduces MMedLM 2, demonstrating that multilingual training with a 25.5 billion token medical corpus significantly boosts medical QA accuracy.
It constructs MMedC and MMedBench to rigorously benchmark multilingual medical understanding across six languages using detailed rationales and human verification.
The findings imply that advanced multilingual models can bridge language barriers in healthcare, rivaling state-of-the-art systems like GPT-4.

Towards a Multilingual LLM for the Medical Domain

Introduction

The development of LLMs has significantly advanced NLP applications in the medical domain. However, most LLMs focus on English, which has limited their use in linguistically diverse regions. The paper presents MMedC, a large-scale multilingual medical corpus, and MMedBench, a benchmark for evaluating LLMs on medical question answering across six primary languages. It also introduces MMedLM 2, a model that is further trained on MMedC and rivals GPT-4 in multilingual medical contexts.

Dataset Construction and Metrics

MMedC: A Multilingual Medical Corpus

MMedC comprises approximately 25.5 billion tokens spanning six languages. It draws on three kinds of sources:

Filtering medical content from a large-scale multilingual corpus
Including texts from medical textbooks and reputable medical websites
Incorporating existing medical corpora

Together, these sources provide training data for models intended to work across languages within the medical domain.
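The first source, filtering medical content out of a general multilingual corpus, can be illustrated with a minimal sketch. It assumes a simple keyword heuristic: a document is kept when it contains enough distinct medical terms and those terms make up a large enough share of its tokens. The keyword list, the thresholds, and the function names are hypothetical and chosen for illustration only; the summary does not specify the actual filtering rules.

```python
import re
from typing import Iterable, List, Set

# Hypothetical keyword list; a real pipeline would need a curated
# vocabulary for each of the six languages.
MEDICAL_KEYWORDS: Set[str] = {
    "diagnosis", "patient", "therapy", "symptom", "dose", "clinical",
}


def tokenize(text: str) -> List[str]:
    """Lowercase and split on non-word characters.

    This only works for space-delimited languages; languages such as
    Chinese or Japanese would need a proper word segmenter.
    """
    return [tok for tok in re.split(r"\W+", text.lower()) if tok]


def is_medical(
    text: str,
    keywords: Set[str] = MEDICAL_KEYWORDS,
    min_unique: int = 2,        # assumed threshold, not from the paper
    min_density: float = 0.05,  # assumed threshold, not from the paper
) -> bool:
    """Return True if the text passes both keyword thresholds."""
    tokens = tokenize(text)
    if not tokens:
        return False
    hits = [tok for tok in tokens if tok in keywords]
    unique_hits = len(set(hits))
    density = len(hits) / len(tokens)
    return unique_hits >= min_unique and density >= min_density


def filter_medical(docs: Iterable[str]) -> List[str]:
    """Keep only the documents that look medical under the heuristic."""
    return [doc for doc in docs if is_medical(doc)]
```

Requiring several distinct keywords as well as a minimum density guards against documents that merely repeat one medical word, at the cost of dropping very short but relevant passages.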

MMedBench: Benchmarking Multilingual Medical Understanding

MMedBench provides a comprehensive evaluation tool by aggregating existing medical question-answering datasets across languages and supplementing them with rationales, so that models can be assessed on their reasoning as well as on their answer choice. Standard QA pairs are augmented with detailed rationales generated by GPT-4, which are then verified by humans for quality and correctness.
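The augmentation step described above can be sketched as a small data structure plus a prompt builder. This is an illustrative layout under stated assumptions, not the benchmark's actual schema: the `QAItem` fields, the prompt wording, and the `attach_rationale` helper are all hypothetical, and the call to GPT-4 itself is left out.

```python
from dataclasses import dataclass, replace
from typing import Dict, Optional


@dataclass
class QAItem:
    """One multiple-choice question (hypothetical schema)."""
    language: str
    question: str
    options: Dict[str, str]          # option label -> option text
    answer: str                      # label of the correct option
    rationale: Optional[str] = None  # filled in by the augmentation step
    verified: bool = False           # set after human review

    def __post_init__(self) -> None:
        if self.answer not in self.options:
            raise ValueError(f"answer {self.answer!r} is not an option label")


def build_rationale_prompt(item: QAItem) -> str:
    """Build a prompt asking a model to explain the known correct answer."""
    options = "\n".join(
        f"{label}. {text}" for label, text in sorted(item.options.items())
    )
    return (
        f"Question ({item.language}): {item.question}\n"
        f"Options:\n{options}\n"
        f"Correct answer: {item.answer}\n"
        "Explain step by step why this answer is correct."
    )


def attach_rationale(item: QAItem, rationale: str, verified: bool) -> QAItem:
    """Return a copy of the item with the rationale and review flag set."""
    cleaned = rationale.strip()
    if not cleaned:
        raise ValueError("rationale must not be empty")
    return replace(item, rationale=cleaned, verified=verified)
```

Keeping `verified` as a separate flag makes it easy to exclude rationales that have not yet passed human review when assembling an evaluation split.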

Model Evaluation and Insights

Evaluating models on MMedBench yielded several findings. As expected, models further trained on MMedC outperformed the other open-source models across various metrics under zero-shot, parameter-efficient fine-tuning (PEFT), and full fine-tuning settings. Notably, MMedLM 2 showed strong multilingual medical question answering and rationale generation, with performance close to that of GPT-4.
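For the multiple-choice part of such an evaluation, a minimal scoring sketch is shown below. It assumes each prediction is a single option label and that the overall score is the unweighted mean of per-language accuracies; both the record format and the averaging choice are assumptions made for illustration, and rationale quality would need separate text-similarity or human scoring that is not shown here.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple

# Each record is (language, predicted_label, gold_label); hypothetical format.
Record = Tuple[str, str, str]


def accuracy_by_language(records: Iterable[Record]) -> Dict[str, float]:
    """Compute the fraction of correct answers for each language."""
    correct: Dict[str, int] = defaultdict(int)
    total: Dict[str, int] = defaultdict(int)
    for language, predicted, gold in records:
        total[language] += 1
        # Compare labels case-insensitively and ignore stray whitespace.
        if predicted.strip().upper() == gold.strip().upper():
            correct[language] += 1
    return {language: correct[language] / total[language] for language in total}


def macro_average(scores: Dict[str, float]) -> float:
    """Average the per-language scores, weighting every language equally."""
    if not scores:
        raise ValueError("no scores to average")
    return sum(scores.values()) / len(scores)
```

Averaging per language rather than per question keeps a language with many test items from dominating the overall number.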

Theoretical and Practical Implications

Enhancing Multilingual Medical AI Research

MMedC and MMedBench support research on general medical artificial intelligence (GMAI) and retrieval-augmented generation, and they facilitate the development of LLMs that are robust across languages and integrate comprehensive medical knowledge.

Broader Clinical and Educational Outreach

In practice, such models could reduce language barriers in healthcare, be tailored to recognize cultural nuances, and broaden access to medical education globally. This opens avenues for deploying LLMs in diverse medical settings and supports more equitable access to quality healthcare information.

Future Directions and Challenges

The paper acknowledges limitations, including the limited language coverage of the corpus and the limited scale of the final model. Future work aims to extend language coverage, scale up model architectures, and refine the model to mitigate hallucination. The continued development of MMedC and MMedBench is intended to support LLMs that are both linguistically inclusive and grounded in medical knowledge.

Data and Resources Availability

To support transparency and further research, the authors have made the datasets, codebase, and trained models publicly accessible. The aim is to encourage collaboration and give others the resources needed to extend multilingual medical natural language processing.
