Towards Building Multilingual Language Model for Medicine (2402.13963v4)
Abstract: Open-source, multilingual medical LLMs can benefit a wide, linguistically diverse audience across different regions. To advance this area, we make three contributions. First, we construct MMedC, a multilingual medical corpus of approximately 25.5B tokens covering 6 main languages, which enables auto-regressive domain adaptation of general LLMs. Second, to monitor the development of multilingual medical LLMs, we propose MMedBench, a multilingual medical multiple-choice question-answering benchmark with rationales. Third, we assess a number of open-source LLMs on our benchmark, along with versions of those models further trained auto-regressively on MMedC. Our final model, MMed-Llama 3, with only 8B parameters, outperforms all other open-source models on both MMedBench and English benchmarks, even rivaling GPT-4. In summary, we present a large-scale corpus, a benchmark, and a series of models to support the development of multilingual medical LLMs.