Apollo: A Lightweight Multilingual Medical LLM towards Democratizing Medical AI to 6B People (2403.03640v6)
Abstract: Despite the vast repository of global medical knowledge being predominantly in English, local languages are crucial for delivering tailored healthcare services, particularly in areas with limited medical resources. To extend the reach of medical AI advancements to a broader population, we aim to develop medical LLMs across the six most widely spoken languages, which together cover a global population of 6.1 billion. This effort culminates in the ApolloCorpora multilingual medical dataset and the XMedBench benchmark. On the multilingual medical benchmark, the released Apollo models, at a range of relatively small sizes (0.5B, 1.8B, 2B, 6B, and 7B), achieve the best performance among models of equivalent size. Notably, Apollo-7B is the state-of-the-art multilingual medical LLM among models up to 70B parameters. Additionally, these lightweight models can improve the multilingual medical capabilities of larger models without fine-tuning them, in a proxy-tuning fashion. We will open-source the training corpora, code, model weights, and evaluation benchmark.
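The proxy-tuning mentioned above operates at decoding time: the logits of a large, untuned model are shifted by the difference between a small domain-tuned expert and its untuned base (Liu et al., 2024, arXiv:2401.08565). Below is a minimal sketch of that logit arithmetic, not the paper's released implementation; the function name `proxy_tuned_logits`, the scaling factor `alpha`, and the random tensors standing in for real model outputs are illustrative assumptions.

```python
import torch

def proxy_tuned_logits(base_large, expert_small, antiexpert_small, alpha=1.0):
    """Proxy-tuning logit arithmetic (Liu et al., 2024):
    steer a large untuned model with the difference between a small
    tuned expert and its untuned counterpart, computed over a shared
    vocabulary. `alpha` scales the strength of the steering signal."""
    return base_large + alpha * (expert_small - antiexpert_small)

# Toy usage with random logits standing in for real model outputs
# (shape: batch=1, vocab=32000; all three models must share a tokenizer
# so that positions in the logit vectors refer to the same tokens).
vocab = 32000
base   = torch.randn(1, vocab)   # e.g. a large untuned general model
expert = torch.randn(1, vocab)   # e.g. a medically tuned small model (Apollo-style)
anti   = torch.randn(1, vocab)   # e.g. the untuned base of that small model

next_token = proxy_tuned_logits(base, expert, anti).softmax(-1).argmax(-1)
print(next_token)
```

The element-wise arithmetic is only meaningful when all three models share a vocabulary, which is why proxy-tuning pairs a small tuned expert with a larger model from a tokenizer-compatible family.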