Capabilities of Gemini Models in Medicine (2404.18416v2)
Abstract: Excellence in a wide variety of medical applications poses considerable challenges for AI, requiring advanced reasoning, access to up-to-date medical knowledge and understanding of complex multimodal data. Gemini models, with strong general capabilities in multimodal and long-context reasoning, offer exciting possibilities in medicine. Building on these core strengths of Gemini, we introduce Med-Gemini, a family of highly capable multimodal models that are specialized in medicine with the ability to seamlessly use web search, and that can be efficiently tailored to novel modalities using custom encoders. We evaluate Med-Gemini on 14 medical benchmarks, establishing new state-of-the-art (SoTA) performance on 10 of them, and surpass the GPT-4 model family on every benchmark where a direct comparison is viable, often by a wide margin. On the popular MedQA (USMLE) benchmark, our best-performing Med-Gemini model achieves SoTA performance of 91.1% accuracy, using a novel uncertainty-guided search strategy. On 7 multimodal benchmarks including NEJM Image Challenges and MMMU (health & medicine), Med-Gemini improves over GPT-4V by an average relative margin of 44.5%. We demonstrate the effectiveness of Med-Gemini's long-context capabilities through SoTA performance on a needle-in-a-haystack retrieval task from long de-identified health records and medical video question answering, surpassing prior bespoke methods using only in-context learning. Finally, Med-Gemini's performance suggests real-world utility by surpassing human experts on tasks such as medical text summarization, alongside demonstrations of promising potential for multimodal medical dialogue, medical research and education. Taken together, our results offer compelling evidence for Med-Gemini's potential, although further rigorous evaluation will be crucial before real-world deployment in this safety-critical domain.