Towards Conversational Diagnostic AI (2401.05654v1)
Abstract: At the heart of medicine lies the physician-patient dialogue, where skillful history-taking paves the way for accurate diagnosis, effective management, and enduring trust. AI systems capable of diagnostic dialogue could increase the accessibility, consistency, and quality of care. However, approximating clinicians' expertise is an outstanding grand challenge. Here, we introduce AMIE (Articulate Medical Intelligence Explorer), an LLM-based AI system optimized for diagnostic dialogue. AMIE uses a novel self-play-based simulated environment with automated feedback mechanisms for scaling learning across diverse disease conditions, specialties, and contexts. We designed a framework for evaluating clinically meaningful axes of performance, including history-taking, diagnostic accuracy, management reasoning, communication skills, and empathy. We compared AMIE's performance to that of primary care physicians (PCPs) in a randomized, double-blind crossover study of text-based consultations with validated patient actors in the style of an Objective Structured Clinical Examination (OSCE). The study included 149 case scenarios from clinical providers in Canada, the UK, and India, 20 PCPs for comparison with AMIE, and evaluations by specialist physicians and patient actors. AMIE demonstrated greater diagnostic accuracy and superior performance on 28 of 32 axes according to specialist physicians and on 24 of 26 axes according to patient actors. Our research has several limitations and should be interpreted with appropriate caution. Clinicians were limited to unfamiliar synchronous text chat, which permits large-scale LLM-patient interactions but is not representative of usual clinical practice. While further research is required before AMIE could be translated to real-world settings, the results represent a milestone towards conversational diagnostic AI.
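The abstract describes a self-play-based simulated environment with automated feedback for scaling learning across conditions, specialties, and contexts, but gives no implementation details. The sketch below is a minimal, hypothetical outline of how such a loop could be organized; all names (`DoctorAgent`, `PatientAgent`, `Critic`, `simulate_dialogue`, `self_play_round`, `fine_tune`) are illustrative assumptions, not the paper's actual method.

```python
# Illustrative sketch only: a possible structure for a self-play dialogue
# environment with automated (critic-based) feedback. All agent classes and
# method names are hypothetical stand-ins, not AMIE's implementation.

from dataclasses import dataclass

@dataclass
class Scenario:
    condition: str   # disease condition drawn from a broad distribution
    specialty: str   # e.g. cardiology, respiratory, primary care
    context: str     # e.g. patient demographics, care setting

def simulate_dialogue(doctor, patient, scenario, max_turns=20):
    """Role-play a text-based consultation between a doctor agent and a patient agent."""
    transcript = []
    for _ in range(max_turns):
        question = doctor.ask(transcript, scenario)
        answer = patient.reply(transcript + [question], scenario)
        transcript += [question, answer]
        if doctor.is_done(transcript):
            break
    return transcript

def self_play_round(doctor, patient, critic, scenarios):
    """One round of simulated consultations followed by automated feedback."""
    training_examples = []
    for scenario in scenarios:
        transcript = simulate_dialogue(doctor, patient, scenario)
        # The critic scores clinically meaningful axes (history-taking,
        # diagnostic accuracy, communication, empathy) and suggests revisions.
        feedback = critic.critique(transcript, scenario)
        training_examples.append((transcript, feedback))
    # Refine the doctor agent on critiqued dialogues before the next round.
    doctor.fine_tune(training_examples)
    return doctor
```

In this kind of loop, scaling comes from sampling many scenarios per round rather than collecting human-labeled dialogues; the critic substitutes for expensive expert annotation between rounds.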