Benchmarking Large Language Models on Answering and Explaining Challenging Medical Questions (2402.18060v5)
Abstract: LLMs have demonstrated impressive performance in answering medical questions, such as achieving passing scores on medical licensing examinations. However, board exams and general clinical questions do not capture the complexity of realistic clinical cases. Moreover, the lack of reference explanations means we cannot easily evaluate the reasoning behind model decisions, a crucial component of supporting doctors in making complex medical decisions. To address these challenges, we construct two new datasets: JAMA Clinical Challenge and Medbullets (datasets and code are available at https://github.com/HanjieChen/ChallengeClinicalQA). JAMA Clinical Challenge consists of questions based on challenging clinical cases, while Medbullets comprises simulated clinical questions. Both datasets are structured as multiple-choice question-answering tasks and are accompanied by expert-written explanations. We evaluate seven LLMs on the two datasets using various prompts. Experiments demonstrate that our datasets are harder than previous benchmarks. In-depth automatic and human evaluations of model-generated explanations provide insights into the promise and deficiencies of LLMs for explainable medical QA.
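As a concrete illustration of the multiple-choice setup described in the abstract, the sketch below shows one plausible way to render a dataset example into a prompt that asks a model for both an answer letter and an explanation. This is a minimal sketch under assumptions: the field names (`question`, `options`, `answer`) and the sample case are hypothetical for illustration and are not taken from the released datasets or the paper's actual code.

```python
# Hypothetical sketch: format a multiple-choice clinical question as a
# plain-text prompt requesting an answer letter plus an explanation.
# The example schema ("question", "options", "answer") is an assumption,
# not the schema of the JAMA Clinical Challenge or Medbullets releases.

def format_mcqa_prompt(example: dict) -> str:
    """Render a multiple-choice clinical question as a plain-text prompt."""
    lines = [example["question"], ""]
    # Sort options by their letter label so they print as A, B, C, D.
    for label, option in sorted(example["options"].items()):
        lines.append(f"{label}. {option}")
    lines.append("")
    lines.append("Answer with the letter of the best option, then explain your reasoning.")
    return "\n".join(lines)

if __name__ == "__main__":
    example = {
        "question": (
            "A 54-year-old man presents with crushing substernal chest pain. "
            "What is the next best step?"
        ),
        "options": {
            "A": "Obtain an ECG",
            "B": "Discharge home",
            "C": "Order a chest CT",
            "D": "Start antibiotics",
        },
        "answer": "A",  # gold label; model output would be compared against this
    }
    print(format_mcqa_prompt(example))
```

In this framing, the model's answer letter can be scored automatically against the gold label, while its free-text explanation can be compared to the expert-written explanation, which matches the paper's combination of automatic and human evaluation of model-generated explanations.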