Few shot chain-of-thought driven reasoning to prompt LLMs for open ended medical question answering (2403.04890v3)
Abstract: In this paper, we propose a modified version of the MedQA-USMLE dataset, named MEDQA-OPEN, which contains open-ended medical questions without options to mimic clinical scenarios, along with clinician-approved reasoned answers. Additionally, we implement CLINICR, a prompt driven by Chain-of-Thought (CoT) reasoning that mirrors the prospective process of incremental reasoning toward a correct response to medical questions. We empirically demonstrate that CLINICR outperforms the state-of-the-art 5-shot CoT-based prompt (Liévin et al., 2022). We also present an approach that mirrors real-life clinical practice by first exploring multiple differential diagnoses through MCQ-CLINICR and subsequently narrowing down to a final diagnosis using MCQ-ELIMINATIVE. Finally, emphasizing the importance of response verification in medical settings, we utilize a reward model mechanism, replacing the elimination process performed by MCQ-ELIMINATIVE.
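The abstract does not reproduce the paper's prompt templates, so the following is only a minimal sketch of a generic few-shot chain-of-thought setup for open-ended medical questions, in the spirit of CLINICR. The exemplar text, the `build_cot_prompt` helper, and the `COT_TRIGGER` cue are illustrative assumptions, not the templates used in the paper.

```python
# Sketch: assembling a few-shot chain-of-thought (CoT) prompt for open-ended
# medical QA. Exemplars and helper names are hypothetical placeholders.

# Hypothetical few-shot exemplars: (question, step-by-step reasoning, answer).
FEW_SHOT_EXEMPLARS = [
    (
        "A 62-year-old man presents with crushing substernal chest pain "
        "radiating to the left arm. What is the most likely diagnosis?",
        "The pain pattern suggests myocardial ischemia; the age and acute "
        "presentation make acute coronary syndrome the leading concern.",
        "Acute myocardial infarction",
    ),
    # ... additional exemplars would go here (the paper compares against a
    # 5-shot CoT baseline, so five worked examples is a reasonable default).
]

COT_TRIGGER = "Let's think step by step."  # standard CoT reasoning cue


def build_cot_prompt(question: str, exemplars=FEW_SHOT_EXEMPLARS) -> str:
    """Assemble a few-shot CoT prompt for an open-ended medical question."""
    parts = []
    for q, reasoning, answer in exemplars:
        parts.append(
            f"Question: {q}\n{COT_TRIGGER}\n{reasoning}\nAnswer: {answer}\n"
        )
    # The test question ends with the CoT trigger so the model produces its
    # own incremental reasoning before committing to a final answer.
    parts.append(f"Question: {question}\n{COT_TRIGGER}\n")
    return "\n".join(parts)


if __name__ == "__main__":
    prompt = build_cot_prompt(
        "A 25-year-old woman has fatigue, weight gain, and cold intolerance. "
        "What is the most likely diagnosis?"
    )
    print(prompt)  # this string would then be sent to the LLM being evaluated
```

In the paper's MCQ pipeline, the free-text reasoning produced by such a prompt is then either narrowed to a final choice by eliminative prompting (MCQ-ELIMINATIVE) or scored by a reward model for verification.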
- Katherine A. Batterton and Kimberly N. Hale. 2017. The Likert scale: what it is and how to use it. Phalanx, 50(2):32–39.
- E. Bolton et al. 2022. PubMedGPT 2.7B. Technical report, Stanford University Center for Research on Foundation Models.
- Evaluating large language models trained on code. arXiv preprint arXiv:2107.03374.
- Scaling instruction-finetuned language models. arXiv preprint arXiv:2210.11416.
- The future landscape of large language models in medicine. Communications Medicine, 3(1):141.
- Compositional semantic parsing with large language models. arXiv preprint arXiv:2209.15003.
- A survey on automated fact-checking. Transactions of the Association for Computational Linguistics, 10:178–206.
- Measuring massive multitask language understanding. arXiv preprint arXiv:2009.03300.
- What disease does this patient have? A large-scale open domain question answering dataset from medical exams. Applied Sciences, 11(14):6421.
- PubMedQA: A dataset for biomedical research question answering. arXiv preprint arXiv:1909.06146.
- Large language models are zero-shot reasoners. Advances in Neural Information Processing Systems, 35:22199–22213.
- Can large language models reason about medical questions? arXiv preprint arXiv:2207.08143.
- Explainable AI for clinical risk prediction: a survey of concepts, methods, and modalities. arXiv preprint arXiv:2308.08407.
- Show your work: Scratchpads for intermediate computation with language models. arXiv preprint arXiv:2112.00114.
- MedMCQA: A large-scale multi-subject multi-choice dataset for medical domain question answering. In Conference on Health, Inference, and Learning, pages 248–260. PMLR.
- Large language models encode clinical knowledge. arXiv preprint arXiv:2212.13138.
- Large language models in medicine. Nature Medicine, 29(8):1930–1940.
- Llama 2: Open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288.
- BioMedLM: a domain-specific large language model for biomedical text. MosaicML. Accessed: Dec. 23.
- Secrets of RLHF in large language models, Part II: Reward modeling.
- Chain-of-thought prompting elicits reasoning in large language models. Advances in Neural Information Processing Systems, 35:24824–24837.
- Medical exam question answering with large-scale reading comprehension. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 32.
- Learning to prompt for vision-language models. International Journal of Computer Vision, 130(9):2337–2348.