Improving Health Question Answering with Reliable and Time-Aware Evidence Retrieval (2404.08359v1)
Abstract: In today's digital world, seeking answers to health questions on the Internet is a common practice. However, existing question answering (QA) systems often rely on pre-selected and annotated evidence documents, which makes them inadequate for addressing novel questions. Our study focuses on the open-domain QA setting, where the key challenge is to first uncover relevant evidence in large knowledge bases. Using the common retrieve-then-read QA pipeline and PubMed as a trustworthy collection of medical research documents, we answer health questions from three diverse datasets. We vary several retrieval settings to observe their influence on the QA pipeline's performance, including the number of retrieved documents, the sentence selection process, the publication year of articles, and their number of citations. Our results reveal that reducing the number of retrieved documents and favoring more recent and highly cited documents can improve the final macro F1 score by up to 10%. We discuss the results, highlight interesting examples, and outline challenges for future research, such as managing evidence disagreement and crafting user-friendly explanations.
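The retrieval settings named in the abstract (number of retrieved documents, publication year, citation count) can be illustrated with a minimal sketch. It assumes each retrieved document already carries a similarity score, a publication year, and a citation count. The `Doc` class, the `filter_and_rank` function, the thresholds, and the sample values are hypothetical illustrations, not the authors' implementation.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Doc:
    """Hypothetical record for one retrieved abstract."""
    doc_id: str
    similarity: float  # assumed retriever score, higher means more relevant
    year: int          # publication year
    citations: int     # citation count


def filter_and_rank(
    docs: List[Doc],
    top_k: int = 5,
    min_year: Optional[int] = None,
    min_citations: int = 0,
) -> List[Doc]:
    """Drop documents below the year/citation thresholds,
    then keep the top_k remaining documents by similarity."""
    kept = [
        d for d in docs
        if (min_year is None or d.year >= min_year)
        and d.citations >= min_citations
    ]
    kept.sort(key=lambda d: d.similarity, reverse=True)
    return kept[:top_k]


# Made-up sample data for illustration only.
sample_docs = [
    Doc("a", 0.91, 2004, 12),
    Doc("b", 0.88, 2019, 150),
    Doc("c", 0.80, 2021, 3),
    Doc("d", 0.75, 2016, 40),
]

if __name__ == "__main__":
    selected = filter_and_rank(sample_docs, top_k=2, min_year=2015, min_citations=10)
    print([d.doc_id for d in selected])
```

Hard thresholds are only one way to favor recent and highly cited evidence; a weighted score that combines similarity with recency and citations would be an alternative design under the same assumptions.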