Rapidly Bootstrapping a Question Answering Dataset for COVID-19 (2004.11339v1)

Published 23 Apr 2020 in cs.CL, cs.AI, and cs.IR

Abstract: We present CovidQA, the beginnings of a question answering dataset specifically designed for COVID-19, built by hand from knowledge gathered from Kaggle's COVID-19 Open Research Dataset Challenge. To our knowledge, this is the first publicly available resource of its type, and intended as a stopgap measure for guiding research until more substantial evaluation resources become available. While this dataset, comprising 124 question-article pairs as of the present version 0.1 release, does not have sufficient examples for supervised machine learning, we believe that it can be helpful for evaluating the zero-shot or transfer capabilities of existing models on topics specifically related to COVID-19. This paper describes our methodology for constructing the dataset and presents the effectiveness of a number of baselines, including term-based techniques and various transformer-based models. The dataset is available at http://covidqa.ai/

Summary

The paper introduces CovidQA, a curated QA dataset pairing COVID-19 questions with evidence from scientific articles.
It details a manual annotation methodology linking questions to precise answer sentences from the CORD-19 corpus.
Baseline results show that unsupervised BM25 and transfer learning T5 models yield promising retrieval performance.

Overview of "Rapidly Bootstrapping a Question Answering Dataset for COVID-19"

The paper "Rapidly Bootstrapping a Question Answering Dataset for COVID-19" presents CovidQA, a question answering dataset built specifically for COVID-19 topics. The dataset draws on knowledge gathered from Kaggle's COVID-19 Open Research Dataset Challenge and is intended as a resource for evaluating the zero-shot or transfer capabilities of existing models. At 124 question-article pairs in the version 0.1 release, it is too small for supervised model training but can serve as a test set for COVID-19-specific questions.

Construction Methodology

The authors construct CovidQA by hand from "answer tables" in curated notebooks submitted to the Kaggle challenge. Each entry in these tables associates a scientific article title with evidence relevant to a COVID-19-specific question. The dataset captures (question, scientific article, exact answer) triples, obtained by mapping the curated answers to verbatim passages in articles from the CORD-19 corpus. Questions are organized into categories, and their wording keeps the domain-specific terminology of the source material.
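The triple structure described above can be sketched as a small record type. This is a minimal illustration, not the released file format: the class name `AnswerTriple`, its field names, and the helper functions are assumptions introduced here, and the strings in the usage note are placeholders, not dataset content.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class AnswerTriple:
    """One (question, scientific article, exact answer) record (hypothetical layout)."""
    question: str       # the question being asked
    article_title: str  # title of the article that holds the evidence
    exact_answer: str   # text copied verbatim from that article


def answer_is_verbatim(triple: AnswerTriple, article_text: str) -> bool:
    """Return True if the exact answer appears verbatim in the article text."""
    return triple.exact_answer in article_text


def group_by_question(triples: List[AnswerTriple]) -> Dict[str, List[AnswerTriple]]:
    """Collect all triples that share a question, since one question can map to several articles."""
    grouped: Dict[str, List[AnswerTriple]] = {}
    for triple in triples:
        grouped.setdefault(triple.question, []).append(triple)
    return grouped
```

For example, `answer_is_verbatim(AnswerTriple("Q", "Title A", "placeholder span"), "text with placeholder span inside")` returns `True`. A check of this kind mirrors the requirement that answers be verbatim passages rather than paraphrases.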

Several manual annotation strategies were used to identify answer spans accurately within articles. Challenges such as keeping answers within sentence scope and handling domain specificity are addressed by deriving multiple queries from broader topics. To keep evaluation simple, models are judged on whether they identify the sentences that contain the answer, not on whether they recover precise span boundaries.
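Sentence-level evaluation of this kind can be sketched as follows. The metrics shown (reciprocal rank, precision at 1, recall at k) are common ranking measures chosen here for illustration, and the substring test for "sentence contains the answer" is a simplifying assumption; an answer that spans several sentences would need a more careful overlap check.

```python
from typing import List


def relevance_labels(ranked_sentences: List[str], exact_answer: str) -> List[bool]:
    """Mark each ranked sentence as relevant if it contains the exact answer text."""
    return [exact_answer in sentence for sentence in ranked_sentences]


def reciprocal_rank(labels: List[bool]) -> float:
    """Return 1/rank of the first relevant sentence, or 0.0 if none is relevant."""
    for rank, is_relevant in enumerate(labels, start=1):
        if is_relevant:
            return 1.0 / rank
    return 0.0


def precision_at_1(labels: List[bool]) -> float:
    """Return 1.0 if the top-ranked sentence is relevant, else 0.0."""
    return 1.0 if labels and labels[0] else 0.0


def recall_at_k(labels: List[bool], k: int) -> float:
    """Return the fraction of relevant sentences that appear in the top k."""
    total_relevant = sum(labels)
    if total_relevant == 0:
        return 0.0
    return sum(labels[:k]) / total_relevant
```

Because only sentence identity matters, a model is not penalized for imprecise span boundaries as long as it ranks an answer-bearing sentence highly.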

Evaluation Design and Baselines

CovidQA is designed to evaluate one stage of a multistage end-to-end search system such as the Neural Covidex: it serves as a testbed for judging the relevance of sentences that contain answers. The paper reports baselines in two groups, unsupervised methods and out-of-domain supervised models:

Unsupervised Methods: BM25 and several BERT-based models, including SciBERT and BioBERT, were evaluated on ranking sentences by relevance to a query. BM25 was the strongest unsupervised method, outperforming the neural models.
Out-of-Domain Supervised Models: BioBERT and BERT fine-tuned on datasets such as MS MARCO and SQuAD, along with T5 fine-tuned on MS MARCO, served as the supervised baselines. These configurations were more effective than their unsupervised counterparts, indicating the value of transfer learning in the absence of an adequately large COVID-19-specific training dataset.
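To make the term-based baseline concrete, the sketch below ranks the sentences of one article against a query with a textbook BM25 formula. It is a simplified illustration, not the authors' implementation: the whitespace tokenizer, the function names, and the parameter values `k1=1.2` and `b=0.75` are assumptions, and statistics are computed over the article's own sentences only.

```python
import math
from collections import Counter
from typing import List


def tokenize(text: str) -> List[str]:
    """Lowercase and split on whitespace (a deliberately naive tokenizer)."""
    return text.lower().split()


def bm25_rank(query: str, sentences: List[str], k1: float = 1.2, b: float = 0.75) -> List[int]:
    """Return sentence indices ordered from highest to lowest BM25 score."""
    docs = [tokenize(sentence) for sentence in sentences]
    n = len(docs)
    if n == 0:
        return []
    # Average sentence length; fall back to 1.0 if every sentence is empty.
    avg_len = (sum(len(doc) for doc in docs) / n) or 1.0

    # Document frequency: in how many sentences each term occurs.
    doc_freq = Counter()
    for doc in docs:
        doc_freq.update(set(doc))

    query_terms = set(tokenize(query))
    scores = []
    for doc in docs:
        term_freq = Counter(doc)
        score = 0.0
        for term in query_terms:
            if term not in term_freq:
                continue
            # Non-negative IDF variant: log(1 + (N - df + 0.5) / (df + 0.5)).
            idf = math.log(1 + (n - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5))
            tf = term_freq[term]
            length_norm = k1 * (1 - b + b * len(doc) / avg_len)
            score += idf * tf * (k1 + 1) / (tf + length_norm)
        scores.append(score)

    # Stable sort keeps the original order among equally scored sentences.
    return sorted(range(n), key=lambda i: scores[i], reverse=True)
```

The returned index list can be mapped back to sentences and scored with any sentence-level ranking metric. A neural reranker would replace only the scoring step, which is what makes BM25 a convenient point of comparison.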

Results and Implications

Empirically, T5 was the most effective of the tested models, especially when given natural language questions. The results suggest that models benefit from well-formed questions rather than keyword queries. Despite its small size, CovidQA offers usable initial guidance as a test set for ongoing NLP research.

Discussion

While CovidQA is insufficient for supervised model training, it is, to the authors' knowledge, the first publicly available QA dataset focused on COVID-19. It is intended as a stopgap resource until more comprehensive datasets become available. The authors reference parallel initiatives, acknowledging that larger-scale projects, possibly with access to richer domain expertise, will likely supersede their efforts.

The significant manual effort required to construct CovidQA underscores the need for rapid methodologies for building domain-specific evaluation resources, especially in crisis scenarios like the COVID-19 pandemic. Future work could refine this methodology and extend the dataset with "no answer" documents, which would test a model's ability to recognize that a given document does not contain an answer.

Conclusions

This paper illustrates a pragmatic approach to creating domain-specific QA datasets under urgent conditions and highlights challenges in transferring existing models into rapidly evolving contexts. The methodology and results can inform other urgent, domain-specific adaptations in NLP, fostering discussions on accelerating the creation of evaluation resources in response to evolving global events.
