Quasar: Datasets for Question Answering by Search and Reading

Published 12 Jul 2017 in cs.CL, cs.IR, and cs.LG | (1707.03904v2)

Abstract: We present two new large-scale datasets aimed at evaluating systems designed to comprehend a natural language query and extract its answer from a large corpus of text. The Quasar-S dataset consists of 37000 cloze-style (fill-in-the-gap) queries constructed from definitions of software entity tags on the popular website Stack Overflow. The posts and comments on the website serve as the background corpus for answering the cloze questions. The Quasar-T dataset consists of 43000 open-domain trivia questions and their answers obtained from various internet sources. ClueWeb09 serves as the background corpus for extracting these answers. We pose these datasets as a challenge for two related subtasks of factoid Question Answering: (1) searching for relevant pieces of text that include the correct answer to a query, and (2) reading the retrieved text to answer the query. We also describe a retrieval system for extracting relevant sentences and documents from the corpus given a query, and include these in the release for researchers wishing to only focus on (2). We evaluate several baselines on both datasets, ranging from simple heuristics to powerful neural models, and show that these lag behind human performance by 16.4% and 32.1% for Quasar-S and -T respectively. The datasets are available at https://github.com/bdhingra/quasar .



Summary

The paper presents two datasets that combine retrieval and reading comprehension to benchmark QA systems.
It defines Quasar-S and Quasar-T with 37K cloze queries and 43K trivia questions, addressing domain-specific and open-domain challenges.
Baseline evaluations reveal substantial gaps between automated models and human performance, highlighting the need for improved QA techniques.

Evaluation of QUASAR: Datasets for Question Answering by Search and Reading

The paper introduces two datasets, Quasar-S and Quasar-T, designed to advance Question Answering (QA) research by testing both the comprehension of natural language queries and the extraction of answers from a large corpus of text. Developed by Dhingra et al., the datasets target the dual challenges of text retrieval and reading comprehension, so that a QA system can be evaluated end to end.

Datasets and Task Definition:

Quasar-S comprises 37,000 cloze-style (fill-in-the-gap) queries constructed from the definitions of software entity tags on Stack Overflow, with Stack Overflow posts and comments forming the background corpus.
Quasar-T, meanwhile, includes 43,000 open-domain trivia questions, with ClueWeb09 as the corpus from which answers are to be extracted.
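To make the cloze format concrete, the following is a minimal sketch of how a fill-in-the-gap query could be built from a definition. It is an illustration only: the function name `make_cloze`, the `@placeholder` token, and the example sentence are assumptions, not the authors' actual construction procedure or data format.

```python
import re

# Assumed blank token; the token used in the released data may differ.
PLACEHOLDER = "@placeholder"


def make_cloze(definition: str, answer: str, placeholder: str = PLACEHOLDER) -> str:
    """Replace the first whole-word occurrence of `answer` with a placeholder."""
    # Lookarounds keep "java" from matching inside "javascript".
    pattern = re.compile(
        r"(?<!\w)" + re.escape(answer) + r"(?!\w)", flags=re.IGNORECASE
    )
    # A callable replacement avoids escape handling in the placeholder string.
    cloze, n_replaced = pattern.subn(lambda match: placeholder, definition, count=1)
    if n_replaced == 0:
        raise ValueError("answer does not occur in the definition")
    return cloze


# Hypothetical definition text, written for this example.
query = make_cloze("numpy is a Python library for arrays", "python")
print(query)
```

The system under test then sees only the query with the gap and must recover the removed entity from the background corpus.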

The task combines the two halves of a QA system: locating passages that contain the answer, and extracting the correct answer from them. Quasar-S is confined to software-related content and so supports domain-specific research, while Quasar-T covers open-domain trivia.
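The search half of the task can be illustrated with a toy retriever that ranks passages by how many query-term occurrences they contain. This is a deliberately simple stand-in, not the retrieval system released with the datasets; the function names and the three-sentence corpus are assumptions made for the example.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def search(query, corpus, k=2):
    """Return up to k passages with the most query-term occurrences.

    Ties are broken by corpus order; passages with no overlap are dropped.
    """
    query_terms = set(tokenize(query))
    scored = []
    for index, passage in enumerate(corpus):
        counts = Counter(tokenize(passage))
        score = sum(counts[term] for term in query_terms)
        scored.append((score, index, passage))
    scored.sort(key=lambda item: (-item[0], item[1]))
    return [passage for score, _, passage in scored[:k] if score > 0]


# Hypothetical background corpus.
corpus = [
    "Paris is the capital of France.",
    "The Nile is a river in Africa.",
    "France borders Spain and Italy.",
]
top = search("What is the capital of France?", corpus, k=2)
print(top[0])
```

A reader model would then be applied only to the returned passages, so any answer the retriever misses cannot be recovered downstream.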

Evaluation and Baselines:

The research evaluates various baseline models on the datasets, ranging from simple heuristics to sophisticated neural models, underscoring the current performance limitations when compared with human baselines. Notably:

Human annotators reached 50% on Quasar-S and 60.6% on Quasar-T in the open-book setting, which illustrates how challenging these datasets are.
Baseline systems, including the maximum frequency and word distance heuristics, fall well short of human accuracy. In particular, the GA Reader and BiDAF, two widely used neural reading comprehension models, also trail human performance by wide margins, which indicates how difficult the datasets are.
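As an illustration of the simplest kind of baseline mentioned above, here is a minimal sketch of a maximum frequency heuristic: it returns the candidate answer that occurs most often in the retrieved passages. The exact variant evaluated in the paper may differ; the function names, tie-breaking rule, and example data are assumptions.

```python
import re


def count_occurrences(candidate, passage):
    """Count whole-word, case-insensitive occurrences of candidate in passage."""
    pattern = r"(?<!\w)" + re.escape(candidate) + r"(?!\w)"
    return len(re.findall(pattern, passage, flags=re.IGNORECASE))


def max_frequency_answer(candidates, passages):
    """Return the most frequent candidate, or None if none occurs.

    Ties go to the candidate listed first (an arbitrary choice for this sketch).
    """
    if not candidates:
        return None
    totals = {
        cand: sum(count_occurrences(cand, p) for p in passages)
        for cand in candidates
    }
    best = max(totals, key=totals.get)
    return best if totals[best] > 0 else None


# Hypothetical retrieved passages and candidate answers.
passages = [
    "Python is popular. Many people use python for scripting.",
    "Java is also popular.",
]
prediction = max_frequency_answer(["java", "python"], passages)
print(prediction)
```

The heuristic ignores the query entirely, which is one reason such baselines leave a large gap to human accuracy.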

The authors report that the best baselines trail human performance by 16.4% on Quasar-S and 32.1% on Quasar-T, so current automated systems remain far from human-level comprehension on both datasets.

Implications and Future Directions:

These datasets have considerable implications for the development and evaluation of QA systems, particularly those that work over unstructured text. They make clear that a QA system must combine strong retrieval with strong comprehension. That combination matters most in domains that require precise knowledge extraction, such as software engineering, which Quasar-S targets.

The study also suggests avenues for future work. Improved retrieval and neural reading models can be trained on the datasets to raise the joint performance of the retrieval and reading tasks. Further research may also examine the trade-off between search accuracy and reading accuracy in order to build robust QA pipelines.
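The relationship between search accuracy and reading accuracy can be sketched as follows. The definitions used here are assumptions made for illustration: search accuracy is the fraction of questions whose gold answer appears in the retrieved context, reading accuracy is the fraction of those questions the reader answers correctly, and overall accuracy is the product of the two. The record layout (`answer`, `context`, `prediction`) is hypothetical.

```python
def answer_in_context(example):
    """True if the gold answer string appears in any retrieved passage."""
    answer = example["answer"].strip().lower()
    return any(answer in passage.lower() for passage in example["context"])


def evaluate(examples):
    """Compute search, reading, and overall accuracy for a QA pipeline."""
    total = len(examples)
    hits = [ex for ex in examples if answer_in_context(ex)]
    correct = [
        ex for ex in hits
        if ex["prediction"].strip().lower() == ex["answer"].strip().lower()
    ]
    search_acc = len(hits) / total if total else 0.0
    reading_acc = len(correct) / len(hits) if hits else 0.0
    # Equals search_acc * reading_acc, computed directly to avoid rounding drift.
    overall = len(correct) / total if total else 0.0
    return {"search": search_acc, "reading": reading_acc, "overall": overall}


# Hypothetical pipeline outputs for four questions.
examples = [
    {"answer": "Paris", "context": ["Paris is the capital of France."], "prediction": "Paris"},
    {"answer": "Nile", "context": ["The Nile flows north."], "prediction": "Nile"},
    {"answer": "Python", "context": ["Python and Java are languages."], "prediction": "Java"},
    {"answer": "Rome", "context": ["Madrid is in Spain."], "prediction": "Madrid"},
]
scores = evaluate(examples)
print(scores)
```

Under these definitions, retrieving more context tends to raise search accuracy while making the reader's job harder, which is the trade-off a pipeline has to balance.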

In summary, the Quasar datasets serve as a comprehensive platform for fostering innovation in automated QA systems, reflecting both the complexity present in large corpora and contemporary research challenges. They provide a benchmark for evaluating methodologies that address the nuanced demands of open-domain and domain-specific knowledge extraction.
