NewsQA: A Machine Comprehension Dataset

Published 29 Nov 2016 in cs.CL and cs.AI | (1611.09830v3)

Abstract: We present NewsQA, a challenging machine comprehension dataset of over 100,000 human-generated question-answer pairs. Crowdworkers supply questions and answers based on a set of over 10,000 news articles from CNN, with answers consisting of spans of text from the corresponding articles. We collect this dataset through a four-stage process designed to solicit exploratory questions that require reasoning. A thorough analysis confirms that NewsQA demands abilities beyond simple word matching and recognizing textual entailment. We measure human performance on the dataset and compare it to several strong neural models. The performance gap between humans and machines (0.198 in F1) indicates that significant progress can be made on NewsQA through future research. The dataset is freely available at https://datasets.maluuba.com/NewsQA.

Summary

The paper presents NewsQA, a dataset of over 100,000 question-answer pairs that demand advanced reasoning beyond simple lexical matching.
Its four-stage construction from article curation to rigorous validation ensures a diverse and complex set of comprehension challenges.
Evaluation reveals a notable gap between human performance and current neural models, highlighting opportunities to improve machine comprehension.

The paper introduces NewsQA, a large dataset for machine comprehension (MC) comprising over 100,000 crowd-sourced question-answer pairs derived from CNN news articles. Its significance lies in its scale, the diversity and complexity of its questions, and the range of reasoning required to answer them, which together set it apart from earlier MC datasets.

Dataset Construction and Characteristics

The construction of NewsQA follows a four-stage process: article curation, question sourcing, answer sourcing, and validation. Articles were randomly selected from CNN's archives to ensure broad topic coverage. Questioners formulated questions after seeing only an article's headline and summary points, which encourages curiosity-driven questions and discourages straightforward word matching against the article text. Answerers, a separate set of crowdworkers, then answered these questions by marking the corresponding answer spans in the full article text. A validation stage followed, in which additional workers chose the best answer from the collected candidates or rejected all of them.
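The validation stage can be pictured as a simple vote over candidate spans. The sketch below is illustrative only: the `Span` type, the function name `select_validated_answer`, and the plurality rule with first-seen tie-breaking are assumptions made for this example, not the aggregation procedure described in the paper.

```python
from collections import Counter
from typing import Optional, Sequence, Tuple

# Hypothetical representation: (start, end) character offsets into the article.
Span = Tuple[int, int]


def select_validated_answer(votes: Sequence[Optional[Span]]) -> Optional[Span]:
    """Pick the candidate span with the most validator votes.

    A vote of None means the validator rejected every candidate answer.
    Returns None when there are no votes or when rejection wins the vote.
    Ties go to the option that was voted for first (an assumed rule).
    """
    if not votes:
        return None
    counts = Counter(votes)
    winner, _ = counts.most_common(1)[0]
    return winner


# Two validators agree on one span and one rejects all candidates.
chosen = select_validated_answer([(120, 135), (120, 135), None])
```

Under this assumed rule, a question whose validators mostly reject the candidates ends up with no accepted answer.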

Several characteristics distinguish NewsQA from previous datasets:

Answers are text spans of arbitrary length within the article.
Some questions have no answer in the text.
No candidate answers are provided, increasing complexity.
Collection methodology promotes lexical and syntactic divergence and requires a significant proportion of reasoning.

These attributes contribute to the dataset's difficulty, making it a valuable benchmark for advancing MC research.
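To make the span-based answer format concrete, here is a minimal sketch of how one example might be represented in code. The class name `SpanExample`, its field names, and the use of character offsets are assumptions for illustration; they do not reflect the released dataset's actual schema.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class SpanExample:
    """One question paired with an article and an optional answer span."""

    article: str
    question: str
    start: Optional[int] = None  # inclusive character offset; None if unanswerable
    end: Optional[int] = None    # exclusive character offset; None if unanswerable

    def answer_text(self) -> Optional[str]:
        """Return the answer span's text, or None when the article has no answer."""
        if self.start is None or self.end is None:
            return None
        return self.article[self.start:self.end]


article = "The court ruled on Tuesday."
start = article.index("Tuesday")
answerable = SpanExample(article, "When did the court rule?", start, start + len("Tuesday"))
unanswerable = SpanExample(article, "Who was the judge?")
```

Because no candidate answers are supplied, a model has to choose both offsets from every possible span in the article, or decide that no span applies.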

Comparison with Existing Datasets

The paper situates NewsQA in the landscape of existing MC datasets, including MCTest, CNN/Daily Mail, CBT, BookTest, and SQuAD. Compared to these, NewsQA stands out in several ways:

MCTest offers rich, reasoning-required questions but is too small for deep learning approaches.
CNN/Daily Mail and CBT are large, but their cloze-style questions are generated synthetically and may not rigorously test reasoning and comprehension.
SQuAD shares similarities with NewsQA in using human-generated questions and non-synthetic text spans but is less challenging due to higher word-matching rates and simpler reasoning requirements.

Analysis and Evaluation

The paper's analysis characterizes both the answers and the reasoning that NewsQA requires. Answer types cover a range of linguistic structures, including clauses as well as entities. An assessment of reasoning types shows that a significant portion of questions requires synthesis and inference, beyond simple word matching or paraphrasing.

A key finding is the substantial gap between human performance and the neural models evaluated (match-LSTM and BARB). Human evaluators achieved an F1 score of 0.694, whereas the best model reached only about 0.500 F1, in line with the gap of roughly 0.2 F1 (0.198) reported in the abstract. This difference suggests that strong results on NewsQA will require comprehension models capable of more complex reasoning.
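For readers unfamiliar with the metric, the following is a simplified sketch of token-level F1 between a predicted span and a reference span. It assumes lowercased whitespace tokenization and a single reference; evaluation scripts used in practice typically apply further normalization and may compare against several reference answers, so this is not the paper's exact scoring code.

```python
from collections import Counter


def token_f1(prediction: str, reference: str) -> float:
    """Token-overlap F1 between two answer strings (simplified)."""
    pred_tokens = prediction.lower().split()
    ref_tokens = reference.lower().split()
    # If either side is empty, score 1.0 only when both are empty.
    if not pred_tokens or not ref_tokens:
        return float(pred_tokens == ref_tokens)
    # Multiset intersection counts each shared token at most min(count) times.
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


# A prediction that contains the reference plus one extra token.
score = token_f1("on Tuesday", "Tuesday")  # precision 1/2, recall 1/1
```

Unlike exact match, this measure gives partial credit to a prediction that overlaps the reference without reproducing it exactly.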

Implications and Future Research Directions

The introduction of NewsQA has several implications for the field of MC:

The variety and complexity of questions promote the development of models that can perform advanced reasoning and information synthesis.
The performance gap between humans and models highlights where current systems fall short, pointing to targeted improvements in model design and training methodology.
The dataset's scale makes it suitable for training data-intensive deep learning models while maintaining task complexity that simulates real-world comprehension scenarios.

Future research might focus on several avenues:

Improved Model Architectures: Enhancing existing models or developing new ones specifically aimed at tackling reasoning and synthesis.
Transfer Learning: Utilizing pre-trained models on complementary tasks to boost performance on NewsQA.
Human-Informed Evaluation Metrics: Developing evaluation frameworks that better capture the nuances of human comprehension, given the limitations of metrics like F1 and exact match.

In conclusion, NewsQA is a significant contribution to the MC domain: a large, challenging dataset built from real news text and human-written questions. Because current models remain well below human performance on it, NewsQA provides a clear benchmark for developing more capable MC systems.