DROP: A Reading Comprehension Benchmark Requiring Discrete Reasoning Over Paragraphs

Published 1 Mar 2019 in cs.CL | (1903.00161v2)

Abstract: Reading comprehension has recently seen rapid progress, with systems matching humans on the most popular datasets for the task. However, a large body of work has highlighted the brittleness of these systems, showing that there is much work left to be done. We introduce a new English reading comprehension benchmark, DROP, which requires Discrete Reasoning Over the content of Paragraphs. In this crowdsourced, adversarially-created, 96k-question benchmark, a system must resolve references in a question, perhaps to multiple input positions, and perform discrete operations over them (such as addition, counting, or sorting). These operations require a much more comprehensive understanding of the content of paragraphs than what was necessary for prior datasets. We apply state-of-the-art methods from both the reading comprehension and semantic parsing literature on this dataset and show that the best systems only achieve 32.7% F1 on our generalized accuracy metric, while expert human performance is 96.0%. We additionally present a new model that combines reading comprehension methods with simple numerical reasoning to achieve 47.0% F1.

Authors (6)

Citations (809)

Summary

The paper presents DROP, a benchmark that evaluates reading comprehension systems on discrete reasoning tasks including arithmetic, counting, and sorting.
It exposes significant gaps in state-of-the-art models, with baseline F1 scores as low as 28.85% compared to human performance of 96.42%.
The paper introduces NAQANet, a hybrid model combining neural and symbolic methods to boost performance on complex reasoning tasks.

Overview of DROP: A Reading Comprehension Benchmark Requiring Discrete Reasoning Over Paragraphs

The paper, "DROP: A Reading Comprehension Benchmark Requiring Discrete Reasoning Over Paragraphs," introduces the DROP benchmark, which is aimed at advancing the capabilities of reading comprehension systems. Despite rapid progress in the field, existing benchmarks rarely require discrete reasoning, so systems can score well on them without a full understanding of the passage. DROP therefore consists of questions that can only be answered by resolving references in the question to positions in the paragraph and then performing operations such as arithmetic, sorting, or counting over them.

Core Contributions

The paper highlights several key contributions, focusing on the development of DROP, its challenging nature, and the baseline results on the dataset:

Dataset Composition: The DROP benchmark consists of 96,567 crowdsourced questions written about Wikipedia passages, with an emphasis on narratives that contain many numerical references, such as sports summaries and historical descriptions. Answering them requires more complex operations than previous datasets did.
Question Diversity: DROP's questions encompass various discrete reasoning tasks, including addition, subtraction, counting, sorting, and comparisons. This diversity ensures that the benchmark tests multiple aspects of reading comprehension, pushing the state-of-the-art methods to extend beyond surface-level understanding.
Baseline Performance: Applying state-of-the-art models to DROP reveals large performance gaps. For instance, BiDAF achieves only 28.85% F1, and the abstract reports 32.7% F1 for the best baseline system, while human performance stands at 96.42%. This contrast underscores both the dataset's difficulty and the limitations of existing models.
Introduction of NAQANet: The authors present NAQANet, a novel model combining standard reading comprehension techniques with simple numerical reasoning. NAQANet achieves a significant performance boost, reaching 47.0% F1 on the DROP benchmark, indicating the potential for hybrid models that integrate neural and symbolic methods.
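This summary does not spell out how NAQANet performs its "simple numerical reasoning". One simple scheme in that spirit is to give every number in the passage a sign of plus, minus, or zero and add up the result. In a neural model the signs would be predicted; the minimal sketch below just enumerates them by brute force. The function names and the example numbers are hypothetical, and the code is not presented as the authors' implementation.

```python
from itertools import product

def signed_combinations(numbers):
    """Yield (signs, total) for every assignment of +1, 0, or -1 to each number."""
    for signs in product((1, 0, -1), repeat=len(numbers)):
        yield signs, sum(s * n for s, n in zip(signs, numbers))

def find_expression(numbers, target):
    """Return the first sign assignment whose signed sum equals target, or None."""
    for signs, total in signed_combinations(numbers):
        # Skip the all-zero assignment so a target of 0 needs a real expression.
        if total == target and any(signs):
            return signs
    return None

# Hypothetical numbers read from a passage (for example, field goal distances).
passage_numbers = [25, 40, 31]

# "How many yards longer was the 40-yard kick than the 25-yard kick?" -> 15
signs = find_expression(passage_numbers, 15)
print(signs)  # (-1, 1, 0), i.e. -25 + 40 + 0*31
```

The search space grows as 3^n in the number of passage numbers, which is why a learned model that predicts one sign per number is more practical than enumeration on real passages.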

Methodology

Data Collection and Validation

The dataset was built with a process designed to keep the questions hard. Passages were first selected for their numerical content and narrative richness. Crowd workers were then asked to write challenging questions, with adversarial baselines used during collection to filter out questions that existing systems could answer easily. The final dataset was validated and split into training, development, and test sets, with additional annotations to improve quality and reliability.
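The adversarial step described above can be pictured as a simple acceptance filter: a question is kept only if a baseline reader fails to answer it. The sketch below is a toy illustration under that assumption. `accept_question`, `first_number_baseline`, and the example passage are hypothetical names and data, and the toy baseline stands in for whatever trained model is actually used.

```python
from typing import Callable

def accept_question(
    passage: str,
    question: str,
    gold_answer: str,
    baseline: Callable[[str, str], str],
) -> bool:
    """Accept a question only if the baseline's prediction misses the gold answer."""
    prediction = baseline(passage, question)
    return prediction.strip().lower() != gold_answer.strip().lower()

def first_number_baseline(passage: str, question: str) -> str:
    """Toy stand-in for a trained reader: always answers with the first number."""
    for token in passage.split():
        if token.isdigit():
            return token
    return ""

# Hypothetical passage and questions, written for illustration only.
passage = "The home team scored 14 points and the visitors scored 10 points."

# The baseline finds "14" directly, so this question is rejected.
easy_ok = accept_question(
    passage, "How many points did the home team score?", "14", first_number_baseline
)
# The answer requires adding 14 and 10, so the baseline fails and the question is kept.
hard_ok = accept_question(
    passage, "How many points were scored in total?", "24", first_number_baseline
)
print(easy_ok, hard_ok)  # False True
```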

Discrete Reasoning Types

The questions in DROP span multiple reasoning types:

Arithmetic Operations: Involving addition, subtraction, and other numerical computations.
Comparisons and Sorting: Requiring the systems to compare quantities or sort items based on specific attributes.
Counting: Tasks that involve counting occurrences of entities or events within the passage.
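To make the three reasoning types concrete, the sketch below applies each one to the numbers found in a short passage. The passage is invented for illustration and is not taken from DROP, and the regular-expression number extraction is a simplifying assumption rather than part of the benchmark.

```python
import re

# Hypothetical passage, written for illustration only (not taken from DROP).
PASSAGE = (
    "The Bears scored a 25-yard field goal in the first quarter. "
    "The Lions answered with a 40-yard field goal and later added a 31-yard field goal."
)

def extract_numbers(text):
    """Return every integer in the text, in order of appearance."""
    return [int(tok) for tok in re.findall(r"\d+", text)]

numbers = extract_numbers(PASSAGE)  # [25, 40, 31]

# Arithmetic: how many yards longer was the longest field goal than the shortest?
difference = max(numbers) - min(numbers)

# Comparison and sorting: how long was the second-longest field goal?
second_longest = sorted(numbers, reverse=True)[1]

# Counting: how many field goals are mentioned in the passage?
field_goal_count = PASSAGE.count("field goal")

print(difference, second_longest, field_goal_count)  # 15 31 3
```

Note that the answers 15 and 3 never appear in the passage itself, so a system that can only copy a span of text cannot produce them.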

Baseline Models

The authors tested several baseline models, including:

BiDAF: A bidirectional attention flow model.
QANet: A model built from convolution and self-attention instead of recurrence, with strong results on other reading comprehension tasks.
BERT: A transformer-based pre-trained model showing impressive results across various NLP tasks.

Despite their established efficacy elsewhere, these models struggled on DROP, which underscores how difficult the benchmark is.

Implications and Future Directions

The introduction of DROP sets a new standard for reading comprehension benchmarks, highlighting the necessity for more advanced reasoning capabilities in NLP systems. The findings suggest several directions for future research:

Enhanced Reasoning Models: There is a need for models that can integrate discrete symbolic reasoning with neural architectures to perform complex operations effectively.
Fine-grained Evaluation Metrics: Development of more nuanced evaluation metrics that can account for partial successes in complex reasoning tasks.
Cross-domain Generalization: Exploring model performance across various domains to ensure robustness and adaptability.
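The idea of crediting partial successes can be illustrated with a token-level F1 score, which rewards a prediction for the words it shares with the reference answer. The sketch below is a generic bag-of-words F1 and is only an approximation for illustration: the official DROP scorer may normalize answers and treat numbers differently, and `token_f1` is a hypothetical helper name.

```python
from collections import Counter

def token_f1(prediction: str, gold: str) -> float:
    """Bag-of-words F1 between a predicted and a reference answer string."""
    pred_tokens = prediction.lower().split()
    gold_tokens = gold.lower().split()
    # If either side is empty, score 1.0 only when both are empty.
    if not pred_tokens or not gold_tokens:
        return float(pred_tokens == gold_tokens)
    # Multiset intersection counts each shared token at most as often as it
    # appears on both sides.
    overlap = sum((Counter(pred_tokens) & Counter(gold_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)

# A partially correct text answer earns partial credit.
partial = token_f1("Chicago Bears", "the Chicago Bears")
# A wrong number earns nothing, because a single token either matches or does not.
wrong_number = token_f1("14", "15")
```

With this metric, a prediction of "Chicago Bears" against "the Chicago Bears" has precision 1 and recall 2/3, for an F1 of 0.8, while a wrong number scores 0. That asymmetry is one reason finer-grained metrics for numerical answers are an open design question.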

Conclusion

The DROP benchmark represents a significant step forward in evaluating how well reading comprehension systems handle complex reasoning. The paper describes the dataset, demonstrates the limitations of current models on it, and proposes NAQANet as a first approach that narrows, but does not close, the gap to human performance. These results lay the groundwork for more sophisticated NLP systems that combine reading with discrete reasoning.
