What do Models Learn from Question Answering Datasets?

Published 7 Apr 2020 in cs.CL | (2004.03490v2)

Abstract: While models have reached superhuman performance on popular question answering (QA) datasets such as SQuAD, they have yet to outperform humans on the task of question answering itself. In this paper, we investigate if models are learning reading comprehension from QA datasets by evaluating BERT-based models across five datasets. We evaluate models on their generalizability to out-of-domain examples, responses to missing or incorrect data, and ability to handle question variations. We find that no single dataset is robust to all of our experiments and identify shortcomings in both datasets and evaluation methods. Following our analysis, we make recommendations for building future QA datasets that better evaluate the task of question answering through reading comprehension. We also release code to convert QA datasets to a shared format for easier experimentation at https://github.com/amazon-research/qa-dataset-converter.





Summary

The paper shows models often exploit heuristics like question-context overlap, achieving high scores without actual comprehension.
The paper finds that models fine-tuned on one QA dataset perform poorly on others, highlighting issues with cross-domain generalizability.
The paper underscores the need for more robust QA dataset designs to better evaluate true reading comprehension beyond superficial cues.

Insights into Model Learning from Question Answering Datasets

The paper "What do Models Learn from Question Answering Datasets?" by Sen and Saffari examines what models actually learn from question answering (QA) datasets, including popular ones such as SQuAD. Although models post superhuman scores on these benchmarks, they have yet to outperform humans at question answering itself. The study uses BERT-based models to probe how far QA datasets teach reading comprehension, assessing the models' ability to generalize across datasets, their robustness to data perturbations, and their handling of question variations.

The investigation covers five QA datasets: SQuAD 2.0, TriviaQA, Natural Questions (NQ), QuAC, and NewsQA. The authors tested models fine-tuned on one dataset against out-of-domain examples from the others and observed substantial drops in performance, which suggests limited generalizability. Notably, simple mechanisms such as question-context overlap or named entity extraction appear to support model performance without genuine comprehension: models still scored well despite randomized training labels or shuffled context sentences. These experiments underscore the gap between success on a test set and effective reading comprehension.
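The shuffled-context perturbation mentioned above can be illustrated with a minimal sketch. This is not the authors' code: the function names, the naive punctuation-based sentence splitter, and the choice to skip answers that cross a sentence boundary are all assumptions made for illustration.

```python
import random
import re


def split_sentences(context):
    """Return (start_offset, sentence) pairs using a naive punctuation rule."""
    pairs, cursor = [], 0
    for sentence in re.split(r"(?<=[.!?])\s+", context.strip()):
        if not sentence:
            continue
        start = context.find(sentence, cursor)
        pairs.append((start, sentence))
        cursor = start + len(sentence)
    return pairs


def shuffle_context(context, answer_start, answer_text, seed=0):
    """Shuffle sentence order and recompute the answer's character offset.

    Returns (new_context, new_answer_start), or None when the answer span
    is not fully contained in a single sentence.
    """
    pairs = split_sentences(context)
    answer_end = answer_start + len(answer_text)

    # Find the sentence that fully contains the answer span.
    home = None
    for idx, (start, sentence) in enumerate(pairs):
        if start <= answer_start and answer_end <= start + len(sentence):
            home = idx
            break
    if home is None:
        return None

    order = list(range(len(pairs)))
    random.Random(seed).shuffle(order)

    offset_in_sentence = answer_start - pairs[home][0]
    pieces, new_start, cursor = [], None, 0
    for idx in order:
        sentence = pairs[idx][1]
        if idx == home:
            new_start = cursor + offset_in_sentence
        pieces.append(sentence)
        cursor += len(sentence) + 1  # +1 for the joining space
    return " ".join(pieces), new_start
```

If a model's score barely changes when it is evaluated on contexts perturbed this way, that hints it is relying on local cues rather than on the passage's discourse structure. A real experiment would likely use a proper sentence tokenizer instead of the regular expression used here.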

The study also examines how models respond to question variations. Results indicate weaknesses in handling variations that add filler words or negation to the question. In particular, for SQuAD the performance drop on negated questions appears to stem not from linguistic understanding but from annotation biases or artifacts around negation.
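To make the question variations concrete, here is a minimal sketch of how filler-word and negation variants might be generated. The auxiliary-verb table, the default filler word, and both function names are hypothetical; the rules are deliberately crude and are not claimed to match the paper's procedure.

```python
# Illustrative mapping from auxiliary verbs to their negated contractions.
NEGATED = {
    "is": "isn't", "are": "aren't", "was": "wasn't", "were": "weren't",
    "did": "didn't", "does": "doesn't", "do": "don't", "can": "can't",
    "could": "couldn't", "will": "won't", "would": "wouldn't",
    "has": "hasn't", "have": "haven't", "had": "hadn't",
}


def add_filler(question, filler="really"):
    """Insert a filler word right after the first auxiliary verb.

    Returns the question unchanged when no auxiliary is found.
    """
    tokens = question.split()
    for i, tok in enumerate(tokens):
        if tok.lower() in NEGATED:
            return " ".join(tokens[: i + 1] + [filler] + tokens[i + 1:])
    return question


def negate(question):
    """Negate the first auxiliary verb, or return None if there is none."""
    tokens = question.split()
    for i, tok in enumerate(tokens):
        negated = NEGATED.get(tok.lower())
        if negated is not None:
            if tok[0].isupper():
                negated = negated.capitalize()
            return " ".join(tokens[:i] + [negated] + tokens[i + 1:])
    return None
```

A filler word should leave the correct answer unchanged, whereas a negated question generally changes it, so a model that attends to negation would be expected to answer the two variants differently. Position-based insertion can produce awkward phrasing; a part-of-speech tagger would give more natural variants.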

The implications are both practical and theoretical. Practically, the findings call for rethinking how QA datasets are constructed so that models cannot lean on easy heuristics: datasets should include varied question formulations, models should be evaluated across multiple datasets, and annotation methodologies should be re-examined to mitigate inherent biases.

Theoretically, the work adds to the understanding of model training and evaluation, showing the limited robustness of current approaches and motivating future work that moves models beyond statistical pattern matching toward genuine comprehension. Looking forward, the paper recommends standardized dataset formats to simplify cross-dataset comparison and evaluation, and urges the community to challenge models with questions as varied as those they are likely to encounter in real-world settings.
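As an illustration of what a shared format could look like, the sketch below converts flat records into a SQuAD 2.0-style JSON layout. The input keys (`passage`, `query`, `answer`) and the function name are hypothetical, and this is not a description of what the released converter does; it only shows the general idea of normalizing datasets to one schema.

```python
import json


def to_squad_style(records, title="converted"):
    """Convert flat QA records into a SQuAD 2.0-style dictionary.

    Each record is assumed to have 'passage', 'query', and 'answer' keys,
    where 'answer' is a string or None for unanswerable questions.
    """
    paragraphs = []
    for i, rec in enumerate(records):
        answer = rec.get("answer")
        # Locate the answer by its first exact match in the passage.
        start = rec["passage"].find(answer) if answer else -1
        answerable = start >= 0
        answers = [{"text": answer, "answer_start": start}] if answerable else []
        paragraphs.append({
            "context": rec["passage"],
            "qas": [{
                "id": f"{title}-{i}",
                "question": rec["query"],
                "is_impossible": not answerable,
                "answers": answers,
            }],
        })
    return {"version": "v2.0", "data": [{"title": title, "paragraphs": paragraphs}]}


if __name__ == "__main__":
    demo = [{"passage": "The Seine flows through Paris.",
             "query": "Which river flows through Paris?",
             "answer": "Seine"}]
    print(json.dumps(to_squad_style(demo), indent=2))
```

Once every dataset is in one layout, the same training and evaluation scripts can be pointed at any of them, which is what makes cross-dataset comparison cheap. Note that matching on the first occurrence of the answer string is a simplification; datasets with multiple mentions of the answer would need a more careful alignment.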

To conclude, this research highlights the gap between traditional performance measures and genuine understanding, and it calls for a shift in how QA datasets are built and used, an essential step toward further progress in AI.
