Reliable Evaluations for Natural Language Inference based on a Unified Cross-dataset Benchmark
arXiv:2010.07676

Abstract
Recent studies show that crowd-sourced Natural Language Inference (NLI) datasets may suffer from significant biases such as annotation artifacts. Models that exploit these superficial clues gain illusory advantages on the in-domain test set, leading to overestimated evaluation results. The lack of trustworthy evaluation settings and benchmarks stalls the progress of NLI research. In this paper, we propose to assess a model's trustworthy generalization performance via cross-dataset evaluation. We present a new unified cross-dataset benchmark comprising 14 NLI datasets, and re-evaluate 9 widely used neural network-based NLI models as well as 5 recently proposed debiasing methods for annotation artifacts. Our proposed evaluation scheme and experimental baselines provide a basis for future reliable NLI research.
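To make the evaluation scheme concrete, here is a minimal sketch of cross-dataset evaluation: a model is trained on one dataset's training split and scored on every dataset's test split, so that generalization is read off the held-out (out-of-domain) datasets rather than the in-domain one. The `train_fn` helper, the dataset layout, and the accuracy metric are illustrative assumptions, not the paper's actual code or benchmark interface.

```python
from typing import Callable, Dict, List, Tuple

# One NLI example: (premise, hypothesis, gold label).
Example = Tuple[str, str, str]
# A trained model maps (premise, hypothesis) to a predicted label.
Model = Callable[[str, str], str]


def cross_dataset_eval(
    train_fn: Callable[[List[Example]], Model],
    datasets: Dict[str, Dict[str, List[Example]]],
    source: str,
) -> Dict[str, float]:
    """Train on `source`'s training split, then report accuracy on the
    test split of every dataset in the benchmark. All entries except
    `source` are out-of-domain scores."""
    model = train_fn(datasets[source]["train"])
    scores = {}
    for name, splits in datasets.items():
        test = splits["test"]
        correct = sum(model(p, h) == y for p, h, y in test)
        scores[name] = correct / len(test)
    return scores
```

Under this scheme, a model that relies on annotation artifacts of the source dataset would show a large gap between its in-domain score and its out-of-domain scores, which is precisely the gap the benchmark is designed to expose.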