Reliable Evaluations for Natural Language Inference based on a Unified Cross-dataset Benchmark
arXiv:2010.07676

Abstract
Recent studies show that crowd-sourced Natural Language Inference (NLI) datasets may suffer from significant biases such as annotation artifacts. Models that exploit these superficial clues gain illusory advantages on the in-domain test set, leading to overestimated evaluation results. The lack of trustworthy evaluation settings and benchmarks stalls the progress of NLI research. In this paper, we propose to assess a model's trustworthy generalization performance via cross-dataset evaluation. We present a new unified cross-dataset benchmark comprising 14 NLI datasets, and re-evaluate 9 widely used neural network-based NLI models as well as 5 recently proposed debiasing methods for annotation artifacts. Our proposed evaluation scheme and experimental baselines provide a basis for future reliable NLI research.
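To make the evaluation scheme concrete, here is a minimal sketch of cross-dataset evaluation: a model is trained on one dataset's training split and scored on every dataset's test split, so that generalization is read off the held-out (out-of-domain) datasets rather than the in-domain one. The `train_fn` helper, the dataset layout, and the accuracy metric are illustrative assumptions, not the paper's actual code or benchmark interface.

```python
from typing import Callable, Dict, List, Tuple

# One NLI example: (premise, hypothesis, gold label).
Example = Tuple[str, str, str]
# A trained model maps (premise, hypothesis) to a predicted label.
Model = Callable[[str, str], str]


def cross_dataset_eval(
    train_fn: Callable[[List[Example]], Model],
    datasets: Dict[str, Dict[str, List[Example]]],
    source: str,
) -> Dict[str, float]:
    """Train on `source`'s training split, then report accuracy on the
    test split of every dataset in the benchmark. All entries except
    `source` are out-of-domain scores."""
    model = train_fn(datasets[source]["train"])
    scores = {}
    for name, splits in datasets.items():
        test = splits["test"]
        correct = sum(model(p, h) == y for p, h, y in test)
        scores[name] = correct / len(test)
    return scores
```

Under this scheme, a model that relies on annotation artifacts of the source dataset would show a large gap between its in-domain score and its out-of-domain scores, which is precisely the gap the benchmark is designed to expose.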