Semantic Sensitivities and Inconsistent Predictions: Measuring the Fragility of NLI Models

(2401.14440)
Published Jan 25, 2024 in cs.CL, cs.AI, cs.CY, and cs.LG

Abstract

Recent studies of the emergent capabilities of transformer-based Natural Language Understanding (NLU) models have indicated that they have an understanding of lexical and compositional semantics. We provide evidence that suggests these claims should be taken with a grain of salt: we find that state-of-the-art Natural Language Inference (NLI) models are sensitive to minor semantics-preserving surface-form variations, which lead to sizable inconsistent model decisions during inference. Notably, this behaviour differs from valid and in-depth comprehension of compositional semantics, yet it emerges neither when evaluating model accuracy on standard benchmarks nor when probing for syntactic, monotonic, and logically robust reasoning. We propose a novel framework to measure the extent of semantic sensitivity. To this end, we evaluate NLI models on adversarially generated examples containing minor semantics-preserving surface-form input noise. This is achieved using conditional text generation, with the explicit condition that the NLI model predicts the relationship between the original and adversarial inputs as a symmetric equivalence entailment. We systematically study the effects of the phenomenon across NLI models in both in-domain and out-of-domain settings. Our experiments show that semantic sensitivity causes average performance degradations of 12.92% and 23.71% in the in-domain and out-of-domain settings, respectively. We further perform ablation studies analysing this phenomenon across models, datasets, and variations in inference, and show that semantic sensitivity can lead to major inconsistencies within model predictions.
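
To make the symmetric equivalence condition concrete, the following is a minimal sketch of a bidirectional entailment filter built on an off-the-shelf NLI pipeline; the `roberta-large-mnli` checkpoint and the helper names are illustrative assumptions, not the paper's actual implementation.

```python
# Minimal sketch (not the paper's code): a surface-form variation is accepted
# only if the NLI model predicts entailment in both directions between the
# original hypothesis and the variation. The checkpoint is an assumption.
from transformers import pipeline

nli = pipeline("text-classification", model="roberta-large-mnli")

def top_prediction(clf, premise: str, hypothesis: str) -> dict:
    """Top label and score for an ordered (premise, hypothesis) pair."""
    out = clf({"text": premise, "text_pair": hypothesis})
    return out[0] if isinstance(out, list) else out  # return shape varies by version

def is_symmetric_equivalent(clf, original: str, variation: str) -> bool:
    """Keep a variation only if entailment holds in both directions."""
    forward = top_prediction(clf, original, variation)["label"]
    backward = top_prediction(clf, variation, original)["label"]
    return forward.upper().startswith("ENTAIL") and backward.upper().startswith("ENTAIL")
```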

Overview

  • The paper challenges the perceived comprehension abilities of transformer-based Natural Language Inference models, revealing significant sensitivity to variations in text that maintain semantic meaning.

  • A systematic framework measures semantic sensitivity by using LLMs to generate minor, meaning-preserving variations of hypothesis statements, demonstrating performance degradation when models evaluate these variations.

  • An examination of a variety of transformer architectures across multiple NLI datasets shows that this semantic sensitivity is pervasive and independent of model size or training domain.

  • The study finds that distilled models are more sensitive to semantic variation than their larger counterparts, suggesting a potential loss in the understanding of compositional semantics during the distillation process.

  • The research highlights inconsistencies in predictions and fluctuating confidence levels in NLI models, leading to questions about their reliability in tasks that require nuanced understanding of semantic structures.

Introduction

Transformer-based Language Models (LMs) have shifted the landscape of Natural Language Understanding (NLU), with performance benchmarks suggesting a high capability for syntactic, logical, and semantic comprehension. This paper presents evidence that such claims may be overstated, as state-of-the-art Natural Language Inference (NLI) models demonstrate significant sensitivity to minor, semantics-preserving variations in surface form. This suggests that the models' apparent deep comprehension of compositional semantics may be an illusion created by strong performance on standard benchmarks rather than evidence of genuine understanding.

Semantic Sensitivity of NLI Models

The study introduces a systematic framework that measures semantic sensitivity by using LLMs to generate minor variations of hypothesis statements that preserve semantic equivalence. When these generated statements are evaluated against the original premise, the models' predictions change significantly, even though the models had correctly identified the relation between the premise and the original hypothesis. Strikingly, model performance degrades by an average of 12.92% in the in-domain setting and 23.71% in the out-of-domain setting.
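
A rough sketch of how a single consistency check could be wired up is given below, reusing `top_prediction` and `is_symmetric_equivalent` from the earlier snippet; `generate_variation` is a hypothetical placeholder for the LLM-based conditional generation step, not the paper's prompt or generator.

```python
# Sketch of the per-example consistency check. `generate_variation` stands in
# for the LLM that produces a minor, semantics-preserving rewrite of the
# hypothesis; its body here is a hypothetical placeholder.
from typing import Optional

def generate_variation(hypothesis: str) -> str:
    # e.g. prompt an LLM: "Rewrite the sentence with minimal surface changes
    # while preserving its exact meaning: <hypothesis>"
    raise NotImplementedError("placeholder for conditional text generation")

def is_prediction_consistent(clf, premise: str, hypothesis: str) -> Optional[bool]:
    """True/False if the label on the variation matches/differs from the original;
    None if the variation fails the symmetric-equivalence filter."""
    variation = generate_variation(hypothesis)
    if not is_symmetric_equivalent(clf, hypothesis, variation):
        return None  # the model itself does not treat the rewrite as equivalent
    original_label = top_prediction(clf, premise, hypothesis)["label"]
    varied_label = top_prediction(clf, premise, variation)["label"]
    return original_label == varied_label
```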

Investigating Model Performance Across Datasets and Architectures

The study investigates a spectrum of transformer architectures, including RoBERTa, BART, DeBERTa, and DistilBart, across multiple NLI datasets. The findings point to a pervasive semantic sensitivity that is apparently independent of model size or training domain. Interestingly, when distilled models are compared with their larger counterparts, the distilled versions exhibit higher sensitivity to semantic variation, suggesting that knowledge of compositional semantics is not robustly transferred during distillation.
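
A sweep over models and datasets could look roughly like the sketch below, which reuses the helpers above and measures how often the predicted label flips on accepted variations; the checkpoints, datasets, splits, and sample size are illustrative assumptions and may not match the paper's exact experimental setup.

```python
# Sketch of a model x dataset sweep measuring the label-flip rate on accepted
# variations. Checkpoint and dataset identifiers are illustrative assumptions.
from datasets import load_dataset
from transformers import pipeline

CHECKPOINTS = [
    "roberta-large-mnli",
    "facebook/bart-large-mnli",
    "microsoft/deberta-large-mnli",
    "valhalla/distilbart-mnli-12-3",  # distilled variant
]
DATASETS = {"multi_nli": "validation_matched", "snli": "validation"}

def flip_rate(clf, dataset_name: str, split: str, n: int = 200) -> float:
    """Fraction of accepted variations on which the predicted label changes."""
    data = load_dataset(dataset_name, split=split).select(range(n))
    flips = total = 0
    for ex in data:
        outcome = is_prediction_consistent(clf, ex["premise"], ex["hypothesis"])
        if outcome is None:
            continue  # variation rejected by the symmetric-equivalence filter
        total += 1
        flips += int(not outcome)
    return flips / max(total, 1)

for name in CHECKPOINTS:
    clf = pipeline("text-classification", model=name)
    for dataset_name, split in DATASETS.items():
        print(f"{name} | {dataset_name}: flip rate = {flip_rate(clf, dataset_name, split):.3f}")
```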

Impact on Predictive Consistency and Implications

Further analysis indicates that semantic sensitivity leads not only to performance degradation but also to inconsistencies within predictions. Evaluations show that models exhibit fluctuating confidence and a tendency to make contradictory decisions when faced with semantically equivalent variations. This undermines the models' robustness and calls into question their reliability for tasks that require an understanding of nuanced semantic structure.
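
One way to quantify both effects, sketched under the same assumptions as the earlier snippets, is to record whether the label flips on the accepted variation and how much probability mass the model loses on its originally predicted class.

```python
# Sketch of per-example inconsistency measures: label flip plus the drop in
# probability assigned to the originally predicted class on the variation.
def score_for_label(clf, premise: str, hypothesis: str, label: str) -> float:
    """Probability the model assigns to `label` for this (premise, hypothesis) pair."""
    scores = clf({"text": premise, "text_pair": hypothesis}, top_k=None)
    if scores and isinstance(scores[0], list):  # some pipeline versions nest per input
        scores = scores[0]
    return next(s["score"] for s in scores if s["label"] == label)

def prediction_shift(clf, premise: str, hypothesis: str, variation: str) -> dict:
    """Label-flip indicator and confidence drop on the originally predicted class."""
    original = top_prediction(clf, premise, hypothesis)
    varied_label = top_prediction(clf, premise, variation)["label"]
    varied_confidence = score_for_label(clf, premise, variation, original["label"])
    return {
        "label_flip": original["label"] != varied_label,
        "confidence_drop": original["score"] - varied_confidence,
    }
```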

Conclusion

This research positions itself as a critical reflection on the presumed comprehension abilities of transformer-based NLI models. While the models excel on standard benchmarks, their grasp of semantic subtleties proves to be more ambiguous and less robust than previously thought. The study is a call for more rigorous evaluation methods that probe the finer points of language comprehension, beyond the blunt instruments of current benchmarks, to truly ascertain the semantic capabilities of LMs.
