Semantic Sensitivities and Inconsistent Predictions: Measuring the Fragility of NLI Models

(2401.14440)
Published Jan 25, 2024 in cs.CL, cs.AI, cs.CY, and cs.LG

Abstract

Recent studies of the emergent capabilities of transformer-based Natural Language Understanding (NLU) models have indicated that they have an understanding of lexical and compositional semantics. We provide evidence that suggests these claims should be taken with a grain of salt: we find that state-of-the-art Natural Language Inference (NLI) models are sensitive to minor semantics-preserving surface-form variations, which lead to sizable inconsistent model decisions during inference. Notably, this behaviour differs from valid and in-depth comprehension of compositional semantics, yet it emerges neither when evaluating model accuracy on standard benchmarks nor when probing for syntactic, monotonic, and logically robust reasoning. We propose a novel framework to measure the extent of semantic sensitivity. To this end, we evaluate NLI models on adversarially generated examples containing minor semantics-preserving surface-form input noise. This is achieved using conditional text generation, with the explicit condition that the NLI model predicts the relationship between the original and adversarial inputs as a symmetric equivalence entailment. We systematically study the effects of the phenomenon across NLI models in both in-domain and out-of-domain settings. Our experiments show that semantic sensitivity causes average performance degradations of 12.92% and 23.71% in the in-domain and out-of-domain settings, respectively. We further perform ablation studies analysing this phenomenon across models, datasets, and variations in inference, and show that semantic sensitivity can lead to major inconsistencies within model predictions.
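
To make the symmetric equivalence condition concrete, the following is a minimal sketch of a bidirectional entailment filter built on an off-the-shelf NLI pipeline; the `roberta-large-mnli` checkpoint and the helper names are illustrative assumptions, not the paper's actual implementation.

```python
# Minimal sketch (not the paper's code): a surface-form variation is accepted
# only if the NLI model predicts entailment in both directions between the
# original hypothesis and the variation. The checkpoint is an assumption.
from transformers import pipeline

nli = pipeline("text-classification", model="roberta-large-mnli")

def top_prediction(clf, premise: str, hypothesis: str) -> dict:
    """Top label and score for an ordered (premise, hypothesis) pair."""
    out = clf({"text": premise, "text_pair": hypothesis})
    return out[0] if isinstance(out, list) else out  # return shape varies by version

def is_symmetric_equivalent(clf, original: str, variation: str) -> bool:
    """Keep a variation only if entailment holds in both directions."""
    forward = top_prediction(clf, original, variation)["label"]
    backward = top_prediction(clf, variation, original)["label"]
    return forward.upper().startswith("ENTAIL") and backward.upper().startswith("ENTAIL")
```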

Overview

  • The paper challenges the perceived comprehension abilities of transformer-based Natural Language Inference models, revealing significant sensitivity to variations in text that maintain semantic meaning.

  • A systematic framework measures semantic sensitivity by using LLMs to generate minor, meaning-preserving variations of hypothesis statements, demonstrating performance degradation when models evaluate these variations.

  • An examination of a variety of transformer architectures across multiple NLI datasets shows that this semantic sensitivity is pervasive and independent of model size or training domain.

  • The study finds that distilled models are more sensitive to semantic variation than their larger counterparts, suggesting a potential loss in the understanding of compositional semantics during the distillation process.

  • The research highlights inconsistencies in predictions and fluctuating confidence levels in NLI models, leading to questions about their reliability in tasks that require nuanced understanding of semantic structures.

Introduction

Transformer-based Language Models (LMs) have shifted the landscape of Natural Language Understanding (NLU), with performance benchmarks suggesting a high capability for syntactic, logical, and semantic comprehension. This paper presents evidence that such claims may be overstated, as state-of-the-art Natural Language Inference (NLI) models demonstrate significant sensitivity to minor, semantics-preserving variations in surface form. This suggests that the models' apparent deep comprehension of compositional semantics may be an illusion created by strong performance on standard benchmarks rather than evidence of genuine understanding.

Semantic Sensitivity of NLI Models

The study introduces a systematic framework that measures semantic sensitivity by using LLMs to generate minor variations of hypothesis statements that preserve semantic equivalence. When these generated statements are evaluated against the original premise, the models' predictions change significantly, even though the models had correctly identified the relation between the premise and the original hypothesis. Strikingly, model performance degrades by an average of 12.92% in the in-domain setting and 23.71% in the out-of-domain setting.
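
A rough sketch of how a single consistency check could be wired up is given below, reusing `top_prediction` and `is_symmetric_equivalent` from the earlier snippet; `generate_variation` is a hypothetical placeholder for the LLM-based conditional generation step, not the paper's prompt or generator.

```python
# Sketch of the per-example consistency check. `generate_variation` stands in
# for the LLM that produces a minor, semantics-preserving rewrite of the
# hypothesis; its body here is a hypothetical placeholder.
from typing import Optional

def generate_variation(hypothesis: str) -> str:
    # e.g. prompt an LLM: "Rewrite the sentence with minimal surface changes
    # while preserving its exact meaning: <hypothesis>"
    raise NotImplementedError("placeholder for conditional text generation")

def is_prediction_consistent(clf, premise: str, hypothesis: str) -> Optional[bool]:
    """True/False if the label on the variation matches/differs from the original;
    None if the variation fails the symmetric-equivalence filter."""
    variation = generate_variation(hypothesis)
    if not is_symmetric_equivalent(clf, hypothesis, variation):
        return None  # the model itself does not treat the rewrite as equivalent
    original_label = top_prediction(clf, premise, hypothesis)["label"]
    varied_label = top_prediction(clf, premise, variation)["label"]
    return original_label == varied_label
```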

Investigating Model Performance Across Datasets and Architectures

The study investigates a spectrum of transformer architectures, including RoBERTa, BART, DeBERTa, and DistilBart, across multiple NLI datasets. The findings point to a pervasive semantic sensitivity that is apparently independent of model size or training domain. Interestingly, when distilled models are compared with their larger counterparts, the distilled versions exhibit higher sensitivity to semantic variation, suggesting that knowledge of compositional semantics is not robustly transferred during distillation.
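
A sweep over models and datasets could look roughly like the sketch below, which reuses the helpers above and measures how often the predicted label flips on accepted variations; the checkpoints, datasets, splits, and sample size are illustrative assumptions and may not match the paper's exact experimental setup.

```python
# Sketch of a model x dataset sweep measuring the label-flip rate on accepted
# variations. Checkpoint and dataset identifiers are illustrative assumptions.
from datasets import load_dataset
from transformers import pipeline

CHECKPOINTS = [
    "roberta-large-mnli",
    "facebook/bart-large-mnli",
    "microsoft/deberta-large-mnli",
    "valhalla/distilbart-mnli-12-3",  # distilled variant
]
DATASETS = {"multi_nli": "validation_matched", "snli": "validation"}

def flip_rate(clf, dataset_name: str, split: str, n: int = 200) -> float:
    """Fraction of accepted variations on which the predicted label changes."""
    data = load_dataset(dataset_name, split=split).select(range(n))
    flips = total = 0
    for ex in data:
        outcome = is_prediction_consistent(clf, ex["premise"], ex["hypothesis"])
        if outcome is None:
            continue  # variation rejected by the symmetric-equivalence filter
        total += 1
        flips += int(not outcome)
    return flips / max(total, 1)

for name in CHECKPOINTS:
    clf = pipeline("text-classification", model=name)
    for dataset_name, split in DATASETS.items():
        print(f"{name} | {dataset_name}: flip rate = {flip_rate(clf, dataset_name, split):.3f}")
```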

Impact on Predictive Consistency and Implications

Further analysis indicates that semantic sensitivity leads not only to performance degradation but also to inconsistencies within predictions. Evaluations show that models exhibit fluctuating confidence and a tendency to make contradictory decisions when faced with semantically equivalent variations. This undermines the models' robustness and calls into question their reliability for tasks that require an understanding of nuanced semantic structure.
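
One way to quantify both effects, sketched under the same assumptions as the earlier snippets, is to record whether the label flips on the accepted variation and how much probability mass the model loses on its originally predicted class.

```python
# Sketch of per-example inconsistency measures: label flip plus the drop in
# probability assigned to the originally predicted class on the variation.
def score_for_label(clf, premise: str, hypothesis: str, label: str) -> float:
    """Probability the model assigns to `label` for this (premise, hypothesis) pair."""
    scores = clf({"text": premise, "text_pair": hypothesis}, top_k=None)
    if scores and isinstance(scores[0], list):  # some pipeline versions nest per input
        scores = scores[0]
    return next(s["score"] for s in scores if s["label"] == label)

def prediction_shift(clf, premise: str, hypothesis: str, variation: str) -> dict:
    """Label-flip indicator and confidence drop on the originally predicted class."""
    original = top_prediction(clf, premise, hypothesis)
    varied_label = top_prediction(clf, premise, variation)["label"]
    varied_confidence = score_for_label(clf, premise, variation, original["label"])
    return {
        "label_flip": original["label"] != varied_label,
        "confidence_drop": original["score"] - varied_confidence,
    }
```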

Conclusion

This research positions itself as a critical reflection on the presumed comprehension abilities of transformer-based NLI models. While the models excel on standard benchmarks, their grasp of semantic subtleties proves to be more ambiguous and less robust than previously thought. The study is a call for more rigorous evaluation methods that probe the finer points of language comprehension, beyond the blunt instruments of current benchmarks, to truly ascertain the semantic capabilities of LMs.
