Explaining NLP Models via Minimal Contrastive Editing (MiCE) (2012.13985v2)

Published 27 Dec 2020 in cs.CL and cs.AI

Abstract: Humans have been shown to give contrastive explanations, which explain why an observed event happened rather than some other counterfactual event (the contrast case). Despite the influential role that contrastivity plays in how humans explain, this property is largely missing from current methods for explaining NLP models. We present Minimal Contrastive Editing (MiCE), a method for producing contrastive explanations of model predictions in the form of edits to inputs that change model outputs to the contrast case. Our experiments across three tasks--binary sentiment classification, topic classification, and multiple-choice question answering--show that MiCE is able to produce edits that are not only contrastive, but also minimal and fluent, consistent with human contrastive edits. We demonstrate how MiCE edits can be used for two use cases in NLP system development--debugging incorrect model outputs and uncovering dataset artifacts--and thereby illustrate that producing contrastive explanations is a promising research direction for model interpretability.

Summary

The paper introduces Minimal Contrastive Editing (MiCE), a novel two-stage method that generates explanations for NLP models by finding minimal text edits that change the model's prediction.
Experiments across sentiment, topic, and QA tasks show MiCE achieves high prediction flip rates with minimal and fluent edits, demonstrating its effectiveness in generating contrastive explanations.
MiCE can help debug models and uncover dataset biases by revealing which minimal input changes affect outcomes, providing insights beyond standard interpretability methods.

Minimal Contrastive Editing for Explaining NLP Models

Natural language processing (NLP) models have demonstrated remarkable capabilities in various tasks, yet their interpretability remains a pressing concern in the AI community. The paper titled "Explaining NLP Models via Minimal Contrastive Editing (MiCE)" introduces a novel approach for enhancing model interpretability through minimal contrastive edits. These edits modify input instances just enough to alter the model's prediction, providing insights into the model's decision-making process.

Core Concept and Methodology

The authors draw on findings from cognitive science that human explanations are inherently contrastive: people explain why an event happened rather than some alternative. Although this pattern is central to how humans explain, it is mostly absent from existing NLP interpretability methods. MiCE addresses this gap by creating contrastive explanations in the form of minimal input modifications that shift a model's output from an original prediction to a specified contrast prediction.

MiCE is a two-stage process:

Editor Fine-tuning: The first stage fine-tunes a Text-to-Text Transfer Transformer (T5) model, termed the Editor, to fill in masked spans of an input conditioned on a target label, so that the text it generates steers the input toward that label.
Editing and Contrastive Explanation: In the second stage, gradient-based attribution on the model being explained identifies the tokens that contribute most to its prediction. These tokens are masked, and the fine-tuned Editor fills them in conditioned on the contrast label. A binary search over the fraction of tokens masked, combined with beam search over candidate edits, looks for the smallest edit that produces the desired contrast prediction.
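The second stage can be illustrated with a small sketch. This is not the authors' implementation: it assumes token importance scores have already been computed (for example from gradients), omits beam search, and uses hypothetical names (mask_top_tokens, find_small_edit, max_fraction, rounds) along with toy stand-ins for the Editor and the model being explained. Masked tokens are written as T5-style sentinel tokens, and a binary search over the masked fraction looks for a small edit that still flips the prediction.

```python
from typing import Callable, List, Optional, Sequence


def mask_top_tokens(tokens: Sequence[str], scores: Sequence[float],
                    fraction: float) -> List[str]:
    """Mask the highest-scoring tokens; adjacent masked tokens share one sentinel."""
    n_mask = max(1, round(fraction * len(tokens)))
    ranked = sorted(range(len(tokens)), key=lambda i: scores[i], reverse=True)
    to_mask = set(ranked[:n_mask])
    out: List[str] = []
    sentinel = 0
    prev_masked = False
    for i, tok in enumerate(tokens):
        if i in to_mask:
            if not prev_masked:
                out.append(f"<extra_id_{sentinel}>")  # T5-style sentinel token
                sentinel += 1
            prev_masked = True
        else:
            out.append(tok)
            prev_masked = False
    return out


def find_small_edit(tokens: Sequence[str], scores: Sequence[float],
                    contrast_label: str,
                    editor: Callable[[str, Sequence[str]], List[str]],
                    predictor: Callable[[str], str],
                    max_fraction: float = 0.5,  # hypothetical search bound
                    rounds: int = 4) -> Optional[str]:
    """Binary search over the masked fraction for an edit that flips the label."""
    low, high = 0.0, max_fraction
    best: Optional[str] = None
    for _ in range(rounds):
        fraction = (low + high) / 2
        masked = mask_top_tokens(tokens, scores, fraction)
        flipped = [c for c in editor(contrast_label, masked)
                   if predictor(c) == contrast_label]
        if flipped:
            best, high = flipped[0], fraction  # success: try masking less
        else:
            low = fraction                     # failure: mask more
    return best


# Toy stand-ins (assumptions for illustration, not real models).
def toy_predictor(text: str) -> str:
    return "negative" if {"terrible", "boring"} & set(text.split()) else "positive"


def toy_editor(label: str, masked: Sequence[str]) -> List[str]:
    fill = "great" if label == "positive" else "terrible"
    return [" ".join(fill if t.startswith("<extra_id_") else t for t in masked)]


tokens = "the movie was terrible and boring".split()
scores = [0.1, 0.2, 0.1, 0.9, 0.1, 0.6]  # assumed importance scores
edit = find_small_edit(tokens, scores, "positive", toy_editor, toy_predictor)
print(edit)  # the movie was great and great
```

In this toy run, masking only the single highest-scoring token leaves "boring" in place and the prediction stays negative, so the search settles on the two-token edit. A real Editor would return several candidates per masked input, which is where beam search would come in.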

Evaluation and Results

MiCE was evaluated on three tasks: binary sentiment classification (IMDB), topic classification (Newsgroups), and multiple-choice question answering (RACE). Across these datasets its edits are both minimal and fluent, and flip rates approach 100% on two of the three.
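Two of these quantities are easy to make concrete. The summary does not spell out the metric definitions, so the sketch below rests on assumptions: flip rate is taken to be the fraction of instances whose edited input receives the contrast label, and minimality is taken to be the word-level Levenshtein distance between original and edited text, divided by the number of words in the original. Fluency is not covered, and the function names are illustrative.

```python
from typing import Sequence


def word_edit_distance(a: Sequence[str], b: Sequence[str]) -> int:
    """Levenshtein distance over words (insert, delete, substitute cost 1)."""
    prev = list(range(len(b) + 1))
    for i, word_a in enumerate(a, 1):
        curr = [i]
        for j, word_b in enumerate(b, 1):
            cost = 0 if word_a == word_b else 1
            curr.append(min(prev[j] + 1,          # delete word_a
                            curr[j - 1] + 1,      # insert word_b
                            prev[j - 1] + cost))  # keep or substitute
        prev = curr
    return prev[-1]


def minimality(original: str, edited: str) -> float:
    """Edit distance normalized by original length (assumed definition)."""
    orig_words = original.split()
    return word_edit_distance(orig_words, edited.split()) / max(1, len(orig_words))


def flip_rate(predictions_after_edit: Sequence[str],
              contrast_labels: Sequence[str]) -> float:
    """Fraction of instances whose edited input gets the contrast label."""
    hits = sum(p == c for p, c in zip(predictions_after_edit, contrast_labels))
    return hits / len(contrast_labels)


score = minimality("the movie was terrible and boring",
                   "the movie was great and fun")
rate = flip_rate(["pos", "neg", "pos"], ["pos", "pos", "pos"])
print(round(score, 3), round(rate, 3))  # 0.333 0.667
```

Under these definitions a lower minimality score means a smaller edit, and a flip rate of 1.0 means every instance was successfully edited to the contrast label.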

The analysis further explored how MiCE edits compare to human-generated contrastive edits and demonstrated potential benefits in debugging model predictions and uncovering dataset artifacts, such as biases assimilated by models during training.

Implications and Future Directions

The implications of MiCE extend beyond explanations. By facilitating understanding of model behavior, MiCE can aid developers in debugging and enhancing model reliability. Its ability to reveal dataset artifacts underscores the broader utility of contrastive explanations in identifying and correcting biases within datasets and models.

Looking forward, one consideration is addressing MiCE's computational needs, notably in fine-tuning and iterative edit searching. Optimizing the efficiency of the search process remains an open challenge that could further enhance the method's applicability. Moreover, exploring the integration of MiCE with active learning strategies or model conditioning to refine outputs iteratively may offer valuable insights.

MiCE represents a promising step toward more interpretable NLP models, aligning computational explanations with human cognitive patterns. As interpretability gains prominence amid growing reliance on AI, methods such as MiCE that offer intuitive and user-centered explanations will be instrumental in bridging the gap between complex model outputs and accessible human understanding.

GitHub

GitHub - allenai/mice (26 stars)