Post-hoc Interpretability for Neural NLP: A Survey (2108.04840v5)

Published 10 Aug 2021 in cs.CL, cs.LG, and cs.NE

Abstract: Neural networks for NLP are becoming increasingly complex and widespread, and there is a growing concern if these models are responsible to use. Explaining models helps to address the safety and ethical concerns and is essential for accountability. Interpretability serves to provide these explanations in terms that are understandable to humans. Additionally, post-hoc methods provide explanations after a model is learned and are generally model-agnostic. This survey provides a categorization of how recent post-hoc interpretability methods communicate explanations to humans, it discusses each method in-depth, and how they are validated, as the latter is often a common concern.

Citations (201)

Summary

  • The paper presents a comprehensive survey categorizing local, class, and global interpretability methods for neural NLP.
  • It reviews post-hoc techniques such as LIME, SHAP, and SP-LIME, along with how the fidelity of their explanations is validated.
  • The study underscores the need for balanced human-grounded and functional evaluations to advance transparent and accountable AI.

Post-hoc Interpretability for Neural NLP: A Survey

In their paper titled "Post-hoc Interpretability for Neural NLP: A Survey," Andreas Madsen et al. provide a comprehensive examination of various methods for post-hoc interpretability in neural NLP. As neural networks become integral to NLP, the need for interpretable models is increasingly crucial due to concerns regarding model accountability, safety, and ethical use in decision-making.

Overview of Post-hoc Interpretability

The survey categorizes interpretability methods based on how they communicate explanations and the degree of abstraction involved. The primary focus is on model-agnostic, post-hoc techniques that provide explanations after model training, contrasting with intrinsic methods where models are designed to be interpretable by nature.

Key Methods and Their Classifications

  1. Local Explanations: These methods focus on explaining individual predictions.
    • Input Features: Methods like LIME and SHAP estimate the importance of input features in determining a model prediction. They differ mainly in how these importances are computed, with LIME fitting a local surrogate model and SHAP allocating importance via Shapley values for a more theoretically grounded attribution (see the sketch after this list).
    • Adversarial Examples: Techniques such as HotFlip explore how model predictions change with minimal perturbations to the input, offering insights into model robustness against adversarial attacks.
  2. Class Explanations: These methods offer explanations regarding specific output classes.
    • Concepts: The Natural Indirect Effect (NIE) uses causal mediation analysis to quantify how latent representations mediate specific concepts, such as gender bias, in a model's predictions.
  3. Global Explanations: Aimed at summarizing model behaviors across the entire dataset.
    • Vocabulary: Techniques like projecting word embeddings facilitate understanding of how semantic relationships are encoded.
    • Ensemble: Submodular Pick LIME (SP-LIME) selects a small, diverse set of representative local explanations to give a global picture of the model's behavior across the dataset.
    • Linguistic Information: Probing tasks ascertain the extent to which neural models capture syntactic and semantic features, providing a lens into internal language understanding mechanisms.
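
To make the input-feature methods concrete, the following is a minimal, self-contained sketch of a LIME-style local surrogate (not the official lime library and not the survey's own code): it masks random subsets of tokens, queries a black-box classifier, and fits a weighted linear model whose coefficients serve as token importances. The predict_proba interface, the token-masking scheme, and the ridge surrogate with an exponential proximity kernel are all illustrative assumptions.

```python
import numpy as np
from sklearn.linear_model import Ridge

def lime_style_importances(tokens, predict_proba, target_class,
                           num_samples=1000, kernel_width=0.75, seed=0):
    """Per-token importances for a single prediction via a local linear
    surrogate, in the spirit of LIME. `predict_proba` is an assumed
    black-box interface: list of strings -> array of class probabilities."""
    rng = np.random.default_rng(seed)
    d = len(tokens)

    # Interpretable binary representation: 1 = token kept, 0 = token masked.
    masks = rng.integers(0, 2, size=(num_samples, d))
    masks[0] = 1  # keep the unperturbed instance in the sample set

    texts = [" ".join(t for t, keep in zip(tokens, row) if keep) for row in masks]
    probs = np.asarray(predict_proba(texts))[:, target_class]

    # Weight perturbed samples by proximity to the original instance.
    distances = 1.0 - masks.sum(axis=1) / d
    weights = np.exp(-(distances ** 2) / kernel_width ** 2)

    # Coefficients of the weighted linear surrogate act as importances.
    surrogate = Ridge(alpha=1.0)
    surrogate.fit(masks, probs, sample_weight=weights)
    return list(zip(tokens, surrogate.coef_))
```

Calling this with, say, tokens = "the film was not good".split() and a sentiment classifier's predict_proba yields signed per-token weights. SHAP differs chiefly in distributing the prediction among tokens according to Shapley values rather than fitting one weighted regression, which gives stronger theoretical guarantees at higher computational cost.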

Evaluation of Interpretability Methods

The paper emphasizes the importance of assessing interpretability methods on two fronts: human-groundedness—how well explanations support human understanding—and functional-groundedness—the fidelity of explanations in reflecting the model's actual operations. It is noted that despite increasing research, there is limited consensus on standard evaluation metrics, particularly for human-groundedness, which calls for more robust user-centered studies.
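
To make the functional-groundedness side concrete, a common style of check in the faithfulness literature is an erasure test: delete the tokens an explanation deems most important and verify that the prediction degrades more than it does under random deletion. The sketch below is an illustrative implementation under assumed interfaces (predict_proba over lists of strings, a token-aligned importances list), not a protocol prescribed by the survey.

```python
import numpy as np

def erasure_faithfulness(tokens, importances, predict_proba, target_class,
                         k=3, seed=0):
    """Crude functional-groundedness check: delete the k tokens an explanation
    ranks highest and measure the drop in predicted probability, compared with
    a random-deletion baseline. `predict_proba` is an assumed black-box
    interface (list of strings -> array of class probabilities)."""
    score = lambda toks: np.asarray(predict_proba([" ".join(toks)]))[0, target_class]
    original = score(tokens)

    # Deletion guided by the explanation's ranking.
    top_k = set(sorted(range(len(tokens)),
                       key=lambda i: importances[i], reverse=True)[:k])
    guided_drop = original - score([t for i, t in enumerate(tokens) if i not in top_k])

    # Random-deletion baseline: a faithful explanation should beat this.
    rng = np.random.default_rng(seed)
    rand_k = set(rng.choice(len(tokens), size=k, replace=False).tolist())
    random_drop = original - score([t for i, t in enumerate(tokens) if i not in rand_k])

    return {"guided_drop": float(guided_drop), "random_drop": float(random_drop)}
```

A guided_drop that consistently exceeds random_drop is weak evidence that an explanation tracks what the model actually relies on; human-groundedness, by contrast, can only be established through user studies.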

Implications and Future Directions

The survey highlights that while many methods exist, their applicability and effectiveness can vastly differ. As neural NLP models become more complex, bridging technical advancements with user-centered interpretability will be paramount. The authors suggest further research into creating more holistic interpretability benchmarks that encompass diverse linguistic phenomena and address challenges like model biases and ethical compliance.

Given the increasing deployment of neural models in sensitive applications, the research underlines the necessity of developing explanations that both satisfy technical rigor and are accessible to non-expert stakeholders. Future work could also explore the integration of intrinsic and post-hoc methods to enrich understanding while maintaining adaptability across varying contexts.

This survey lays a foundation for interpretability research in neural NLP, advocating concerted efforts toward more transparent and accountable language technologies.
