A Diagnostic Study of Explainability Techniques for Text Classification (2009.13295v1)

Published 25 Sep 2020 in cs.CL and cs.LG

Abstract: Recent developments in machine learning have introduced models that approach human performance at the cost of increased architectural complexity. Efforts to make the rationales behind the models' predictions transparent have inspired an abundance of new explainability techniques. Provided with an already trained model, they compute saliency scores for the words of an input instance. However, there exists no definitive guide on (i) how to choose such a technique given a particular application task and model architecture, and (ii) the benefits and drawbacks of using each such technique. In this paper, we develop a comprehensive list of diagnostic properties for evaluating existing explainability techniques. We then employ the proposed list to compare a set of diverse explainability techniques on downstream text classification tasks and neural network architectures. We also compare the saliency scores assigned by the explainability techniques with human annotations of salient input regions to find relations between a model's performance and the agreement of its rationales with human ones. Overall, we find that the gradient-based explanations perform best across tasks and model architectures, and we present further insights into the properties of the reviewed explainability techniques.

Summary

The paper introduces a diagnostic framework to quantify and compare explainability techniques in text classification using measurable properties.
It empirically evaluates methods like Saliency, InputXGradient, and LIME across CNN, LSTM, and Transformer models to highlight performance differences.
The findings indicate that gradient-based methods score best on faithfulness and on agreement with human rationales, which supports their use in applications where interpretability matters.

Overview of A Diagnostic Study of Explainability Techniques for Text Classification

The paper "A Diagnostic Study of Explainability Techniques for Text Classification" provides a detailed analysis of explainability methods specifically in the context of text classification, examining their efficacy across various models and datasets. The authors, Atanasova et al., focus on producing a comprehensive list of diagnostic properties to evaluate these explainability techniques, thereby assessing their strengths and limitations when applied to selected machine learning models. This approach aims to inform the choice of appropriate techniques based on model architecture and application domain.

Key Contributions

Diagnostic Property Compilation: The authors present a thorough compilation of diagnostic properties for evaluating explainability techniques, ensuring these properties can be automatically measured for practical assessments. This benchmark goes beyond mere qualitative assessments and provides a quantifiable basis for comparison.
Empirical Evaluation Across Models and Tasks: The paper explores three different neural network architectures—Convolutional Neural Networks (CNNs), Long Short-Term Memory networks (LSTMs), and Transformers—across three NLP tasks. These tasks include text classification datasets with human-annotated rationales, enabling consistent comparisons of model performance and explanation quality.
Human-Agreement and Faithfulness: The evaluation measures rationale agreement, that is, how closely machine-generated saliency scores match human annotations of salient input regions. Additionally, the faithfulness of explanations is determined by how well they represent the model's true decision-making process.
Comprehensive Comparison: The paper compares model-agnostic and model-specific explainability methods, including Saliency, InputXGradient, Guided Backpropagation, Occlusion, Shapley Value Sampling (ShapSampl), and LIME. Gradient-based explainability methods consistently perform better across the evaluated tasks and models.
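The human-agreement measurement described above can be illustrated with a small sketch. It assumes each instance comes with one saliency score per token and a binary human rationale mask, and it scores the saliency ranking with average precision. The function names `average_precision` and `mean_average_precision` are hypothetical and are not taken from the paper's code.

```python
def average_precision(saliency, human_mask):
    """Average precision of a saliency ranking against a binary human mask.

    saliency:   one float per token (higher means more important).
    human_mask: one 0/1 flag per token (1 means a human marked it salient).
    """
    if len(saliency) != len(human_mask):
        raise ValueError("saliency and human_mask must have the same length")
    # Rank token positions from most to least salient.
    ranked = sorted(range(len(saliency)), key=lambda i: saliency[i], reverse=True)
    hits, precisions = 0, []
    for rank, idx in enumerate(ranked, start=1):
        if human_mask[idx]:
            hits += 1
            precisions.append(hits / rank)  # precision at this recall point
    # Instances without any human-marked token contribute 0.0 (an assumption).
    return sum(precisions) / len(precisions) if precisions else 0.0


def mean_average_precision(instances):
    """Mean of average_precision over (saliency, human_mask) pairs."""
    scores = [average_precision(s, m) for s, m in instances]
    return sum(scores) / len(scores)


if __name__ == "__main__":
    # Illustrative numbers only, not results from the paper.
    saliency = [0.9, 0.1, 0.8, 0.2]
    human_mask = [1, 0, 0, 1]
    print(average_precision(saliency, human_mask))
```

A technique whose top-ranked tokens coincide with the human-marked tokens scores close to 1.0, while one that ranks them last scores much lower.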

Numerical Insights and Implications

The paper finds that gradient-based explanation methods achieve the best overall results on the diagnostic properties across the datasets, which suggests that gradient-derived explanations track the model's decisions closely. Saliency and InputXGradient, particularly with L2 norm aggregation, perform strongly both in agreement with human rationales (measured by Mean Average Precision) and in indicating the model's confidence (measured by Mean Absolute Error).
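To make the aggregation step concrete, here is a minimal sketch of how per-token gradient vectors might be reduced to one saliency score per token. It assumes the gradients with respect to each token embedding have already been computed by an autodiff framework; the helper names (`l2_aggregate`, `mean_aggregate`, `input_x_gradient`) are hypothetical and chosen only for illustration.

```python
import math


def input_x_gradient(embeddings, gradients):
    """Element-wise product of each token embedding with its gradient."""
    return [[e * g for e, g in zip(emb, grad)]
            for emb, grad in zip(embeddings, gradients)]


def l2_aggregate(vectors):
    """One score per token: the L2 norm of that token's attribution vector."""
    return [math.sqrt(sum(v * v for v in vec)) for vec in vectors]


def mean_aggregate(vectors):
    """One score per token: the mean of that token's attribution vector."""
    return [sum(vec) / len(vec) for vec in vectors]


if __name__ == "__main__":
    # Toy 2-dimensional embeddings and gradients for a 2-token input
    # (made-up numbers, not taken from the paper).
    embeddings = [[2.0, 1.0], [0.5, 0.5]]
    gradients = [[0.5, -1.0], [0.1, 0.1]]

    attributions = input_x_gradient(embeddings, gradients)
    print("L2:  ", l2_aggregate(attributions))
    print("mean:", mean_aggregate(attributions))
```

In this toy input the first token's attribution vector has components of opposite sign, so mean aggregation cancels them to zero while the L2 norm still reports a large magnitude. That is one plausible reason the choice of aggregation matters, though the sketch does not reproduce the paper's experiments.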

Conversely, the paper notes that perturbation-based methods like LIME and ShapSampl offer better insights into model confidence, though at significant computational expense. The results suggest that the transparency of the model’s rationales varies with the complexity of the architecture, indicating better performance of explainability methods in simpler, less entangled models.
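The computational cost of perturbation-based methods is easiest to see with Occlusion, the simplest technique of that family among those compared. The sketch below is an illustration under assumptions: `toy_predict_proba` is a hypothetical stand-in for a trained classifier, and `occlusion_saliency` replaces one token at a time with a mask token and records the drop in the target-class probability. It is not an implementation of LIME or ShapSampl, which draw many more perturbed samples.

```python
import math

# Hypothetical toy sentiment lexicon standing in for a trained model.
POSITIVE = {"great", "good"}
NEGATIVE = {"bad", "boring"}


def toy_predict_proba(tokens):
    """Return [p_negative, p_positive] for a list of tokens."""
    score = sum((t in POSITIVE) - (t in NEGATIVE) for t in tokens)
    p_pos = 1.0 / (1.0 + math.exp(-score))
    return [1.0 - p_pos, p_pos]


def occlusion_saliency(tokens, predict_proba, target, mask_token="[MASK]"):
    """Saliency of token i = drop in target probability when i is masked.

    Needs len(tokens) + 1 model calls: one baseline plus one per token.
    """
    base = predict_proba(tokens)[target]
    scores = []
    for i in range(len(tokens)):
        occluded = tokens[:i] + [mask_token] + tokens[i + 1:]
        scores.append(base - predict_proba(occluded)[target])
    return scores


if __name__ == "__main__":
    tokens = ["a", "great", "movie"]
    for token, score in zip(tokens, occlusion_saliency(tokens, toy_predict_proba, target=1)):
        print(f"{token}: {score:.3f}")
```

Because every token requires its own forward pass, the cost grows with input length, whereas a gradient-based score needs roughly one forward and one backward pass per instance.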

Future Prospects and Applications

The analysis underlines the necessity for future improvements in explainability methods, particularly those that can retain high fidelity while offering computational efficiency. The findings hold significant implications for trustworthy AI applications in sensitive domains like healthcare, where interpretability is non-negotiable.

Researchers are encouraged to leverage these diagnostic properties to refine existing models or develop new ones, especially considering the burgeoning demand for interpretable AI solutions within regulatory frameworks mandating explanation transparency.

Overall, the paper provides a valuable resource for the research community: it brings methodological rigor to the assessment of explainability methods and contributes to the ongoing discourse on model interpretability within AI. These insights can make it more practical to deploy machine learning models in domains where understanding decision pathways is as important as the decisions themselves.
