Interpretable Deep Learning: Interpretation, Interpretability, Trustworthiness, and Beyond (2103.10689v3)

Published 19 Mar 2021 in cs.LG

Abstract: Deep neural networks have been well-known for their superb handling of various machine learning and artificial intelligence tasks. However, due to their over-parameterized black-box nature, it is often difficult to understand the prediction results of deep models. In recent years, many interpretation tools have been proposed to explain or reveal how deep models make decisions. In this paper, we review this line of research and try to make a comprehensive survey. Specifically, we first introduce and clarify two basic concepts -- interpretations and interpretability -- that people usually get confused about. To address the research efforts in interpretations, we elaborate the designs of a number of interpretation algorithms, from different perspectives, by proposing a new taxonomy. Then, to understand the interpretation results, we also survey the performance metrics for evaluating interpretation algorithms. Further, we summarize the current works in evaluating models' interpretability using "trustworthy" interpretation algorithms. Finally, we review and discuss the connections between deep models' interpretations and other factors, such as adversarial robustness and learning from interpretations, and we introduce several open-source libraries for interpretation algorithms and evaluation approaches.

Citations (258)


Summary

The paper presents a comprehensive survey that clarifies the distinct concepts of interpretation and interpretability in deep learning.
It introduces a taxonomy categorizing interpretation algorithms by representation, model type, and their relation to the model.
The survey reviews methods for evaluating interpretation algorithms, such as perturbation-based tests and Benchmarking Attribution Methods (BAM), which check whether the interpretations themselves can be trusted.

Overview of "Interpretable Deep Learning: Interpretation, Interpretability, Trustworthiness, and Beyond"

The paper "Interpretable Deep Learning: Interpretation, Interpretability, Trustworthiness, and Beyond" offers a comprehensive survey of the state of research in the field of interpretable deep learning. The authors systematically review the diverse methods developed for interpreting deep learning models, elucidating the core concepts and the existing tools. The paper aims to address the "black-box" problem associated with deep neural networks and the difficulty of understanding their prediction results.

Clarification of Core Concepts

The authors initiate the discussion by distinguishing between the often-confused terms: "interpretations" and "interpretability". Interpretations refer to the specific insights or explanations produced by interpretation algorithms about how deep models reach decisions. In contrast, interpretability is a model's inherent property that indicates how understandable the model's inferences are to humans. The paper further introduces a taxonomy to classify interpretation algorithms based on different dimensions, such as representations, targeted model types, and their relations to the models.

Taxonomy and Evaluation Criteria

The proposed taxonomy includes three dimensions:

Representation of Interpretations: This includes input feature importance, model responses in specific scenarios, model rationale processes, and analyses of datasets.
Model Type: This dimension classifies whether an interpretation algorithm is model-agnostic or tailored to specific architectures, like CNNs or GANs.
Relation between Interpretation and Model: This assesses whether the algorithm generates explanations via direct composition, reliance on closed-form solutions, dependency on model specifics, or through proxy models.
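
As a rough illustration of the first two dimensions, the sketch below computes input feature importance in a model-agnostic way: it only calls the model as a black box and measures how the output changes when one feature is replaced by a baseline value. This is a generic occlusion-style sketch, not an algorithm taken from the paper; `occlusion_importance`, `toy_model`, and the zero baseline are hypothetical names and choices made for the example.

```python
from typing import Callable, List, Sequence


def occlusion_importance(
    model: Callable[[Sequence[float]], float],
    x: Sequence[float],
    baseline: float = 0.0,
) -> List[float]:
    """Score each feature by the output drop when it is replaced by `baseline`."""
    reference = model(x)
    scores = []
    for i in range(len(x)):
        perturbed = list(x)          # copy so the original input is untouched
        perturbed[i] = baseline      # occlude a single feature
        scores.append(reference - model(perturbed))
    return scores


def toy_model(x: Sequence[float]) -> float:
    """Hypothetical stand-in for a trained model: a fixed linear scorer."""
    weights = [2.0, -1.0, 0.5]
    return sum(w * v for w, v in zip(weights, x))


scores = occlusion_importance(toy_model, [1.0, 1.0, 1.0])
print(scores)  # [2.0, -1.0, 0.5]
```

Because the function never inspects the model's internals, the same code applies to any callable that maps an input to a score. That is what "model-agnostic" means in the taxonomy, in contrast to algorithms that depend on architecture-specific details such as CNN feature maps.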

The paper also emphasizes the importance of "trustworthiness" in interpretation algorithms, which ensures that the produced interpretations accurately reflect the model's decision-making process rather than producing misleading or human-driven explanations.

Evaluation of Interpretation Algorithms and Model Interpretability

The paper provides a detailed survey of evaluation methodologies for interpretation algorithms, with a focus on trustworthiness. These include perturbation-based evaluations, parameter randomization checks, and newer methods such as Benchmarking Attribution Methods (BAM).
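
A perturbation-based evaluation can be sketched as follows, assuming a simple "deletion" protocol: features are removed from most to least important according to an attribution, and the model output is recorded after each removal. If the attribution reflects what the model actually uses, the output should fall quickly, giving a smaller area under the curve. The function names, the toy linear model, and the zero baseline are assumptions for illustration; the metrics covered in the survey differ in their details.

```python
from typing import Callable, List, Sequence


def deletion_curve(
    model: Callable[[Sequence[float]], float],
    x: Sequence[float],
    attributions: Sequence[float],
    baseline: float = 0.0,
) -> List[float]:
    """Model outputs as features are removed from most to least important."""
    order = sorted(range(len(x)), key=lambda i: attributions[i], reverse=True)
    current = list(x)
    outputs = [model(current)]
    for i in order:
        current[i] = baseline        # removals accumulate across steps
        outputs.append(model(current))
    return outputs


def area_under_curve(values: Sequence[float]) -> float:
    """Trapezoidal area with unit spacing; lower means a faster output drop."""
    return sum((a + b) / 2.0 for a, b in zip(values, values[1:]))


def toy_model(x: Sequence[float]) -> float:
    """Hypothetical stand-in for a trained model: a fixed linear scorer."""
    weights = [3.0, 2.0, 1.0]
    return sum(w * v for w, v in zip(weights, x))


x = [1.0, 1.0, 1.0]
faithful = [3.0, 2.0, 1.0]     # ranks features the way the model weights them
misleading = [1.0, 2.0, 3.0]   # ranks features in the opposite order

faithful_curve = deletion_curve(toy_model, x, faithful)
misleading_curve = deletion_curve(toy_model, x, misleading)
print(faithful_curve, area_under_curve(faithful_curve))      # [6.0, 3.0, 1.0, 0.0] 7.0
print(misleading_curve, area_under_curve(misleading_curve))  # [6.0, 5.0, 3.0, 0.0] 11.0
```

In this toy setting the faithful ranking yields the smaller area, which is the signal such a test looks for. With real networks the choice of baseline matters, since replacing features with zeros can itself push inputs off the data distribution.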

For evaluating model interpretability, the paper discusses methods such as Network Dissection and the Pointing Game, which gauge a model's interpretability by comparing generated interpretations with human annotations such as concept labels, alongside approaches based on model performance on out-of-distribution data.
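
The comparison against human annotations can be sketched with a simplified Pointing Game: an interpretation counts as a hit when the most salient location of its map falls inside the annotated region, and the score is the fraction of hits. The version below assumes axis-aligned bounding boxes and no pixel tolerance, both simplifications; all names and the toy saliency maps are hypothetical.

```python
from typing import List, Sequence, Tuple

Box = Tuple[int, int, int, int]  # (row_min, col_min, row_max, col_max), inclusive
SaliencyMap = Sequence[Sequence[float]]


def argmax_2d(saliency: SaliencyMap) -> Tuple[int, int]:
    """Return (row, col) of the largest value; the first one wins on ties."""
    best = (0, 0)
    for r, row in enumerate(saliency):
        for c, value in enumerate(row):
            if value > saliency[best[0]][best[1]]:
                best = (r, c)
    return best


def is_hit(saliency: SaliencyMap, box: Box) -> bool:
    """A hit means the most salient location lies inside the annotated box."""
    r, c = argmax_2d(saliency)
    r0, c0, r1, c1 = box
    return r0 <= r <= r1 and c0 <= c <= c1


def pointing_game_accuracy(maps: List[SaliencyMap], boxes: List[Box]) -> float:
    """Fraction of saliency maps whose peak falls inside its annotated box."""
    hits = sum(is_hit(s, b) for s, b in zip(maps, boxes))
    return hits / len(maps)


maps = [
    [[0.1, 0.2, 0.0],
     [0.0, 0.9, 0.1],
     [0.0, 0.1, 0.0]],   # peak at (1, 1): inside the box
    [[0.8, 0.1, 0.0],
     [0.0, 0.2, 0.1],
     [0.0, 0.1, 0.3]],   # peak at (0, 0): outside the box
]
boxes = [(1, 1, 2, 2), (1, 1, 2, 2)]

accuracy = pointing_game_accuracy(maps, boxes)
print(accuracy)  # 0.5
```

Since the interpretation algorithm is held fixed, a higher score across a dataset is read as evidence that the model's evidence aligns with human-labeled objects, which is how such scores are used to compare the interpretability of different models.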

Broader Implications and Future Directions

The survey highlights how interpretability relates to the robustness and vulnerability of deep models, particularly adversarial robustness. It suggests that improved interpretability not only enhances model reliability but also helps refine models by learning from interpretation results. Additionally, the open-source libraries it introduces point to a trend toward democratizing tools for interpreting AI models, fostering greater transparency and responsible AI development.

Conclusion

This paper consolidates existing research on interpretable deep learning. It provides a structured framework for classifying and evaluating interpretation methods. Future research can build on this framework to enhance model transparency, deepen understanding of model behavior, and ultimately produce more reliable AI systems. The work is useful for researchers aiming to bridge the gap between complex neural models and human understanding.
