Towards Faithful Model Explanation in NLP: A Survey (2209.11326v4)
Abstract: End-to-end neural NLP models are notoriously difficult to understand. This has given rise to numerous efforts towards model explainability in recent years. One desideratum of model explanation is faithfulness, i.e. an explanation should accurately represent the reasoning process behind the model's prediction. In this survey, we review over 110 model explanation methods in NLP through the lens of faithfulness. We first discuss the definition and evaluation of faithfulness, as well as its significance for explainability. We then introduce recent advances in faithful explanation, grouping existing approaches into five categories: similarity-based methods, analysis of model-internal structures, backpropagation-based methods, counterfactual intervention, and self-explanatory models. For each category, we synthesize its representative studies, strengths, and weaknesses. Finally, we summarize their common virtues and remaining challenges, and reflect on future work directions towards faithful explainability in NLP.