Faithfulness Tests for Natural Language Explanations (2305.18029v2)
Abstract: Explanations of neural models aim to reveal a model's decision-making process for its predictions. However, recent work shows that current explanation methods, such as saliency maps or counterfactuals, can be misleading, as they are prone to present reasons that are unfaithful to the model's inner workings. This work explores the challenging question of evaluating the faithfulness of natural language explanations (NLEs). To this end, we present two tests. First, we propose a counterfactual input editor that inserts reasons which lead to counterfactual predictions but are not reflected in the NLEs. Second, we reconstruct inputs from the reasons stated in the generated NLEs and check how often they lead to the same predictions. Our tests can evaluate emerging NLE models, providing a fundamental tool in the development of faithful NLEs.
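To make the two tests concrete, below is a minimal Python sketch of the procedure the abstract describes, not the paper's implementation: the `Predictor`/`Explainer` interfaces, the `candidate_insertions` assumed to come from a counterfactual editor, the substring-based check for whether an NLE mentions an inserted word, and the toy model in the demo are all illustrative assumptions.

```python
from typing import Callable, List, Tuple

# Type aliases for the two model interfaces the tests operate on.
Predictor = Callable[[str], str]   # input text -> predicted label
Explainer = Callable[[str], str]   # input text -> natural language explanation (NLE)


def counterfactual_test(text: str,
                        predict: Predictor,
                        explain: Explainer,
                        candidate_insertions: List[Tuple[str, str]]) -> bool:
    """Test 1 (sketch): return True if some inserted word flips the prediction
    but is not mentioned in the NLE for the edited input, i.e. the NLE omits a
    reason that demonstrably changed the model's decision."""
    original_label = predict(text)
    for anchor, word in candidate_insertions:
        # Insert `word` right after the first occurrence of `anchor`.
        edited = text.replace(anchor, f"{anchor} {word}", 1)
        if predict(edited) != original_label:            # prediction flipped
            nle = explain(edited)
            if word.lower() not in nle.lower():          # reason not reflected
                return True
    return False


def reconstruction_test(text: str, predict: Predictor, explain: Explainer) -> bool:
    """Test 2 (sketch): rebuild an input from the input words the NLE mentions
    and check whether it receives the same prediction as the original input."""
    nle = explain(text).lower()
    # Naive "reason extraction": keep only input tokens that appear in the NLE.
    kept = [tok for tok in text.split() if tok.lower().strip(".,!?") in nle]
    reconstructed = " ".join(kept)
    return predict(reconstructed) == predict(text)


if __name__ == "__main__":
    # Toy stand-ins so the sketch runs end to end; real tests would wrap an
    # actual self-rationalizing classifier that predicts a label and an NLE.
    def toy_predict(t: str) -> str:
        return "positive" if "good" in t and "not" not in t else "negative"

    def toy_explain(t: str) -> str:
        return ("The review is positive because it says 'good'."
                if "good" in t else "The review is negative.")

    text = "The food was good."
    print("unfaithful under test 1:",
          counterfactual_test(text, toy_predict, toy_explain, [("was", "not")]))
    print("consistent under test 2:",
          reconstruction_test(text, toy_predict, toy_explain))
```

In practice, the insertion candidates would come from a learned counterfactual editor and the mention check would go beyond exact substring matching, but the control flow of both tests stays the same.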