The Probabilities Also Matter: A More Faithful Metric for Faithfulness of Free-Text Explanations in Large Language Models (2404.03189v2)
Abstract: To oversee advanced AI systems, it is important to understand their underlying decision-making processes. When prompted, LLMs can provide natural language explanations or reasoning traces that sound plausible and receive high ratings from human annotators. However, it is unclear to what extent these explanations are faithful, i.e., truly capture the factors responsible for the model's predictions. In this work, we introduce Correlational Explanatory Faithfulness (CEF), a metric that can be used in faithfulness tests based on input interventions. Previous metrics used in such tests account only for binary changes in the predictions. Our metric instead accounts for the total shift in the model's predicted label distribution, more accurately reflecting the explanations' faithfulness. We then introduce the Correlational Counterfactual Test (CCT) by instantiating CEF on the Counterfactual Test (CT) from Atanasova et al. (2023). We evaluate the faithfulness of free-text explanations generated by few-shot-prompted LLMs from the Llama 2 family on three NLP tasks. We find that our metric measures aspects of faithfulness that the CT misses.
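To make the contrast in the abstract concrete, the sketch below is an illustrative toy rather than the paper's implementation: it compares a CT-style binary signal (did the predicted label flip after an input intervention?) with a distribution-level shift (here, total variation distance between the predicted label distributions), and correlates explanation mentions of the intervened-on factor with the size of that shift, which is the general kind of quantity a correlational faithfulness metric aggregates. The function names, the choice of total variation distance, and the use of Pearson correlation are all assumptions made for illustration.

```python
# Minimal sketch (not the authors' code): contrasting a binary prediction-flip
# signal, as in the Counterfactual Test (CT), with a distribution-level shift
# of the kind CEF is described as using. Names and metric choices here are
# illustrative assumptions.
from typing import Sequence
import numpy as np


def binary_change(p_before: Sequence[float], p_after: Sequence[float]) -> int:
    """1 if the argmax label flips after the input intervention, else 0."""
    return int(np.argmax(p_before) != np.argmax(p_after))


def distribution_shift(p_before: Sequence[float], p_after: Sequence[float]) -> float:
    """Total variation distance between predicted label distributions."""
    p, q = np.asarray(p_before, float), np.asarray(p_after, float)
    return 0.5 * float(np.abs(p - q).sum())


def correlational_faithfulness(mentions: Sequence[int], shifts: Sequence[float]) -> float:
    """Correlate whether the explanation mentions the intervened-on factor
    with how strongly the intervention shifts the prediction (Pearson r)."""
    return float(np.corrcoef(mentions, shifts)[0, 1])


# Example: an intervention that moves probability mass without flipping the
# argmax is invisible to the binary metric but registers as a nonzero shift.
p0, p1 = [0.70, 0.20, 0.10], [0.55, 0.35, 0.10]
print(binary_change(p0, p1))        # 0   -- CT-style metric sees no change
print(distribution_shift(p0, p1))   # 0.15 -- distribution-level shift
```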
- Sanity checks for saliency maps. In Neural Information Processing Systems.
- Explanations for CommonsenseQA: New Dataset and Models. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 3050–3065, Online. Association for Computational Linguistics.
- Pepa Atanasova, Oana-Maria Camburu, Christina Lioma, Thomas Lukasiewicz, Jakob Grue Simonsen, and Isabelle Augenstein. 2023. Faithfulness tests for natural language explanations. In Annual Meeting of the Association for Computational Linguistics.
- The struggles of feature-based explanations: Shapley values vs. minimal sufficient subsets. In AAAI 2021 Workshop on Explainable Agency in Artificial Intelligence.
- e-SNLI: Natural language inference with natural language explanations. NeurIPS.
- Interpretable by design: Learning predictors by composing interpretable queries. IEEE Transactions on Pattern Analysis and Machine Intelligence, 45(6):7430–7443.
- Antonia Creswell and Murray Shanahan. 2022. Faithful reasoning using large language models.
- Selection-inference: Exploiting large language models for interpretable logical reasoning. ICLR.
- ERASER: A benchmark to evaluate rationalized NLP models. In Annual Meeting of the Association for Computational Linguistics.
- On interpretability of artificial neural networks: A survey. IEEE Transactions on Radiation and Plasma Medical Sciences, 5:741–760.
- Christiane Fellbaum. 2010. WordNet. In Theory and Applications of Ontology: Computer Applications, pages 231–243. Springer.
- Alon Jacovi and Yoav Goldberg. 2020. Towards faithfully interpretable NLP systems: How should we define and evaluate faithfulness? In Annual Meeting of the Association for Computational Linguistics.
- Explaining chest X-ray pathologies in natural language. In Medical Image Computing and Computer Assisted Intervention – MICCAI 2022, pages 701–713, Cham. Springer Nature Switzerland.
- Tamera Lanham. 2022. Externalized reasoning oversight: a research direction for language model alignment.
- Measuring faithfulness in chain-of-thought reasoning.
- The alignment problem from a deep learning perspective.
- HuSpaCy: An industrial-strength Hungarian natural language processing toolkit. arXiv preprint arXiv:2201.01956.
- Martin F. Porter. 2001. Snowball: A language for stemming algorithms.
- Question decomposition improves the faithfulness of model-generated reasoning.
- Explain Yourself! Leveraging language models for commonsense reasoning.
- Fabien Roger and Ryan Greenblatt. 2023. Preventing language models from hiding their reasoning.
- Cynthia Rudin. 2019. Stop explaining black box machine learning models for high stakes decisions and use interpretable models instead. Nature Machine Intelligence, 1:206–215.
- Goal misgeneralization: Why correct specifications aren’t enough for correct goals.
- CommonsenseQA: A question answering challenge targeting commonsense knowledge. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4149–4158, Minneapolis, Minnesota. Association for Computational Linguistics.
- Llama 2: Open foundation and fine-tuned chat models.
- Language models don’t always say what they think: Unfaithful explanations in chain-of-thought prompting. ArXiv, abs/2305.04388.
- SemEval-2020 task 4: Commonsense validation and explanation. In Proceedings of the Fourteenth Workshop on Semantic Evaluation, pages 307–321, Barcelona (online). International Committee for Computational Linguistics.
- Honesty is the best policy: Defining and mitigating AI deception.
- Chain-of-thought prompting elicits reasoning in large language models.
- Sarah Wiegreffe and Ana Marasović. 2021. Teach me to explain: A review of datasets for explainable natural language processing. In NeurIPS Datasets and Benchmarks.
- Measuring association between labels and free-text rationales. In Conference on Empirical Methods in Natural Language Processing.