Faithfulness vs. Plausibility: On the (Un)Reliability of Explanations from Large Language Models (2402.04614v3)
Abstract: LLMs are deployed as powerful tools for a wide range of NLP applications. Recent work shows that modern LLMs can generate self-explanations (SEs), which surface their intermediate reasoning steps to explain their behavior. Self-explanations have seen widespread adoption owing to their conversational and plausible nature. However, there is little to no understanding of their faithfulness. In this work, we discuss the dichotomy between faithfulness and plausibility in SEs generated by LLMs. We argue that while LLMs are adept at generating plausible explanations -- seemingly logical and coherent to human users -- these explanations do not necessarily align with the reasoning processes of the LLMs, raising concerns about their faithfulness. We highlight that the current trend toward increasing the plausibility of explanations, driven primarily by the demand for user-friendly interfaces, may come at the cost of diminishing their faithfulness. We assert that the faithfulness of explanations is critical when LLMs are employed for high-stakes decision-making. Moreover, we emphasize the need to systematically characterize the faithfulness-plausibility requirements of different real-world applications and to ensure that explanations meet those needs. While there are several approaches to improving plausibility, improving faithfulness remains an open challenge. We call upon the community to develop novel methods that enhance the faithfulness of self-explanations, thereby enabling transparent deployment of LLMs in diverse high-stakes settings.
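To make the faithfulness/plausibility distinction concrete, the sketch below illustrates one simple style of faithfulness probe discussed in work on measuring chain-of-thought faithfulness: reveal the model's stated reasoning step by step and check whether the final answer actually depends on it. This is a minimal, assumption-laden illustration, not a method from the paper; the `query_model` helper, the prompt format, and the `looks_unfaithful` heuristic are hypothetical placeholders.

```python
# Minimal sketch of a chain-of-thought faithfulness probe via truncation.
# `query_model` is a hypothetical stand-in for any LLM completion call;
# it is not an API defined in the paper.

from typing import Callable, List


def truncation_probe(
    question: str,
    reasoning_steps: List[str],
    query_model: Callable[[str], str],
) -> List[str]:
    """Ask for a final answer after revealing 0, 1, ..., N reasoning steps.

    If the answer is identical regardless of how much of the stated reasoning
    is shown, the explanation likely did not drive the prediction, which is a
    sign of an unfaithful (even if plausible) self-explanation.
    """
    answers = []
    for k in range(len(reasoning_steps) + 1):
        partial_reasoning = "\n".join(reasoning_steps[:k])
        prompt = (
            f"Question: {question}\n"
            f"Reasoning so far:\n{partial_reasoning}\n"
            "Based only on the reasoning above, give the final answer now:"
        )
        answers.append(query_model(prompt).strip())
    return answers


def looks_unfaithful(answers: List[str]) -> bool:
    # Crude heuristic: the answer never changes as more reasoning is revealed.
    return len(set(answers)) == 1
```

Note that a probe like this checks only the causal role of the written steps, not how convincing they sound; in practice such tests would be aggregated over many examples, and a flat answer trajectory is weak evidence on any single one.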