Language Models Don't Always Say What They Think: Unfaithful Explanations in Chain-of-Thought Prompting (2305.04388v2)
Abstract: LLMs can achieve strong performance on many tasks by producing step-by-step reasoning before giving a final output, often referred to as chain-of-thought reasoning (CoT). It is tempting to interpret these CoT explanations as the LLM's process for solving a task. This level of transparency into LLMs' predictions would yield significant safety benefits. However, we find that CoT explanations can systematically misrepresent the true reason for a model's prediction. We demonstrate that CoT explanations can be heavily influenced by adding biasing features to model inputs--e.g., by reordering the multiple-choice options in a few-shot prompt to make the answer always "(A)"--which models systematically fail to mention in their explanations. When we bias models toward incorrect answers, they frequently generate CoT explanations rationalizing those answers. This causes accuracy to drop by as much as 36% on a suite of 13 tasks from BIG-Bench Hard, when testing with GPT-3.5 from OpenAI and Claude 1.0 from Anthropic. On a social-bias task, model explanations justify giving answers in line with stereotypes without mentioning the influence of these social biases. Our findings indicate that CoT explanations can be plausible yet misleading, which risks increasing our trust in LLMs without guaranteeing their safety. Building more transparent and explainable systems will require either improving CoT faithfulness through targeted efforts or abandoning CoT in favor of alternative methods.
- Jacob Andreas. Language Models as Agent Models. In Findings of the Association for Computational Linguistics: EMNLP 2022, pages 5769–5779, Abu Dhabi, United Arab Emirates, December 2022. Association for Computational Linguistics. URL https://aclanthology.org/2022.findings-emnlp.423.
- Anthropic. Meet Claude. https://www.anthropic.com/product, 2023. Accessed: 2023-04-03.
- Constitutional AI: Harmlessness from AI Feedback, 2022. URL https://arxiv.org/abs/2212.08073.
- Discovering Latent Knowledge in Language Models Without Supervision. In The Eleventh International Conference on Learning Representations, 2023. URL https://openreview.net/forum?id=ETKGuby0hcs.
- Can Rationalization Improve Robustness? In Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 3792–3805, Seattle, United States, July 2022. Association for Computational Linguistics. doi: 10.18653/v1/2022.naacl-main.278. URL https://aclanthology.org/2022.naacl-main.278.
- Do Models Explain Themselves? Counterfactual Simulatability of Natural Language Explanations, July 2023. URL http://arxiv.org/abs/2307.08678. arXiv:2307.08678 [cs].
- Faithful Reasoning Using Large Language Models, August 2022. URL http://arxiv.org/abs/2208.14271. arXiv:2208.14271 [cs].
- Language models show human-like content effects on reasoning, July 2022. URL http://arxiv.org/abs/2207.07051. arXiv:2207.07051 [cs].
- Towards A Rigorous Science of Interpretable Machine Learning, March 2017. URL http://arxiv.org/abs/1702.08608. arXiv:1702.08608 [cs, stat].
- Honest Students from Untrusted Teachers: Learning an Interpretable Question-Answering Pipeline from a Pretrained Language Model. In Workshop on Trustworthy and Socially Responsible Machine Learning, NeurIPS 2022, 2022. URL https://openreview.net/forum?id=c4ob9nFloFW.
- The Capacity for Moral Self-Correction in Large Language Models, February 2023. URL http://arxiv.org/abs/2302.07459. arXiv:2302.07459 [cs].
- Leo Gao. Shapley Value Attribution in Chain of Thought. 2023. URL https://www.alignmentforum.org/posts/FX5JmftqL2j6K8dn4/shapley-value-attribution-in-chain-of-thought.
- ROSCOE: A Suite of Metrics for Scoring Step-by-Step Reasoning. In The Eleventh International Conference on Learning Representations, 2023. URL https://openreview.net/forum?id=xYlJRpzZtsY.
- Leakage-Adjusted Simulatability: Can Models Generate Non-Trivial Explanations of Their Behavior in Natural Language? In Findings of the Association for Computational Linguistics: EMNLP 2020, pages 4351–4367, Online, November 2020. Association for Computational Linguistics. doi: 10.18653/v1/2020.findings-emnlp.390. URL https://aclanthology.org/2020.findings-emnlp.390.
- Denis Hilton. Social Attribution and Explanation. In Michael R. Waldmann, editor, The Oxford Handbook of Causal Reasoning, page 0. Oxford University Press, June 2017. ISBN 978-0-19-939955-0. doi: 10.1093/oxfordhb/9780199399550.013.33. URL https://doi.org/10.1093/oxfordhb/9780199399550.013.33.
- The Curious Case of Neural Text Degeneration. In International Conference on Learning Representations, 2020. URL https://openreview.net/forum?id=rygGQyrFvH.
- Towards Faithfully Interpretable NLP Systems: How Should We Define and Evaluate Faithfulness? In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 4198–4205, Online, July 2020. Association for Computational Linguistics. doi: 10.18653/v1/2020.acl-main.386. URL https://aclanthology.org/2020.acl-main.386.
- Maieutic prompting: Logically consistent reasoning with recursive explanations. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, pages 1266–1279, Abu Dhabi, United Arab Emirates, December 2022. Association for Computational Linguistics. URL https://aclanthology.org/2022.emnlp-main.82.
- Large Language Models are Zero-Shot Reasoners. In Alice H. Oh, Alekh Agarwal, Danielle Belgrave, and Kyunghyun Cho, editors, Advances in Neural Information Processing Systems, 2022. URL https://openreview.net/forum?id=e2TBb5y0yFf.
- Measuring Faithfulness in Chain-of-Thought Reasoning, July 2023. URL http://arxiv.org/abs/2307.13702. arXiv:2307.13702 [cs].
- Solving Quantitative Reasoning Problems with Language Models. In Alice H. Oh, Alekh Agarwal, Danielle Belgrave, and Kyunghyun Cho, editors, Advances in Neural Information Processing Systems, 2022. URL https://openreview.net/forum?id=IFXTZERXdM7.
- Holistic Evaluation of Language Models, November 2022. URL http://arxiv.org/abs/2211.09110. arXiv:2211.09110 [cs].
- Tania Lombrozo. The structure and function of explanations. Trends in Cognitive Sciences, 10(10):464–470, October 2006. ISSN 1364-6613. doi: 10.1016/j.tics.2006.08.004.
- Towards Faithful Model Explanation in NLP: A Survey. 2022. doi: 10.48550/ARXIV.2209.11326. URL https://arxiv.org/abs/2209.11326. Publisher: arXiv Version Number: 2.
- Faithful Chain-of-Thought Reasoning, 2023. URL https://api.semanticscholar.org/CorpusID:256416127.
- Text and Patterns: For Effective Chain of Thought, It Takes Two to Tango, October 2022. URL http://arxiv.org/abs/2209.07686. arXiv:2209.07686 [cs].
- Inverse scaling prize: Second round winners, 2023. URL https://irmckenzie.co.uk/round2.
- Why do humans reason? Arguments for an argumentative theory. Behavioral and Brain Sciences, 34(2):57–74, April 2011. ISSN 1469-1825, 0140-525X. doi: 10.1017/S0140525X10000968. URL https://www.cambridge.org/core/journals/behavioral-and-brain-sciences/article/why-do-humans-reason-arguments-for-an-argumentative-theory/53E3F3180014E80E8BE9FB7A2DD44049. Publisher: Cambridge University Press.
- Multi-hop Reading Comprehension through Question Decomposition and Rescoring. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 6097–6109, Florence, Italy, July 2019. Association for Computational Linguistics. doi: 10.18653/v1/P19-1613. URL https://aclanthology.org/P19-1613.
- Rethinking the Role of Demonstrations: What Makes In-Context Learning Work? In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, pages 11048–11064, Abu Dhabi, United Arab Emirates, December 2022. Association for Computational Linguistics. URL https://aclanthology.org/2022.emnlp-main.759.
- Telling more than we can know: Verbal reports on mental processes. Psychological Review, 84:231–259, 1977. ISSN 1939-1471. doi: 10.1037/0033-295X.84.3.231. Place: US Publisher: American Psychological Association.
- Show Your Work: Scratchpads for Intermediate Computation with Language Models, November 2021. URL http://arxiv.org/abs/2112.00114. arXiv:2112.00114 [cs].
- OpenAI. Model index for researchers. https://platform.openai.com/docs/model-index-for-researchers, 2023. Accessed: 2023-04-03.
- Training language models to follow instructions with human feedback. In Alice H. Oh, Alekh Agarwal, Danielle Belgrave, and Kyunghyun Cho, editors, Advances in Neural Information Processing Systems, 2022. URL https://openreview.net/forum?id=TG8KACxEON.
- How to Catch an AI Liar: Lie Detection in Black-Box LLMs by Asking Unrelated Questions, September 2023. URL http://arxiv.org/abs/2309.15840. arXiv:2309.15840 [cs].
- BBQ: A hand-built bias benchmark for question answering. In Findings of the Association for Computational Linguistics: ACL 2022, pages 2086–2105, Dublin, Ireland, May 2022. Association for Computational Linguistics. doi: 10.18653/v1/2022.findings-acl.165. URL https://aclanthology.org/2022.findings-acl.165.
- Unsupervised Question Decomposition for Question Answering. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 8864–8880, Online, November 2020. Association for Computational Linguistics. doi: 10.18653/v1/2020.emnlp-main.713. URL https://aclanthology.org/2020.emnlp-main.713.
- Discovering Language Model Behaviors with Model-Written Evaluations, December 2022. URL http://arxiv.org/abs/2212.09251. arXiv:2212.09251 [cs].
- Question Decomposition Improves the Faithfulness of Model-Generated Reasoning, July 2023. URL http://arxiv.org/abs/2307.11768. arXiv:2307.11768 [cs].
- Iterated Decomposition: Improving Science Q&A by Supervising Reasoning Processes, January 2023. URL http://arxiv.org/abs/2301.01751. arXiv:2301.01751 [cs].
- Cynthia Rudin. Stop Explaining Black Box Machine Learning Models for High Stakes Decisions and Use Interpretable Models Instead, September 2019. URL http://arxiv.org/abs/1811.10154. arXiv:1811.10154 [cs, stat].
- Self-critiquing models for assisting human evaluators, June 2022. URL http://arxiv.org/abs/2206.05802. arXiv:2206.05802 [cs].
- On Second Thought, Let’s Not Think Step by Step! Bias and Toxicity in Zero-Shot Reasoning, December 2022. URL http://arxiv.org/abs/2212.08061. arXiv:2212.08061 [cs].
- Towards Understanding Sycophancy in Language Models, October 2023. URL http://arxiv.org/abs/2310.13548. arXiv:2310.13548 [cs, stat].
- Large Language Models Can Be Easily Distracted by Irrelevant Context, February 2023. URL http://arxiv.org/abs/2302.00093. arXiv:2302.00093 [cs].
- Beyond the Imitation Game: Quantifying and extrapolating the capabilities of language models, June 2022. URL http://arxiv.org/abs/2206.04615. arXiv:2206.04615 [cs, stat].
- Challenging BIG-Bench Tasks and Whether Chain-of-Thought Can Solve Them, October 2022. URL http://arxiv.org/abs/2210.09261. arXiv:2210.09261 [cs].
- Entailer: Answering Questions with Faithful and Truthful Chains of Reasoning. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, pages 2078–2093, Abu Dhabi, United Arab Emirates, December 2022. Association for Computational Linguistics. URL https://aclanthology.org/2022.emnlp-main.134.
- Solving math word problems with process- and outcome-based feedback, November 2022. URL http://arxiv.org/abs/2211.14275. arXiv:2211.14275 [cs].
- Towards Understanding Chain-of-Thought Prompting: An Empirical Study of What Matters. In ICLR 2023 Workshop on Mathematical and Empirical Understanding of Foundation Models, 2023. URL https://openreview.net/forum?id=L9UMeoeU2i.
- Do Prompt-Based Models Really Understand the Meaning of Their Prompts? In Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 2300–2344, Seattle, United States, July 2022. Association for Computational Linguistics. doi: 10.18653/v1/2022.naacl-main.167. URL https://aclanthology.org/2022.naacl-main.167.
- Chain-of-Thought Prompting Elicits Reasoning in Large Language Models. In Alice H. Oh, Alekh Agarwal, Danielle Belgrave, and Kyunghyun Cho, editors, Advances in Neural Information Processing Systems, 2022. URL https://openreview.net/forum?id=_VjQlMeSB_J.
- Xi Ye and Greg Durrett. The Unreliability of Explanations in Few-shot Prompting for Textual Reasoning. In Alice H. Oh, Alekh Agarwal, Danielle Belgrave, and Kyunghyun Cho, editors, Advances in Neural Information Processing Systems, 2022. URL https://openreview.net/forum?id=Bct2f8fRd8S.
- Complementary Explanations for Effective In-Context Learning, November 2022. URL http://arxiv.org/abs/2211.13892. arXiv:2211.13892 [cs].
- STaR: Bootstrapping Reasoning With Reasoning. In Alice H. Oh, Alekh Agarwal, Danielle Belgrave, and Kyunghyun Cho, editors, Advances in Neural Information Processing Systems, 2022. URL https://openreview.net/forum?id=_3ELRdg2sgI.
- Least-to-Most Prompting Enables Complex Reasoning in Large Language Models. In The Eleventh International Conference on Learning Representations, 2023. URL https://openreview.net/forum?id=WZH7099tgfM.