LLM-Generated Black-box Explanations Can Be Adversarially Helpful (arXiv:2405.06800v3)

Published 10 May 2024 in cs.CL

Abstract: LLMs are becoming vital tools that help us solve and understand complex problems by acting as digital assistants. LLMs can generate convincing explanations even when only given the inputs and outputs of these problems, i.e., in a "black-box" approach. However, our research uncovers a hidden risk tied to this approach, which we call adversarial helpfulness. This happens when an LLM's explanations make a wrong answer look right, potentially leading people to trust incorrect solutions. In this paper, we show that this issue affects not just humans, but also LLM evaluators. Digging deeper, we identify and examine key persuasive strategies employed by LLMs. Our findings reveal that these models employ strategies such as reframing the questions, expressing an elevated level of confidence, and cherry-picking evidence to paint misleading answers in a credible light. To examine whether LLMs are able to navigate complex-structured knowledge when generating adversarially helpful explanations, we create a special task based on navigating through graphs. Most LLMs are not able to find alternative paths along simple graphs, indicating that their misleading explanations aren't produced solely through logical deduction over complex knowledge. These findings shed light on the limitations of the black-box explanation setting and allow us to provide advice on the safe usage of LLMs.
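
The abstract names the graph-navigation probe without spelling it out, so below is a minimal sketch, in Python, of what "finding an alternative path along a simple graph" can mean programmatically. Everything here is an illustrative assumption rather than the paper's actual task: the graph encoding (an adjacency-list dict), the function names (find_path, alternative_path_exists), and the toy graph are all invented for this example. The idea is to find one route by breadth-first search, ban that route's edges, and check whether a second route survives.

```python
from collections import deque

def find_path(graph, start, goal, banned_edges=frozenset()):
    """BFS for a shortest path from start to goal in a directed
    graph given as an adjacency-list dict, skipping banned edges."""
    queue = deque([[start]])
    visited = {start}
    while queue:
        path = queue.popleft()
        node = path[-1]
        if node == goal:
            return path
        for nxt in graph.get(node, []):
            if (node, nxt) in banned_edges or nxt in visited:
                continue
            visited.add(nxt)
            queue.append(path + [nxt])
    return None  # no route exists

def alternative_path_exists(graph, start, goal):
    """Find one shortest path, then check whether a second route
    survives once that path's edges are removed."""
    first = find_path(graph, start, goal)
    if first is None:
        return False
    banned = frozenset(zip(first, first[1:]))  # edges of the first route
    return find_path(graph, start, goal, banned) is not None

# Toy graph (hypothetical): two distinct routes from A to D.
graph = {"A": ["B", "C"], "B": ["D"], "C": ["D"], "D": []}
print(find_path(graph, "A", "D"))                # ['A', 'B', 'D']
print(alternative_path_exists(graph, "A", "D"))  # True, via A -> C -> D
```

Note that banning the first path's edges is a greedy shortcut: it suffices for small illustrative graphs like the one above, while a rigorous test for two edge-disjoint paths would use a max-flow formulation. Producing such a second route on demand is the capability the abstract reports most LLMs lack, even on simple graphs.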

