A Closer Look at the Self-Verification Abilities of Large Language Models in Logical Reasoning

Published 14 Nov 2023 in cs.AI and cs.CL (arXiv:2311.07954v2)

Abstract: Logical reasoning has been an ongoing pursuit in the field of AI. Despite significant advancements made by LLMs, they still struggle with complex logical reasoning problems. To enhance reasoning performance, one promising direction is scalable oversight, which requires LLMs to identify their own errors and then improve by themselves. Various self-verification methods have been proposed in pursuit of this goal. Nevertheless, whether existing models understand their own errors well is still under investigation. In this paper, we take a closer look at the self-verification abilities of LLMs in the context of logical reasoning, focusing on their ability to identify logical fallacies accurately. We introduce a dataset, FALLACIES, containing 232 types of reasoning fallacies categorized in a hierarchical taxonomy. By conducting exhaustive experiments on FALLACIES, we obtain comprehensive and detailed analyses of a series of models on their verification abilities. Our main findings suggest that existing LLMs could struggle to identify fallacious reasoning steps accurately and may fall short of guaranteeing the validity of self-verification methods. Drawing from these observations, we offer suggestions for future research and practical applications of self-verification methods.


Summary

  • The paper shows that while LLMs display some self-verification ability, they frequently fail to identify logical fallacies in reasoning steps.
  • It introduces FALLACIES, a dataset of 4,640 reasoning steps covering 232 types of fallacies organized in a hierarchical taxonomy.
  • Experiments reveal that even state-of-the-art models achieve less than 80% accuracy in detecting fallacious steps.

A Closer Look at the Self-Verification Abilities of LLMs in Logical Reasoning

Introduction

The paper "A Closer Look at the Self-Verification Abilities of LLMs in Logical Reasoning" addresses the critical issue of logical reasoning within the framework of LLMs. The paper identifies a prevalent problem where LLMs, despite their emergent reasoning abilities, frequently generate invalid reasoning steps that fall into logical fallacies. To tackle this, a scalable oversight approach known as self-verification is explored, positing that LLMs can self-evaluate and correct erroneous logic. Figure 1

Figure 1: The self-verification approach requires LLMs to identify the fallacious steps in their own reasoning process. However, LLMs might be susceptible to certain types of fallacies and fail to identify them, leading to the potential failure of self-verification.
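
To make the setting concrete, the sketch below shows how a step-level self-verification loop is typically wired: the model drafts a chain of reasoning steps, re-examines each step, and rewrites any step it judges fallacious. This is a minimal sketch, not the paper's implementation; `generate_reasoning`, `verify_step`, and `revise_step` are hypothetical placeholders for LLM calls.

```python
from typing import List


def generate_reasoning(question: str) -> List[str]:
    """Hypothetical: ask an LLM for a list of reasoning steps."""
    raise NotImplementedError


def verify_step(question: str, steps: List[str], i: int) -> bool:
    """Hypothetical: ask the same LLM whether step i is logically valid."""
    raise NotImplementedError


def revise_step(question: str, steps: List[str], i: int) -> str:
    """Hypothetical: ask the LLM to rewrite a step flagged as fallacious."""
    raise NotImplementedError


def self_verify(question: str, max_rounds: int = 3) -> List[str]:
    """Draft reasoning, then iteratively verify and revise individual steps."""
    steps = generate_reasoning(question)
    for _ in range(max_rounds):
        flagged = [i for i in range(len(steps)) if not verify_step(question, steps, i)]
        if not flagged:
            break  # the verifier accepts every step
        for i in flagged:
            steps[i] = revise_step(question, steps, i)
    return steps
```

The paper's concern applies directly to `verify_step`: if the model cannot recognize a fallacy, the loop terminates early and silently returns invalid reasoning.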

Fallacies Dataset Introduction

The authors introduce a comprehensive dataset named FALLACIES, composed of 4,640 logical reasoning steps encompassing 232 types of fallacies categorized within a hierarchical taxonomy. The aim is to evaluate the self-verification abilities of LLMs by measuring how accurately they identify these fallacies (Figure 2).

Figure 2: The hierarchical taxonomy of fallacies. For each sub-category, we present its definition and an example of a fallacy within the sub-category. We use square brackets to indicate the premises and conclusions.
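
For readers who want to picture the data, one plausible way to represent a single entry is sketched below; the field names and example values are illustrative assumptions, not the released schema.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class FallacyExample:
    """Illustrative record for a FALLACIES-style entry (assumed field names)."""
    premises: List[str]   # e.g. ["If it rains, the grass gets wet.", "The grass is wet."]
    conclusion: str       # e.g. "Therefore, it rained."
    fallacy_type: str     # one of the 232 leaf types, e.g. "Affirming the Consequent"
    category: str         # top level of the taxonomy, e.g. formal vs. informal
    subcategory: str      # intermediate node in the hierarchical taxonomy
```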

Methodology and Experiments

The dataset enables an in-depth analysis of various LLMs, including GPT-4 and GPT-3.5. Models were evaluated on two tasks: identifying whether a reasoning step contains a logical fallacy, and naming the specific type of fallacy when one is present. Results showed that even state-of-the-art LLMs struggled, with most achieving less than 80% accuracy in identifying fallacious steps, indicating significant room for improvement.
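
The sketch below mirrors these two evaluation settings: a hypothetical `model_judge` wrapper around an LLM prompt returns a fallacious/valid verdict plus a predicted fallacy type, and accuracy is computed for both detection and type identification. It follows the description above, not the paper's exact protocol.

```python
from typing import Callable, List, Optional, Tuple

# Each item: (reasoning_step_text, gold_is_fallacious, gold_fallacy_type_or_None)
EvalItem = Tuple[str, bool, Optional[str]]


def evaluate(model_judge: Callable[[str], Tuple[bool, Optional[str]]],
             items: List[EvalItem]) -> Tuple[float, float]:
    """Return (detection accuracy, fallacy-type accuracy) for a verifier."""
    detect_hits, type_hits, fallacious_total = 0, 0, 0
    for step, gold_is_fallacious, gold_type in items:
        pred_is_fallacious, pred_type = model_judge(step)
        detect_hits += int(pred_is_fallacious == gold_is_fallacious)
        if gold_is_fallacious:
            fallacious_total += 1
            type_hits += int(pred_type == gold_type)
    return detect_hits / len(items), type_hits / max(fallacious_total, 1)
```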

A notable finding is the performance disparity among LLMs across different categories of fallacies. Formal fallacies, which concern the logical structure of an argument, proved more challenging than informal fallacies, which concern its content. This suggests that understanding logical structure remains a bottleneck in the logical reasoning capabilities of LLMs (Figure 3).

Figure 3: The definition of the fallacy of "Affirming the Consequent", one of 232 types of fallacies in our dataset.
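
For concreteness, the invalid form behind this fallacy can be written out and contrasted with the valid modus ponens; this is the standard textbook presentation, not a reproduction of the paper's figure.

```latex
% Affirming the consequent (invalid) vs. modus ponens (valid)
\[
\frac{P \rightarrow Q \qquad Q}{\therefore\ P} \quad (\text{invalid})
\qquad\qquad
\frac{P \rightarrow Q \qquad P}{\therefore\ Q} \quad (\text{modus ponens, valid})
\]
```

For example, "If it rained, the grass is wet; the grass is wet; therefore it rained" commits this fallacy by treating a sufficient condition as if it were necessary.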

Implications for LLM Development

The study underscores the limitations of relying solely on self-verification methods for logical reasoning within LLMs. It suggests the need for more tailored approaches that enhance a model's understanding of logical structures and fallacies. Additionally, it highlights the importance of comprehensive datasets that include a wide range of reasoning errors to facilitate robust model training and evaluation.

Given the current limitations observed, future work may explore methodologies like multi-agent debate systems or human-in-the-loop interactions to augment self-verification strategies, possibly bridging the gap between human and machine logical reasoning capabilities.

Conclusion

In conclusion, while self-verification presents a promising avenue for improving the reasoning capabilities of LLMs, the current limitations necessitate a cautious approach to its application. The research provides valuable insights and a robust dataset for further work in enhancing logical reasoning in AI, emphasizing the need for ongoing refinement in both theoretical and practical domains. This paper serves as a reminder of the complexities inherent in mimicking human reasoning and the continuous effort required to achieve this goal in artificial systems.
