
Beyond Accuracy: Evaluating the Reasoning Behavior of Large Language Models -- A Survey (2404.01869v2)

Published 2 Apr 2024 in cs.CL and cs.AI

Abstract: LLMs have recently shown impressive performance on tasks involving reasoning, leading to a lively debate on whether these models possess reasoning capabilities similar to humans. However, despite these successes, the depth of LLMs' reasoning abilities remains uncertain. This uncertainty partly stems from the predominant focus on task performance, measured through shallow accuracy metrics, rather than a thorough investigation of the models' reasoning behavior. This paper seeks to address this gap by providing a comprehensive review of studies that go beyond task accuracy, offering deeper insights into the models' reasoning processes. Furthermore, we survey prevalent methodologies to evaluate the reasoning behavior of LLMs, emphasizing current trends and efforts towards more nuanced reasoning analyses. Our review suggests that LLMs tend to rely on surface-level patterns and correlations in their training data, rather than on sophisticated reasoning abilities. Additionally, we identify the need for further research that delineates the key differences between human and LLM-based reasoning. Through this survey, we aim to shed light on the complex reasoning processes within LLMs.


Summary

  • The paper introduces a taxonomy differentiating core and integrated reasoning tasks to assess LLMs’ capabilities beyond surface-level accuracy.
  • It demonstrates that LLMs often rely on memorization and statistical patterns rather than systematic reasoning, particularly in logical and mathematical tasks.
  • The survey advocates novel evaluation frameworks, including rationale-based and interactive methods, to more accurately capture genuine reasoning behaviors in LLMs.

Evaluating the Reasoning Behavior of LLMs

The paper "Beyond Accuracy: Evaluating the Reasoning Behavior of LLMs -- A Survey" examines the reasoning capabilities of LLMs by moving beyond accuracy metrics to provide deeper insights into their reasoning behaviors. It highlights the limitations of current evaluations focused solely on task accuracy and suggests that LLMs rely heavily on surface-level patterns from training data rather than demonstrating true reasoning abilities. The paper introduces a categorization of reasoning tasks and explores evaluation methods that offer a more nuanced understanding of reasoning dynamics in LLMs.

Introduction to Reasoning in LLMs

The perennial question surrounding LLMs is whether their capabilities reflect genuine reasoning or merely mimic it as a byproduct of large-scale data exposure. Critics have characterized these models as "castles in the air": systems built on vast parameters and training data but lacking foundational reasoning skills. The paper aims to clarify these concerns by:

  • Providing insights into models' behaviors in diverse reasoning tasks.
  • Offering a taxonomy of methods for assessing reasoning beyond conventional accuracy.

The survey suggests the need for further research delineating the differences between human reasoning and LLM-based reasoning.

Figure 1: Schematic overview of the two types of reasoning tasks distinguished in this survey.

Reasoning Task Categorization

Two primary task types are identified:

  • Core Reasoning Tasks: Assess a single reasoning ability in isolation, such as logical, mathematical, or causal reasoning.
  • Integrated Reasoning Tasks: Require multiple reasoning skills concurrently, as in commonsense or scientific reasoning.

The paper emphasizes evaluating core tasks, particularly logical, mathematical, and causal reasoning, to understand LLMs' reasoning behaviors in isolated settings.

Key Findings on Reasoning Behavior

Logical Reasoning

Differences in reasoning capability across model sizes are apparent. Larger models tend to be better at generating valid, atomic reasoning steps, but those steps often have low utility, contributing little toward the final conclusion. Moreover, models frequently rely on statistical features of the input rather than systematic reasoning; this is especially evident for logical operators such as negation, which many models handle poorly.
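One way to make the negation weakness concrete is a simple consistency check: pose a statement and its negation and verify that the model's answers are mutually exclusive. The sketch below is only illustrative, not a protocol from the survey; `query_model` is a hypothetical stand-in for whatever LLM API is under evaluation and is assumed to return a short yes/no string.

```python
# Minimal sketch of a negation-consistency probe (illustrative; not the paper's protocol).
# `query_model` is a hypothetical placeholder for an LLM client.

def query_model(prompt: str) -> str:
    raise NotImplementedError("Plug in your LLM client here.")

def negation_consistent(statement: str, negation: str) -> bool:
    """A model that handles negation should give opposite answers to a statement
    and its negation."""
    template = "Is the following statement true? Answer yes or no.\n{}"
    a = query_model(template.format(statement)).strip().lower()
    b = query_model(template.format(negation)).strip().lower()
    return {a, b} == {"yes", "no"}

pairs = [
    ("All squares are rectangles.", "Not all squares are rectangles."),
    ("Some birds cannot fly.", "All birds can fly."),
]
# consistency = sum(negation_consistent(s, n) for s, n in pairs) / len(pairs)
```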

Mathematical Reasoning

LLMs tend to memorize rather than reason systematically, as shown by inconsistent performance on mathematical problems that are formulated differently but require the same underlying reasoning. The strong influence of term frequency in the pre-training data on numerical performance further highlights the models' struggle to generalize.
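This memorization-versus-reasoning distinction can be probed by rephrasing a single problem template with different names and numbers while keeping the required computation fixed; a large accuracy drop on the perturbed variants suggests reliance on familiar surface forms. The sketch below is a minimal illustration under that assumption, not the benchmark construction used by the works the survey covers, and it reuses the hypothetical `query_model` helper from the earlier sketch.

```python
import random

# Minimal sketch of a "same reasoning, different surface form" probe (illustrative).
# Gold answers are computed symbolically so that only the surface form varies.

TEMPLATE = "{name} has {a} apples and buys {b} more. How many apples does {name} have now?"

def make_variants(n: int, seed: int = 0) -> list[tuple[str, int]]:
    rng = random.Random(seed)
    names = ["Alice", "Bob", "Priya", "Chen"]
    variants = []
    for _ in range(n):
        a, b = rng.randint(2, 50), rng.randint(2, 50)
        variants.append((TEMPLATE.format(name=rng.choice(names), a=a, b=b), a + b))
    return variants

# accuracy = mean over variants of: int(query_model(prompt)) == gold
# Comparing this against accuracy on the original, widely seen phrasing helps
# separate memorized answers from the underlying arithmetic.
```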

Causal Reasoning

Current models show limited intrinsic ability to infer or construct causal relationships beyond memorized factual patterns. Counterfactual reasoning tasks in particular expose significant shortcomings.
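A counterfactual probe can be kept deliberately small: state a causal rule and a fact, then ask what would have happened had the fact been different. The items below are invented for illustration only (the benchmarks the survey discusses are far larger and more careful), and `query_model` is again the hypothetical helper introduced above.

```python
# Minimal sketch of a counterfactual probe (illustrative items, not a published benchmark).
# Each item states an explicit causal rule so the gold answer does not depend on
# background world knowledge.

items = [
    {
        "premise": "If the switch is up, the lamp is on. The switch is up.",
        "question": "If the switch had been down, would the lamp be on?",
        "gold": "no",
    },
    {
        "premise": "Pressing the button rings the bell. The button was pressed.",
        "question": "If the button had not been pressed, would the bell have rung?",
        "gold": "no",
    },
]

# correct = sum(
#     query_model(f"{x['premise']} {x['question']} Answer yes or no.").strip().lower() == x["gold"]
#     for x in items
# ) / len(items)
```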

Evaluation Methods for Reasoning

The taxonomy of evaluation methods includes:

  • Conclusion-Based Evaluation: Focuses on the model's final answer rather than the reasoning trace, although it can uncover conceptual errors and the model's confidence or preference for certain answers.
  • Rationale-Based Evaluation: Examines the reasoning trace or rationale generated by models, using structured parsing or qualitative inspections.
  • Interactive Evaluation: Engages the model dynamically during evaluation, including adaptive testing or dialectic evaluation methods.
  • Mechanistic Evaluation: Investigates the model's internal processes, using methods like layer probing or activation patching to understand the reasoning mechanics.

The paper suggests these methods can provide valuable insights that go beyond simple accuracy metrics.
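To make the rationale-based category concrete, the sketch below splits a generated rationale into steps and scores each step for validity and informativeness. It is only schematic: published metrics such as ROSCOE and ReCEval rely on learned or entailment-based scorers, whereas the step checks here are crude placeholder heuristics introduced purely for illustration.

```python
from dataclasses import dataclass

# Schematic of rationale-based (step-level) evaluation; the heuristics are placeholders.

@dataclass
class StepScores:
    validity: float         # fraction of steps judged to follow from prior context
    informativeness: float  # fraction of steps adding something new

def split_steps(rationale: str) -> list[str]:
    return [s.strip() for s in rationale.splitlines() if s.strip()]

def step_is_valid(step: str, context: list[str]) -> bool:
    # Placeholder: a "valid" step reuses at least one word from the context.
    # A real metric would use e.g. an entailment model over the accumulated context.
    ctx_words = {w.lower() for line in context for w in line.split()}
    return any(w.lower() in ctx_words for w in step.split())

def step_is_redundant(step: str, context: list[str]) -> bool:
    # Placeholder: redundant here means repeating an earlier line verbatim.
    return step in context

def score_rationale(question: str, rationale: str) -> StepScores:
    steps, context = split_steps(rationale), [question]
    valid = redundant = 0
    for step in steps:
        valid += step_is_valid(step, context)
        redundant += step_is_redundant(step, context)
        context.append(step)
    n = max(len(steps), 1)
    return StepScores(validity=valid / n, informativeness=1 - redundant / n)

print(score_rationale(
    "Tom has 3 apples and buys 2 more. How many apples does Tom have?",
    "Tom starts with 3 apples.\nHe buys 2 more, so 3 + 2 = 5.\nTom has 5 apples.",
))
```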

Discussion and Implications

The research underscores the gap between LLMs' apparent proficiency on reasoning tasks and their genuine capability to reason. It highlights various challenges, particularly when models face scenarios that push beyond the limits of their training data. The paper encourages developing more sophisticated reasoning evaluation methods that mirror how human reasoning is assessed.

Conclusion

The survey concludes that while LLMs demonstrate some reasoning abilities, current evaluations fail to capture the depth of reasoning observed in human cognition. There is a need for innovative evaluation frameworks that focus on nuanced aspects of reasoning and provide a more comprehensive picture of LLMs' reasoning behavior. Such an approach could guide future advances in AI research, particularly in the reasoning capabilities of LLMs.
