Using Counterfactual Tasks to Evaluate the Generality of Analogical Reasoning in Large Language Models (2402.08955v1)
Abstract: Large language models (LLMs) have performed well on several reasoning benchmarks, including ones that test analogical reasoning abilities. However, it has been debated whether they are actually performing humanlike abstract reasoning or instead employing less general processes that rely on similarity to what has been seen in their training data. Here we investigate the generality of analogy-making abilities previously claimed for LLMs (Webb, Holyoak, & Lu, 2023). We take one set of analogy problems used to evaluate LLMs and create a set of "counterfactual" variants: versions that test the same abstract reasoning abilities but are likely dissimilar from anything in the pre-training data. We test humans and three GPT models on both the original and counterfactual problems, and show that, while human performance remains high on all the problems, the GPT models' performance declines sharply on the counterfactual set. This work provides evidence that, despite previously reported successes of LLMs at analogical reasoning, these models lack the robustness and generality of human analogy-making.
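The counterfactual manipulation described in the abstract can be sketched in code. The snippet below is an illustrative assumption about the setup, not the authors' actual materials: it poses the same abstract rule (replace the last letter with its successor) as a letter-string analogy over both the standard alphabet and a permuted one, so the counterfactual version preserves the abstract structure while changing the surface forms.

```python
import random

def successor(letter, alphabet):
    """Return the next letter in the given (possibly permuted) alphabet."""
    i = alphabet.index(letter)
    return alphabet[i + 1]

def make_problem(alphabet):
    """Build one 'abc -> abd; ijk -> ?' style problem over `alphabet`.

    The abstract rule is the same regardless of the alphabet's order:
    increment the final letter of the target string by one successor step.
    """
    src = "".join(alphabet[0:3])    # e.g. 'abc' in the standard alphabet
    tgt = "".join(alphabet[8:11])   # e.g. 'ijk' in the standard alphabet
    src_changed = src[:-1] + successor(src[-1], alphabet)
    answer = tgt[:-1] + successor(tgt[-1], alphabet)
    prompt = f"If {src} changes to {src_changed}, what does {tgt} change to?"
    return prompt, answer

standard = list("abcdefghijklmnopqrstuvwxyz")

# Counterfactual alphabet: same letters, but successorship now follows a
# shuffled order, so correct answers cannot be read off memorized sequences.
rng = random.Random(0)
permuted = standard.copy()
rng.shuffle(permuted)

orig_prompt, orig_answer = make_problem(standard)
cf_prompt, cf_answer = make_problem(permuted)
print(orig_prompt, "->", orig_answer)
print(cf_prompt, "->", cf_answer)
```

A human (or a genuinely abstract reasoner) can apply the same rule once told the new alphabet order; a system that relies on similarity to training data will find the permuted version much harder, which is the contrast the paper measures.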
- (2023). Leaping across the mental canyon: Higher-order long-distance analogical retrieval. Journal of Cognitive Psychology, 35(8), 856–875.
- Dziri, N., et al. (2023). Faith and fate: Limits of transformers on compositionality. In Proceedings of the Thirty-seventh Annual Conference on Neural Information Processing Systems (NeurIPS).
- (2010). Functional neural correlates of fluid and crystallized analogizing. NeuroImage, 49(4), 3489–3497.
- Hodel, D., & West, J. (2023). Response: Emergent analogical reasoning in large language models. arXiv preprint arXiv:2308.16118.
- Hofstadter, D. R. (1985). Metamagical Themas: Questing for the Essence of Mind and Pattern (chap. 24). New York, NY: Basic Books.
- Hofstadter, D. R., & Mitchell, M. (1994). The Copycat project: A model of mental fluidity and analogy-making. In K. J. Holyoak & J. A. Barnden (Eds.), Advances in Connectionist and Neural Computation Theory (Vol. 2, pp. 31–112). Norwood, NJ: Ablex.
- Huang, J., & Chang, K. C.-C. (2022). Towards reasoning in large language models: A survey. arXiv preprint arXiv:2212.10403.
- Kambhampati, S. (2023). Can LLMs really reason and plan? Communications of the ACM. (https://cacm.acm.org/blogs/blog-cacm/276268-can-llms-really-reason-and-plan/fulltext)
- (2015). Event-related potential responses to letter-string comparison analogies. Experimental Brain Research, 233, 1563–1573.
- McCoy, R. T., et al. (2023). Embers of autoregression: Understanding large language models through the problem they are trained to solve. arXiv preprint arXiv:2309.13638.
- Mitchell, M. (1993). Analogy-Making as Perception: A Computer Model (chap. 5). Cambridge, MA: MIT Press.
- Razeghi, Y., et al. (2022). Impact of pretraining term frequencies on few-shot numerical reasoning. In Findings of the Association for Computational Linguistics: EMNLP 2022 (pp. 840–854).
- Webb, T., Holyoak, K. J., & Lu, H. (2023). Emergent analogical reasoning in large language models. Nature Human Behaviour, 7(9), 1526–1541.
- Wei, J., et al. (2022). Emergent abilities of large language models. Transactions on Machine Learning Research.
- Wei, J., et al. (2022). Chain-of-thought prompting elicits reasoning in large language models. Advances in Neural Information Processing Systems, 35, 24824–24837.
- Wu, Z., et al. (2023). Reasoning or reciting? Exploring the capabilities and limitations of language models through counterfactual tasks. arXiv preprint arXiv:2307.02477.