LLMs achieve adult human performance on higher-order theory of mind tasks (2405.18870v2)

Published 29 May 2024 in cs.AI, cs.CL, and cs.HC

Abstract: This paper examines the extent to which LLMs have developed higher-order theory of mind (ToM); the human ability to reason about multiple mental and emotional states in a recursive manner (e.g. I think that you believe that she knows). This paper builds on prior work by introducing a handwritten test suite -- Multi-Order Theory of Mind Q&A -- and using it to compare the performance of five LLMs to a newly gathered adult human benchmark. We find that GPT-4 and Flan-PaLM reach adult-level and near adult-level performance on ToM tasks overall, and that GPT-4 exceeds adult performance on 6th order inferences. Our results suggest that there is an interplay between model size and finetuning for the realisation of ToM abilities, and that the best-performing LLMs have developed a generalised capacity for ToM. Given the role that higher-order ToM plays in a wide range of cooperative and competitive human behaviours, these findings have significant implications for user-facing LLM applications.


Summary

  • The paper demonstrates that GPT-4 and Flan-PaLM achieve human-level performance on complex higher-order ToM tasks using the MoToMQA benchmark.
  • The study employs rigorous testing with controlled human and LLM experiments, highlighting the impact of prompt design and anchoring effects.
  • Results indicate that model size and fine-tuning are crucial for attaining advanced ToM capabilities, with implications for AI ethics and real-world applications.

LLMs and Theory of Mind Tasks

The paper "LLMs achieve adult human performance on higher-order theory of mind tasks" (2405.18870) presents an extensive study of the capabilities of LLMs on Theory of Mind (ToM) tasks, which involve reasoning about recursive mental and emotional states. It introduces a new benchmark, Multi-Order Theory of Mind Question & Answer (MoToMQA), to compare LLM performance against a newly gathered adult human benchmark.

Introduction to Theory of Mind in LLMs

Theory of Mind (ToM) is a critical aspect of human social intelligence, enabling individuals to infer and interpret the mental states of others. Previous studies focused primarily on second-order ToM tasks, but this paper extends the analysis to orders 2 through 6, using a novel benchmark based on the Imposing Memory Task (IMT). The findings indicate that GPT-4 and Flan-PaLM achieve adult-level ToM capability, particularly excelling at higher-order tasks.
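The "order" of a ToM task refers to the depth of recursive embedding of mental-state clauses. A minimal sketch makes this concrete; the agent and verb lists below are purely illustrative and are not taken from the paper's stimuli:

```python
# Illustrative sketch: an nth-order theory-of-mind statement nests n
# mental-state clauses around a base fact. The agents, verbs, and base
# fact here are hypothetical examples, not items from MoToMQA.
AGENTS = ["I", "you", "she", "he", "they"]
VERBS = ["think", "believe", "know", "suspect"]


def tom_statement(order: int, fact: str = "the keys are in the drawer") -> str:
    """Return a ToM statement with `order` levels of mental-state embedding."""
    clause = fact
    for level in range(order):
        agent = AGENTS[level % len(AGENTS)]
        verb = VERBS[level % len(VERBS)]
        # Third-person singular agents take an "-s" verb ending.
        if agent not in ("I", "you", "they"):
            verb += "s"
        clause = f"{agent} {verb} that {clause}"
    return clause


print(tom_statement(3))
# A 3rd-order statement embeds three mental-state clauses, e.g.
# "she knows that you believe that I think that the keys are in the drawer"
```

Each additional order adds one more embedded clause, which is what makes 5th- and 6th-order inferences hard for both humans and models.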

Experimental Setup and Procedures

Human Testing

Human participants were screened for English proficiency and assigned to read stories and respond to statements about them, covering both ToM inferences and factual recall. The survey design included story conditions that controlled for recall ability and question orderings that tested for potential anchoring effects.

LLM Testing

Five models, GPT-3.5, GPT-4, LaMDA, PaLM, and Flan-PaLM, were evaluated on MoToMQA. Each model processed a story and statement, and its response was scored by comparing the log probabilities it assigned to the candidate answers. The paper also examined the influence of prompt design and anchoring effects on LLM performance.
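The log-probability scoring described above can be sketched as follows. This is a hedged illustration of the general technique, not the paper's exact implementation: `candidate_logprob` is a hypothetical stand-in for a real model call that would return the summed token log probabilities of a candidate completion given the prompt.

```python
import math

# Toy stand-in for a model API call. A real implementation would query the
# LLM for the log probability of `candidate` as a completion of `prompt`.
# The hard-coded scores below are purely illustrative.
def candidate_logprob(prompt: str, candidate: str) -> float:
    toy_scores = {"true": -0.4, "false": -1.2}
    return toy_scores[candidate]


def normalised_probs(scores: dict[str, float]) -> dict[str, float]:
    """Convert candidate log probabilities to a probability distribution."""
    m = max(scores.values())  # subtract the max for numerical stability
    exps = {c: math.exp(s - m) for c, s in scores.items()}
    total = sum(exps.values())
    return {c: e / total for c, e in exps.items()}


def answer(prompt: str, candidates=("true", "false")) -> str:
    """Pick the candidate answer with the highest log probability."""
    scores = {c: candidate_logprob(prompt, c) for c in candidates}
    return max(scores, key=scores.get)


print(answer("Does Anna believe that Tom knows where the keys are? Answer:"))
```

Scoring fixed candidate strings rather than free-form generations avoids parsing ambiguity and makes the evaluation comparable across models.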

Results

The paper revealed significant performance disparities among the tested models on ToM tasks, with GPT-4 and Flan-PaLM showing the highest accuracy, comparable to human performance (Figure 1).

Figure 1: Performance of humans and various LLMs on ToM tasks up to order 6.

  • GPT-4 Performance: Exceeds human performance on 6th-order ToM tasks, indicating a strong ability to manage complex recursive reasoning.
  • Flan-PaLM Performance: Achieves near-human performance but struggles with higher-order tasks relative to GPT-4.
  • Comparison Among Models: Smaller and non-finetuned models like LaMDA exhibit lower ToM capacities compared to their larger, instruction-finetuned counterparts, illustrating the importance of model size and finetuning.

Discussion

The paper underscores the relationship between LLM size and ToM capabilities, suggesting that only beyond a certain scale do models develop higher-order social reasoning, consistent with observed scaling behaviour in LLMs. Additionally, instruction-finetuned models like Flan-PaLM show improved ToM performance, plausibly because finetuning enhances their capacity to follow human-like instructions.

High-performance models have practical implications in real-world applications involving complex social interactions and moral reasoning. However, there are inherent risks associated with LLMs possessing advanced ToM abilities, such as the potential for manipulation, demanding robust ethics and alignment strategies.

Limitations and Future Research

The paper's limitations include its English-only benchmark and restricted evaluation scope. Further research is needed to develop linguistically and culturally diverse datasets and to extend the tasks beyond 6th-order inferences. Exploring multimodal training paradigms could also improve the generalisability and robustness of ToM capabilities in LLMs.

Conclusion

The paper demonstrates that LLMs, specifically GPT-4 and Flan-PaLM, can perform ToM tasks at a level commensurate with adult humans, marking significant progress in AI's cognitive modeling abilities. This highlights the necessity for continuous research into the mechanisms underpinning these capabilities and their implications for AI ethics and alignment.

These findings offer a new lens for assessing cognitive tasks performed by LLMs, challenging preconceived notions about their understanding and raising the question of when LLM performance on such tasks should be treated as functionally equivalent to human cognition.
