MMToM-QA: Multimodal Theory of Mind Question Answering (2401.08743v2)
Abstract: Theory of Mind (ToM), the ability to understand people's mental states, is an essential ingredient for developing machines with human-level social intelligence. Recent machine learning models, particularly LLMs, seem to show some aspects of ToM understanding. However, existing ToM benchmarks use unimodal datasets - either video or text. Human ToM, on the other hand, is more than video or text understanding. People can flexibly reason about another person's mind based on conceptual representations (e.g., goals, beliefs, plans) extracted from any available data. To address this, we introduce a multimodal Theory of Mind question answering (MMToM-QA) benchmark. MMToM-QA comprehensively evaluates machine ToM both on multimodal data and on different kinds of unimodal data about a person's activity in a household environment. To engineer multimodal ToM capacity, we propose a novel method, BIP-ALM (Bayesian Inverse Planning Accelerated by LLMs). BIP-ALM extracts unified representations from multimodal data and utilizes LLMs for scalable Bayesian inverse planning. We conducted a systematic comparison of human performance, BIP-ALM, and state-of-the-art models, including GPT-4. The experiments demonstrate that LLMs and large multimodal models still lack robust ToM capacity. BIP-ALM, on the other hand, shows promising results by leveraging the power of both model-based mental inference and LLMs.
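The core of Bayesian inverse planning can be illustrated with a minimal sketch. This is not the paper's BIP-ALM implementation: the goal hypotheses, the household scenario, and the hand-coded action likelihoods below are all assumptions for illustration. In BIP-ALM the likelihood term P(action | mental state) is estimated by a language model over unified symbolic representations; here it is a fixed lookup table.

```python
# Illustrative Bayesian inverse planning: maintain a posterior over
# hypothesized goals and update it after each observed action.
# posterior(g) ∝ prior(g) * P(action | goal = g)

def update_posterior(prior, likelihoods):
    """One Bayesian update over a discrete set of goal hypotheses."""
    unnorm = {g: prior[g] * likelihoods[g] for g in prior}
    z = sum(unnorm.values())
    return {g: p / z for g, p in unnorm.items()}

# Hypothetical household scenario: which object is the person looking for?
goals = ["find_cup", "find_apple"]
posterior = {g: 1.0 / len(goals) for g in goals}  # uniform prior

# Hand-coded stand-ins for P(action | goal); a BIP-ALM-style system would
# query a language model for these values instead.
observed_actions = [
    {"find_cup": 0.8, "find_apple": 0.3},  # walks to the kitchen cabinet
    {"find_cup": 0.9, "find_apple": 0.1},  # opens the cabinet
]
for lik in observed_actions:
    posterior = update_posterior(posterior, lik)

print(max(posterior, key=posterior.get))  # → find_cup
```

The same update applies to beliefs as well as goals: hypotheses that better explain the observed actions accumulate posterior mass, which is how the method answers questions about a person's likely mental state.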