MMToM-QA: Multimodal Theory of Mind Question Answering

(2401.08743)
Published Jan 16, 2024 in cs.AI, cs.CL, cs.CV, and cs.LG

Abstract

Theory of Mind (ToM), the ability to understand people's minds, is an essential ingredient for developing machines with human-level social intelligence. Recent machine learning models, particularly LLMs, seem to show some aspects of ToM understanding. However, existing ToM benchmarks use unimodal datasets - either video or text. Human ToM, on the other hand, is more than video or text understanding. People can flexibly reason about another person's mind based on conceptual representations (e.g., goals, beliefs, plans) extracted from any available data, which can include visual cues, linguistic narratives, or both. To address this, we introduce a multimodal Theory of Mind question answering (MMToM-QA) benchmark. MMToM-QA comprehensively evaluates machine ToM both on multimodal data and on different kinds of unimodal data about a person's activity in a household environment. To engineer multimodal ToM capacity, we propose a novel method, BIP-ALM (Bayesian Inverse Planning Accelerated by Language Models). BIP-ALM extracts unified representations from multimodal data and utilizes language models for scalable Bayesian inverse planning. We conducted a systematic comparison of human performance, BIP-ALM, and state-of-the-art models, including GPT-4. The experiments demonstrate that LLMs and large multimodal models still lack robust ToM capacity. BIP-ALM, on the other hand, shows promising results, by leveraging the power of both model-based mental inference and language models.

Figure: benchmark questions are paired with both video streams and text, testing a model's ability to fuse information across modalities.

Overview

  • The paper introduces MMToM-QA, a benchmark designed to evaluate Theory of Mind (ToM) in machines using multimodal data sources.

  • It presents a novel framework employing Bayesian inverse planning and language models to interpret video and text data for assessing AI ToM capabilities.

  • MMToM-QA combines videos and texts from domestic scenes with questions about the mental states of the people in them; answering correctly requires integrating information from both modalities.

  • A ground-truth annotated training set accompanies the benchmark, enabling detailed comparison between machine-generated and human responses.

  • BIP-ALM, also introduced in the paper, outperforms existing LLMs and large multimodal models in ToM reasoning and points toward future AI development in social intelligence.

Introduction to Multimodal Theory of Mind Benchmarking

The concept of Theory of Mind (ToM) represents the ability to attribute mental states to others, enabling individuals to predict and understand behaviors. In the pursuit of advancing social intelligence within artificial intelligence, a significant focus has been placed on evaluating machine ToM using a variety of benchmarks. Until now, these assessments have predominantly utilized unimodal datasets, restricted to either video or text. In real-world interactions, however, humans draw upon both visual and linguistic information to assess others’ mental states. To bridge this gap, a comprehensive Multimodal Theory of Mind question answering benchmark, named MMToM-QA, has been developed.

Evaluating ToM in AI

The MMToM-QA benchmark evaluates machine ToM on both video and text modalities, probing human-like reasoning about another person's beliefs, goals, and plans within household scenarios. It particularly targets multifaceted mental-state problems, such as tracking how a belief changes over time and inferring goals under differing belief conditions, and measures machine results against human performance. Alongside the benchmark, the paper proposes a method that combines Bayesian inverse planning, traditionally applied to video observations of behavior, with language models to interpret and analyze multimodal data; a minimal sketch of this inference idea follows.
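In spirit, Bayesian inverse planning treats observed actions as evidence about latent mental states and scores candidate (goal, belief) hypotheses by how well they explain those actions. Below is a minimal sketch of that idea, assuming a small discrete hypothesis space; the function names and the word-overlap likelihood are illustrative stand-ins, not the paper's implementation, and in BIP-ALM the likelihood term would instead come from a language model.

```python
import itertools
from typing import Dict, List, Tuple


def action_likelihood(action: str, state: str, goal: str, belief: str) -> float:
    """Toy stand-in for the policy term P(action | state, goal, belief).

    In BIP-ALM this term would be estimated with a language model; a crude
    word-overlap heuristic keeps the sketch self-contained and runnable.
    """
    context = set(f"{state} {goal} {belief}".lower().split())
    shared = set(action.lower().split()) & context
    return min(0.9, 0.1 + 0.1 * len(shared))


def inverse_planning(
    observations: List[Tuple[str, str]],   # sequence of (state, action) pairs
    goals: List[str],                      # candidate goals
    beliefs: List[str],                    # candidate beliefs
) -> Dict[Tuple[str, str], float]:
    """Score each (goal, belief) hypothesis by Bayes' rule:
    P(goal, belief | actions) is proportional to
    P(goal, belief) * product over t of P(action_t | state_t, goal, belief).
    """
    hypotheses = list(itertools.product(goals, beliefs))
    prior = 1.0 / len(hypotheses)           # uniform prior over hypotheses

    posterior = {}
    for goal, belief in hypotheses:
        likelihood = 1.0
        for state, action in observations:
            likelihood *= action_likelihood(action, state, goal, belief)
        posterior[(goal, belief)] = prior * likelihood

    total = sum(posterior.values()) or 1.0  # normalize into a distribution
    return {h: p / total for h, p in posterior.items()}


if __name__ == "__main__":
    obs = [("Mary is in the kitchen", "Mary opens the fridge")]
    scores = inverse_planning(
        obs,
        goals=["get the apple", "get a plate"],
        beliefs=["the apple is in the fridge", "the apple is on the table"],
    )
    for hypothesis, p in sorted(scores.items(), key=lambda kv: -kv[1]):
        print(f"{p:.2f}  goal={hypothesis[0]!r}  belief={hypothesis[1]!r}")
```

On the toy example, the fridge-related belief ends up ranked above the table belief, illustrating how accumulating action evidence shifts the posterior over hypothesized mental states.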

The Multimodal Framework

MMToM-QA pairs videos with textual descriptions of activities in a domestic environment, accompanied by questions about the mental states of the person in the scene; the questions are constructed so that answering them correctly requires integrating both modalities. A training set with ground-truth annotations allows models to be refined and evaluated, and supports detailed comparison between machine-generated and human responses. Because the human activity data are synthetic and procedurally generated, the benchmark scales easily and can be evaluated efficiently; an illustrative item format is sketched below.
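To make the input format concrete, one plausible way to represent a single benchmark item is shown below. The field names, the two-option format, and the example content are assumptions made for illustration only; the released dataset defines the actual schema.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class MMToMQAItem:
    """One hypothetical question item pairing video and text evidence."""
    video_frames: List[str]     # paths to RGB frames of the household activity
    text_narrative: str         # textual description of the scene and actions
    question: str               # query about the person's belief or goal
    options: List[str]          # candidate answers (e.g., two choices)
    answer_index: int           # index of the correct option (training split only)


example = MMToMQAItem(
    video_frames=["frame_0001.png", "frame_0002.png"],
    text_narrative="Mary walks into the kitchen and opens the fridge.",
    question="Which statement better describes Mary's belief?",
    options=[
        "Mary thinks the apple is inside the fridge.",
        "Mary thinks the apple is on the kitchen table.",
    ],
    answer_index=0,
)
```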

Insights and Implications

While established LLMs and large multimodal models show limited ToM reasoning on the benchmark, BIP-ALM, the proposed multimodal ToM model, performs notably better by combining model-based Bayesian inverse planning with the flexible reasoning abilities of language models. The approach interprets observed actions in the context of hypothesized mental states and produces judgments that track human responses more closely. MMToM-QA and BIP-ALM together underscore the need for multimodal understanding in social intelligence and suggest that machine ToM benefits from a hybrid of model-based inference and learned language models. The work stands to inform future AI development across applications, paving the way for more socially aware artificial agents; a sketch of how such hypothesis scoring could drive answer selection follows.
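One way to picture how this kind of model answers a benchmark question is to score each answer option's implied mental state by how probable the observed behavior is under it, then pick the higher-scoring option. The helper `lm_log_prob` and the prompt-style context string below are hypothetical stand-ins for a real language-model likelihood, assuming the candidate hypotheses have already been extracted from the question options.

```python
import math
from typing import List, Tuple


def lm_log_prob(action: str, context: str) -> float:
    """Hypothetical stand-in for the log-probability a language model assigns
    to an observed action given a context describing the current situation and
    one candidate mental state. Here: a toy word-overlap score."""
    shared = set(action.lower().split()) & set(context.lower().split())
    return math.log(0.05 + 0.05 * len(shared))


def choose_answer(observed: List[Tuple[str, str]], hypotheses: List[str]) -> int:
    """Return the index of the mental-state hypothesis (one per answer option)
    under which the observed (state, action) sequence is most probable."""
    scores = []
    for hypothesis in hypotheses:
        score = sum(
            lm_log_prob(action, f"{state}. Suppose {hypothesis}.")
            for state, action in observed
        )
        scores.append(score)
    return max(range(len(scores)), key=scores.__getitem__)
```

Working in log space keeps the accumulated likelihoods numerically stable when the observed action sequence is long.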
