Visual cognition in multimodal large language models

(arXiv:2311.16093)
Published Nov 27, 2023 in cs.LG

Abstract

A chief goal of artificial intelligence is to build machines that think like people. Yet it has been argued that deep neural network architectures fail to accomplish this. Researchers have asserted these models' limitations in the domains of causal reasoning, intuitive physics, and intuitive psychology. However, recent advancements, namely the rise of LLMs, particularly those designed for visual processing, have rekindled interest in the potential to emulate human-like cognitive abilities. This paper evaluates the current state of vision-based LLMs in the domains of intuitive physics, causal reasoning, and intuitive psychology. Through a series of controlled experiments, we investigate the extent to which these modern models grasp complex physical interactions, causal relationships, and intuitive understanding of others' preferences. Our findings reveal that, while these models demonstrate a notable proficiency in processing and interpreting visual data, they still fall short of human capabilities in these areas. The models exhibit a rudimentary understanding of physical laws and causal relationships, but their performance is hindered by a lack of deeper insight, a key aspect of human cognition. Furthermore, in tasks requiring an intuitive theory of mind, the models fail altogether. Our results emphasize the need for integrating more robust mechanisms for understanding causality, physical dynamics, and social cognition into modern vision-based language models, and point out the importance of cognitively inspired benchmarks.

Figure: Overview of the cognitive domains, the visual question-answering approach, and the multimodal LLMs used.

Overview

  • The paper evaluates modern vision-based LLMs against human cognitive processes in intuitive physics, causal reasoning, and intuitive psychology.

  • Through experiments, the authors found that models such as GPT-4V excel at simple perceptual tasks but struggle with complex reasoning and with intuitive human concepts.

  • Despite advancements, these AI models did not reach human performance levels in tasks requiring deep reasoning about physics, causality, and social cognition.

  • The research indicates the need to integrate more sophisticated mechanisms for causal, physical, and social understanding into AI models.

  • The paper stresses the necessity of benchmarks inspired by cognitive science for assessing AI models and suggests that current technology still falls short of replicating human cognition.

In recent years, advances in AI have produced highly sophisticated models that can interpret and respond to visual and textual information. They are sophisticated enough, in fact, that we might wonder whether these models have started to "think" like humans. In particular, vision-based LLMs, which pair language modeling with visual processing, have demonstrated impressive capabilities. However, research indicates that these models still do not fully emulate human cognitive processes in key areas.

The paper in focus evaluates the capabilities of several modern vision LLMs across three specific cognitive domains: intuitive physics, causal reasoning, and intuitive psychology. Intuitive physics involves predicting and understanding physical interactions; causal reasoning deals with understanding cause-and-effect relationships; and intuitive psychology involves inferring the mental states and intentions of others. Despite their complexity, these are areas where even young children demonstrate significant proficiency, suggesting that understanding and replicating these abilities is crucial for developing AI that truly mimics human thinking.
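
To make the three domains concrete, here is a minimal sketch of the kind of question templates a visual question-answering evaluation in these areas might use. The templates below are hypothetical illustrations and do not reproduce the paper's actual stimuli or wording.

```python
# Hypothetical question templates for the three cognitive domains, phrased as
# visual question-answering prompts about an accompanying image. Illustrative
# only; these are not the paper's actual prompts.
DOMAIN_QUESTIONS = {
    "intuitive_physics": [
        "Will this tower of blocks remain standing or fall over?",
    ],
    "causal_reasoning": [
        "If the red block were removed, would the rest of the tower collapse?",
    ],
    "intuitive_psychology": [
        "Given the path the agent takes in this image, which object does it prefer?",
    ],
}
```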

Through a series of experiments, the researchers investigated the models' performance on tasks such as predicting the stability of block towers and inferring the potential outcomes of removing certain blocks. GPT-4 with its visual processing component (GPT-4V), one of the largest such models, and several other multimodal models were put to the test. Although models like GPT-4V were proficient at elementary tasks such as identifying colors or counting objects in an image, they struggled when the tasks required more complex reasoning about physics and causality. Surprisingly, none of the models matched human performance levels in these cognitive domains.
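
As an illustration of how one such trial might be run against a vision-capable model, here is a minimal sketch using the OpenAI Python client. The model name, prompt, and image file are assumptions for illustration; the paper's exact prompts, stimuli, and scoring procedure are not reproduced here.

```python
import base64

from openai import OpenAI  # assumes the official OpenAI Python package (v1+) is installed

client = OpenAI()  # reads OPENAI_API_KEY from the environment


def ask_about_image(image_path: str, question: str) -> str:
    """Send a single image plus a question to a vision-capable chat model and return its answer."""
    with open(image_path, "rb") as f:
        image_b64 = base64.b64encode(f.read()).decode("utf-8")
    response = client.chat.completions.create(
        model="gpt-4o",  # assumed vision-capable model; the paper evaluated GPT-4V among others
        messages=[
            {
                "role": "user",
                "content": [
                    {"type": "text", "text": question},
                    {"type": "image_url",
                     "image_url": {"url": f"data:image/png;base64,{image_b64}"}},
                ],
            }
        ],
        max_tokens=50,
    )
    return response.choices[0].message.content


# Hypothetical usage with a rendered block-tower stimulus:
# answer = ask_about_image("block_tower.png", "Will this tower of blocks fall over? Answer yes or no.")
```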

The models also failed to demonstrate any significant aptitude in intuitive psychology tasks, which require inferring others' preferences from visual cues; this failure was consistent across all models tested.

The upshot is that, while modern vision-based LLMs have become quite adept at processing visual information, their capacity for deep reasoning and understanding of intuitive human concepts remains limited. The paper concludes that integrating more advanced mechanisms for causality, physical dynamics, and social cognition is necessary for further advancement. It also highlights the importance of developing benchmarks inspired by cognitive science to appropriately evaluate these AI models.

The research is a critical step in the continued effort to improve AI systems. It sheds light on current limitations and paves the way for future work exploring a broader range of cognitive domains and model variations. Nonetheless, the complexity of human cognition continues to pose a challenge to the current state of technology, reflecting the nuanced and multifaceted nature of our intellect. As AI models evolve, so too must the methods and benchmarks we use to measure their approximation of the human mind.
