Visual cognition in multimodal large language models

(arXiv:2311.16093)
Published Nov 27, 2023 in cs.LG

Abstract

A chief goal of artificial intelligence is to build machines that think like people. Yet it has been argued that deep neural network architectures fail to accomplish this. Researchers have asserted these models' limitations in the domains of causal reasoning, intuitive physics, and intuitive psychology. However, recent advancements, namely the rise of LLMs, particularly those designed for visual processing, have rekindled interest in the potential to emulate human-like cognitive abilities. This paper evaluates the current state of vision-based LLMs in the domains of intuitive physics, causal reasoning, and intuitive psychology. Through a series of controlled experiments, we investigate the extent to which these modern models grasp complex physical interactions, causal relationships, and intuitive understanding of others' preferences. Our findings reveal that, while these models demonstrate a notable proficiency in processing and interpreting visual data, they still fall short of human capabilities in these areas. The models exhibit a rudimentary understanding of physical laws and causal relationships, but their performance is hindered by a lack of deeper insight, a key aspect of human cognition. Furthermore, in tasks requiring an intuitive theory of mind, the models fail altogether. Our results emphasize the need for integrating more robust mechanisms for understanding causality, physical dynamics, and social cognition into modern vision-based language models, and point out the importance of cognitively inspired benchmarks.

Figure: Overview of the cognitive domains, the visual question-answering approach, and the multimodal LLMs used.

Overview

  • The paper evaluates modern vision-based LLMs against human cognitive processes in intuitive physics, causal reasoning, and intuitive psychology.

  • Through experiments, the authors found that models such as GPT-4V excel at simple perceptual tasks but struggle with complex reasoning and with intuitive human concepts.

  • Despite advancements, these AI models did not reach human performance levels in tasks requiring deep reasoning about physics, causality, and social cognition.

  • The research indicates the need to integrate more sophisticated mechanisms for causal, physical, and social understanding into AI models.

  • The paper stresses the necessity of benchmarks inspired by cognitive science for assessing AI models and suggests that current technology still falls short of replicating human cognition.

In recent years, advances in AI have produced highly sophisticated models that can interpret and respond to visual and textual information. They are sophisticated enough, in fact, that we might wonder whether these models have started to "think" like humans. In particular, vision-based LLMs, which pair language modeling with visual processing, have demonstrated impressive capabilities. However, research indicates that these models still do not fully emulate human cognitive processes in key areas.

The paper in focus evaluates the capabilities of several modern vision LLMs across three specific cognitive domains: intuitive physics, causal reasoning, and intuitive psychology. Intuitive physics involves predicting and understanding physical interactions; causal reasoning deals with understanding cause-and-effect relationships; and intuitive psychology involves inferring the mental states and intentions of others. Despite their complexity, these are areas where even young children demonstrate significant proficiency, suggesting that understanding and replicating these abilities is crucial for developing AI that truly mimics human thinking.
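
To make the three domains concrete, here is a minimal sketch of the kind of question templates a visual question-answering evaluation in these areas might use. The templates below are hypothetical illustrations and do not reproduce the paper's actual stimuli or wording.

```python
# Hypothetical question templates for the three cognitive domains, phrased as
# visual question-answering prompts about an accompanying image. Illustrative
# only; these are not the paper's actual prompts.
DOMAIN_QUESTIONS = {
    "intuitive_physics": [
        "Will this tower of blocks remain standing or fall over?",
    ],
    "causal_reasoning": [
        "If the red block were removed, would the rest of the tower collapse?",
    ],
    "intuitive_psychology": [
        "Given the path the agent takes in this image, which object does it prefer?",
    ],
}
```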

Through a series of experiments, the researchers investigated the models' performance on tasks such as predicting the stability of block towers and inferring the potential outcomes of removing certain blocks. GPT-4 with its visual processing component (GPT-4V), one of the largest such models, and several other multimodal models were put to the test. Although models like GPT-4V were proficient at elementary tasks such as identifying colors or counting objects in an image, they struggled when the tasks required more complex reasoning about physics and causality. Surprisingly, none of the models matched human performance levels in these cognitive domains.
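
As an illustration of how one such trial might be run against a vision-capable model, here is a minimal sketch using the OpenAI Python client. The model name, prompt, and image file are assumptions for illustration; the paper's exact prompts, stimuli, and scoring procedure are not reproduced here.

```python
import base64

from openai import OpenAI  # assumes the official OpenAI Python package (v1+) is installed

client = OpenAI()  # reads OPENAI_API_KEY from the environment


def ask_about_image(image_path: str, question: str) -> str:
    """Send a single image plus a question to a vision-capable chat model and return its answer."""
    with open(image_path, "rb") as f:
        image_b64 = base64.b64encode(f.read()).decode("utf-8")
    response = client.chat.completions.create(
        model="gpt-4o",  # assumed vision-capable model; the paper evaluated GPT-4V among others
        messages=[
            {
                "role": "user",
                "content": [
                    {"type": "text", "text": question},
                    {"type": "image_url",
                     "image_url": {"url": f"data:image/png;base64,{image_b64}"}},
                ],
            }
        ],
        max_tokens=50,
    )
    return response.choices[0].message.content


# Hypothetical usage with a rendered block-tower stimulus:
# answer = ask_about_image("block_tower.png", "Will this tower of blocks fall over? Answer yes or no.")
```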

The models also failed to demonstrate any significant aptitude in intuitive psychology tasks, which require inferring others' preferences from visual cues; this failure was consistent across all models tested.

The upshot is that, while modern vision-based LLMs have become quite adept at processing visual information, their capacity for deep reasoning and understanding of intuitive human concepts remains limited. The paper concludes that integrating more advanced mechanisms for causality, physical dynamics, and social cognition is necessary for further advancement. It also highlights the importance of developing benchmarks inspired by cognitive science to appropriately evaluate these AI models.

The research is a critical step in the continued effort to improve AI systems. It sheds light on current limitations and paves the way for future work exploring a broader range of cognitive domains and model variations. Nonetheless, the complexity of human cognition continues to pose a challenge to the current state of technology, reflecting the nuanced and multifaceted nature of our intellect. As AI models evolve, so too must the methods and benchmarks we use to measure their approximation of the human mind.
