Emergent Mind

The 3D-PC: a benchmark for visual perspective taking in humans and machines

(2406.04138)
Published Jun 6, 2024 in cs.CV and cs.HC

Abstract

Visual perspective taking (VPT) is the ability to perceive and reason about the perspectives of others. It is an essential feature of human intelligence, which develops over the first decade of life and requires an ability to process the 3D structure of visual scenes. A growing number of reports have indicated that deep neural networks (DNNs) become capable of analyzing 3D scenes after training on large image datasets. We investigated if this emergent ability for 3D analysis in DNNs is sufficient for VPT with the 3D perception challenge (3D-PC): a novel benchmark for 3D perception in humans and DNNs. The 3D-PC comprises three 3D-analysis tasks posed within natural scene images: 1. a simple test of object depth order, 2. a basic VPT task (VPT-basic), and 3. another version of VPT (VPT-Strategy) designed to limit the effectiveness of "shortcut" visual strategies. We tested human participants (N=33) and linearly probed or text-prompted over 300 DNNs on the challenge and found that nearly all of the DNNs approached or exceeded human accuracy in analyzing object depth order. Surprisingly, DNN accuracy on this task correlated with their object recognition performance. In contrast, there was an extraordinary gap between DNNs and humans on VPT-basic. Humans were nearly perfect, whereas most DNNs were near chance. Fine-tuning DNNs on VPT-basic brought them close to human performance, but they, unlike humans, dropped back to chance when tested on VPT-Strategy. Our challenge demonstrates that the training routines and architectures of today's DNNs are well-suited for learning basic 3D properties of scenes and objects but are ill-suited for reasoning about these properties as humans do. We release our 3D-PC datasets and code to help bridge this gap in 3D perception between humans and machines.

VPT, studied since the mid-20th century, is essential for AI assistants that must anticipate human behavior.

Overview

  • The 3D-PC: a benchmark for visual perspective taking in humans and machines by Linsley et al. introduces a novel benchmark, the 3D Perception Challenge (3D-PC), to evaluate the 3D perceptual capabilities of both humans and deep neural networks (DNNs) through three tasks: depth order, VPT-basic, and VPT-Strategy.

  • The benchmark reveals that while DNNs match or exceed humans at discerning depth order, they struggle significantly with the more complex visual perspective-taking (VPT) tasks, even after fine-tuning, highlighting a gap in current DNN models' ability to emulate human-like 3D reasoning.

  • The research underscores the limitations of current DNN training on static image datasets for sophisticated 3D reasoning, advocating for embodied learning approaches and novel neural architectures inspired by human cognition to bridge performance gaps.

The 3D-PC: A Benchmark for Visual Perspective Taking in Humans and Machines

The 3D-PC: a benchmark for visual perspective taking in humans and machines by Linsley et al. introduces a novel benchmark designed to evaluate the 3D perceptual capabilities of both humans and deep neural networks (DNNs). This benchmark, termed the 3D Perception Challenge (3D-PC), addresses the crucial cognitive task of Visual Perspective Taking (VPT), which is the ability to perceive and reason about the visual perspectives of others.

Summary

Visual Perspective Taking (VPT) is indispensable for human intelligence, aiding in social interactions and navigation. Despite its importance, the capability of DNNs to perform VPT remains under-investigated. The 3D-PC proposed by Linsley et al. aims to close this gap by offering a robust framework for testing 3D perception in both humans and machines through three distinct tasks: depth order, VPT-basic, and VPT-Strategy.

Methods and Experimental Setup

The 3D-PC comprises three tasks embedded within natural scene images:

  1. Depth Order: A task assessing the relative depth ordering of objects.
  2. VPT-basic: The basic VPT task, requiring judgments about what is visible from another viewpoint in the scene.
  3. VPT-Strategy: A perturbed version of VPT-basic designed to limit the effectiveness of "shortcut" visual strategies.

Testing was conducted on 33 human participants and over 300 DNN models, covering a diverse range of architectures and training paradigms. Notably, the models included state-of-the-art systems such as Vision Transformers (ViTs), DINOv2, and Claude 3.
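The paper's linear-probing protocol can be sketched as fitting a single linear readout on frozen embeddings from a pretrained backbone. The sketch below is illustrative, not the authors' code: random features stand in for a real model's embeddings, and the labels are synthetic binary depth-order judgments.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Stand-in for frozen DNN embeddings of 3D-PC images (e.g., a ViT's
# pooled features); in the real protocol these come from a pretrained model.
n_images, feat_dim = 1000, 128
features = rng.normal(size=(n_images, feat_dim))
# Synthetic binary labels, e.g. depth order ("is the target in front?").
labels = rng.integers(0, 2, size=n_images)

X_tr, X_te, y_tr, y_te = train_test_split(
    features, labels, test_size=0.2, random_state=0
)

# Linear probe: a logistic-regression readout trained on frozen features;
# the backbone itself is never updated.
probe = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
accuracy = probe.score(X_te, y_te)
print(f"probe accuracy: {accuracy:.2f}")
```

With random features the probe lands near chance (0.5); with a backbone that has genuinely learned 3D structure, the same readout would score well above it, which is what makes linear probing a clean test of emergent abilities.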

Key Findings

Divergence in Performance:

  • Depth Order: DNNs demonstrated performance comparable to, or exceeding, human participants in discerning object depth order. Their performance showed a strong correlation with their object classification accuracy on ImageNet.
  • VPT-basic: A significant performance gap was observed between humans and DNNs. Humans were nearly flawless, while most DNNs performed at chance level.
  • VPT-Strategy: Fine-tuning DNNs on VPT-basic improved their performance to near-human levels. However, when these fine-tuned models were tested on VPT-Strategy, their performance fell back to chance.
  • Correlations with Object Recognition: DNN performance on both depth order and VPT-basic tasks was notably correlated with their object recognition accuracy on ImageNet. This suggests that as DNNs scale, emergent capabilities for processing 3D properties coincide with improvements in object recognition.
  • Visual Strategies: Despite fine-tuning, DNNs failed the VPT-Strategy task, highlighting the reliance on brittle feature-based strategies rather than robust perspective-taking methods, which humans employ naturally.
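The reported link between depth-order accuracy and ImageNet performance is a correlation computed across models. A minimal sketch, using made-up scores (not numbers from the paper), shows how such a per-model correlation is measured:

```python
import numpy as np

# Hypothetical per-model scores: ImageNet top-1 accuracy vs. depth-order
# accuracy on the 3D-PC. Illustrative values only, one entry per model.
imagenet_acc = np.array([0.55, 0.62, 0.70, 0.76, 0.81, 0.85])
depth_order_acc = np.array([0.60, 0.66, 0.72, 0.80, 0.83, 0.88])

# Pearson correlation across models: a high r means the emergent 3D
# ability tracks object-recognition performance.
r = np.corrcoef(imagenet_acc, depth_order_acc)[0, 1]
print(f"Pearson r = {r:.3f}")
```

A strong positive r across hundreds of models is what supports the claim that basic 3D analysis emerges as a byproduct of scaling object recognition, while the chance-level VPT-Strategy results show that this correlation does not extend to perspective-taking.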

Implications

The research underscores the inadequacy of current DNN architectures and training routines in emulating human-like 3D perceptual reasoning, especially concerning VPT. It calls attention to:

  • Limitations of Static Image Training: While training on large-scale static image datasets develops basic 3D perceptual capabilities in DNNs, it falls short in enabling sophisticated reasoning akin to human VPT.
  • Need for Embodied Learning: The findings advocate for incorporating insights from human cognition and neuroscience, particularly emphasizing embodied experiences, to develop DNNs with improved 3D reasoning.
  • Benchmarking as a Tool for Advancements: By providing the 3D-PC data and code, this research facilitates the broader scientific community in addressing the limitations of current models, potentially leading to AI systems that better understand and anticipate human behavior.

Future Directions

This research opens several avenues for future exploration:

  1. Enhanced Training Paradigms: Investigating hybrid training frameworks that include dynamic and interactive environments could lead to more robust 3D reasoning in DNNs.
  2. Architectural Innovations: Developing novel neural network architectures that more closely mimic the feedforward and feedback processes of the human visual system may bridge the current performance gaps.
  3. Broader VPT Tasks: Expanding the variety of VPT tasks within the 3D-PC could provide deeper insights into the specific capabilities and limitations of DNNs in 3D perception.

In conclusion, while the 3D-PC highlights commendable advancements in DNN capabilities for 3D perception, it also elucidates the significant challenges that remain in achieving human-like VPT in machines. The development and dissemination of the 3D-PC benchmark stand as pivotal steps towards fostering innovation and progress in this crucial area of artificial intelligence.
