
KiVA: Kid-inspired Visual Analogies for Testing Large Multimodal Models

(2407.17773)
Published Jul 25, 2024 in cs.CV, cs.AI, cs.CL, and cs.LG

Abstract

This paper investigates visual analogical reasoning in large multimodal models (LMMs) compared to human adults and children. A "visual analogy" is an abstract rule inferred from one image and applied to another. While benchmarks exist for testing visual reasoning in LMMs, they require advanced skills and omit basic visual analogies that even young children can make. Inspired by developmental psychology, we propose a new benchmark of 1,400 visual transformations of everyday objects to test LMMs on visual analogical reasoning and compare them to children and adults. We structure the evaluation into three stages: identifying what changed (e.g., color, number, etc.), how it changed (e.g., added one object), and applying the rule to new scenarios. Our findings show that while models like GPT-4V, LLaVA-1.5, and MANTIS identify the "what" effectively, they struggle with quantifying the "how" and extrapolating this rule to new objects. In contrast, children and adults exhibit much stronger analogical reasoning at all three stages. Additionally, the strongest tested model, GPT-4V, performs better in tasks involving simple visual attributes like color and size, correlating with quicker human adult response times. Conversely, more complex tasks such as number, rotation, and reflection, which necessitate extensive cognitive processing and understanding of the 3D physical world, present more significant challenges. Altogether, these findings highlight the limitations of training models on data that primarily consists of 2D images and text.

Figure: Human adults' performance on cross-domain and within-domain change detection, and extrapolation to new images.

Overview

  • The paper introduces 'KiVA,' a new benchmark inspired by developmental psychology, to evaluate the visual analogical reasoning abilities of large multimodal models (LMMs).

  • The study outlines a structured evaluation process involving 1,400 visual transformations and tests LMMs like GPT-4V, LLaVA-1.5, and MANTIS across three stages of change detection and rule application.

  • Results show LMMs excel in simple transformations but struggle with complex tasks, highlighting substantial gaps in analogical reasoning capabilities between humans and current AI models.

Evaluating Visual Analogical Reasoning in Large Multimodal Models: Insights from the KiVA Benchmark

The paper "KiVA: Kid-inspired Visual Analogies for Testing Large Multimodal Models," authored by researchers from the University of California, Berkeley, Boston University, and Google DeepMind, presents an in-depth investigation into the visual analogical reasoning abilities of large multimodal models (LMMs). The study is inspired by developmental psychology and asserts that current benchmark evaluations for LMMs do not adequately account for basic visual analogies that accessible to young children.

Introduction

Visual analogical reasoning involves drawing parallels between visual scenarios and applying abstract rules inferred from one context to another. While human beings, including children, can effectively perform these tasks, LMMs have not been comprehensively tested in this capacity. The paper introduces a novel benchmark—KiVA, comprising 1,400 visual transformations of everyday objects. This benchmark is used to test LMMs' visual analogical reasoning abilities in a structured evaluation composed of three stages: identifying what changed, how it changed, and applying the rule to novel objects.

Methodology

The authors delineate the testing procedure into three explicit stages:

  1. Cross-domain change detection: Identifying what changed in the image.
  2. Within-domain change detection: Specifying how the identified attribute changed.
  3. Visual extrapolation: Applying the inferred rule to new scenarios.

Each visual analogy involves transformations across domains such as color, size, number, rotation, and reflection.
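For concreteness, the sketch below walks a single trial through these three stages. It is a minimal sketch under stated assumptions: the Trial fields, the prompt wording, and the query_model helper are illustrative stand-ins, not the paper's actual schema, prompts, or evaluation interface.

```python
from dataclasses import dataclass

# Hypothetical trial structure; field names are illustrative, not the paper's schema.
@dataclass
class Trial:
    train_before: str          # image of the training object before the transformation
    train_after: str           # image of the training object after the transformation
    test_before: str           # image of a new object before the same transformation
    domain: str                # ground-truth domain: "color", "size", "number", "rotation", "reflection"
    change: str                # ground-truth within-domain change, e.g. "+1" or "90 degrees clockwise"
    extrapolation_answer: str  # label of the correct "after" candidate, e.g. "A"

def evaluate_trial(trial, query_model):
    """Run the three KiVA-style stages on one trial.

    query_model(images, question) is an assumed helper that sends images and a
    question to the model under test and returns its text answer.
    """
    # Stage 1: cross-domain change detection -- what kind of attribute changed?
    what = query_model(
        [trial.train_before, trial.train_after],
        "Which attribute changed: color, size, number, rotation, or reflection?",
    )
    # Stage 2: within-domain change detection -- how did that attribute change?
    how = query_model(
        [trial.train_before, trial.train_after],
        f"How did the {trial.domain} change?",
    )
    # Stage 3: visual extrapolation -- apply the inferred rule to a new object.
    pick = query_model(
        [trial.train_before, trial.train_after, trial.test_before],
        "Which candidate image shows the new object after the same change? Answer A, B, or C.",
    )
    return {
        "what_correct": what.strip().lower() == trial.domain,
        "how_correct": how.strip().lower() == trial.change.lower(),
        "extrapolation_correct": pick.strip().upper() == trial.extrapolation_answer,
    }
```

Because the stages build on one another, scoring them separately makes it possible to see where a model's reasoning breaks down: at identifying the changed attribute, at quantifying the change, or at applying the rule to a new object.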

Results

The findings reveal a consistent pattern. LMMs including GPT-4V, LLaVA-1.5, and MANTIS show competency in detecting certain types of changes, particularly simpler transformations such as color and size. However, they struggle significantly with more complex analogical tasks, such as those involving numerical and spatial transformations:

  • Quantitative Results: While GPT-4V outperformed the other models, its performance declined sharply on more complex tasks. Its highest accuracy was on tasks involving color and size; the drop on number, rotation, and reflection likely reflects training data that consists predominantly of 2D images and text (a sketch of how such domain-level results can be aggregated follows this list).
  • Human vs. Model Performance: Humans, including young children, demonstrated consistently strong performance across all stages, underscoring a wide gap in analogical reasoning abilities between humans and current LMMs.
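As a minimal sketch of that domain-level comparison, the helper below groups per-trial outcomes (as produced by evaluate_trial above) by transformation domain. The input and output layouts are assumptions for illustration, not the paper's reporting format.

```python
from collections import defaultdict

def accuracy_by_domain(results):
    """Aggregate per-stage accuracy by transformation domain.

    `results` is assumed to be a list of (domain, stage_outcomes) pairs, where
    stage_outcomes is the dict returned by evaluate_trial above. Grouping by
    domain mirrors the kind of comparison discussed in the paper (color/size
    versus number/rotation/reflection).
    """
    hits = defaultdict(lambda: defaultdict(int))   # domain -> stage -> correct count
    counts = defaultdict(int)                      # domain -> number of trials
    for domain, outcome in results:
        counts[domain] += 1
        for stage, correct in outcome.items():
            hits[domain][stage] += int(correct)
    return {
        domain: {stage: n_correct / counts[domain] for stage, n_correct in stages.items()}
        for domain, stages in hits.items()
    }
```

Comparing these per-domain accuracies against human scores (for both adults and children) is what exposes the gap described below.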

Discussion

The decline in model performance on increasingly complex tasks highlights notable limitations in current LMM architectures. Because these models are trained primarily on vast datasets of 2D images and text, they handle best the simple visual properties that are well represented in that data. Transformations that require understanding the 3D physical world, such as object rotation, reflection, or changes in number, remain challenging for them. These results align with prior research indicating that tasks requiring greater cognitive processing, grounded in human neural and developmental processes, present significant hurdles for artificial models.

Implications and Future Work

Theoretical Implications: The results highlight clear disparities in visual cognition between humans and machines, particularly in analogical reasoning. Despite progress in AI, current LMMs do not yet approximate the nuanced and flexible cognitive abilities of humans, even at the elementary level exhibited by young children.

Practical Implications: From an application standpoint, the inability of LMMs to generalize complex visual transformations raises questions about their deployment in real-world scenarios that demand strong analogical reasoning. Applications in areas such as autonomous driving, healthcare imaging, and interactive AI would need substantial advancements before these models could be relied on for critical tasks.

Future Developments: Future research should pursue training regimes and architectures that can represent and process 3D physical transformations more effectively. Additionally, the KiVA benchmark itself, particularly its adult-oriented subset, can serve as a more stringent testbed for future model evaluations. Exploring multimodal models that integrate richer sensor data and embodied experience may offer a path toward overcoming the current limitations.

Conclusion

The KiVA benchmark presents a rigorous and insightful framework for evaluating visual analogical reasoning in LMMs. While these models show some capacity for recognizing basic visual transformations, they fall short of matching the analogical reasoning abilities of humans, including children as young as three years old. This study calls for a reevaluation of training data and model architecture to bridge the observed gap, potentially paving the way for AI systems that can more closely emulate human cognitive processes in visual reasoning tasks.
