
When and why vision-language models behave like bags-of-words, and what to do about it?

(2210.01936)
Published Oct 4, 2022 in cs.CV, cs.AI, cs.CL, and cs.LG

Abstract

Despite the success of large vision and language models (VLMs) in many downstream applications, it is unclear how well they encode compositional information. Here, we create the Attribution, Relation, and Order (ARO) benchmark to systematically evaluate the ability of VLMs to understand different types of relationships, attributes, and order. ARO consists of Visual Genome Attribution, to test the understanding of objects' properties; Visual Genome Relation, to test for relational understanding; and COCO & Flickr30k-Order, to test for order sensitivity. ARO is orders of magnitude larger than previous benchmarks of compositionality, with more than 50,000 test cases. We show where state-of-the-art VLMs have poor relational understanding, can blunder when linking objects to their attributes, and demonstrate a severe lack of order sensitivity. VLMs are predominantly trained and evaluated on large datasets with rich compositional structure in the images and captions. Yet, training on these datasets has not been enough to address the lack of compositional understanding, and evaluating on these datasets has failed to surface this deficiency. To understand why these limitations emerge and are not represented in the standard tests, we zoom into the evaluation and training procedures. We demonstrate that it is possible to perform well on retrieval over existing datasets without using the composition and order information. Given that contrastive pretraining optimizes for retrieval on datasets with similar shortcuts, we hypothesize that this can explain why the models do not need to learn to represent compositional information. This finding suggests a natural solution: composition-aware hard negative mining. We show that a simple-to-implement modification of contrastive learning significantly improves the performance on tasks requiring understanding of order and compositionality.

The ARO benchmark tests models' understanding of composition, relations, and order in captions, highlighting deficiencies in current VLMs.

Overview

  • This paper introduces the ARO benchmark to evaluate the ability of vision and language models (VLMs) to understand compositional relationships, attributes, and order, revealing substantial deficiencies.

  • The ARO benchmark consists of tasks assessing models' comprehension of object attributes, relational dynamics, and word order in captions; models such as CLIP and BLIP are evaluated against it.

  • An investigation of standard evaluation metrics and training procedures suggests that current VLMs may neglect deeper compositional and sequential detail because the objectives they are optimized for do not demand it.

  • The study proposes composition-aware hard negative mining, a simple modification to contrastive training that improves VLM performance on tasks requiring compositional and order understanding.

Evaluating Vision-Language Models' Comprehension of Composition and Order

Introduction

Vision and language models (VLMs) have shown remarkable capabilities across benchmark tasks, yet their proficiency in comprehending compositional relationships, attributes, and order remains underexplored. Through the Attribution, Relation, and Order (ARO) benchmark, this study systematically examines these aspects of VLM understanding. ARO comprises more than 50,000 test cases across four tasks, offering a comprehensive evaluation of how well VLMs grasp object properties, relational dynamics, and word order in captions. Our findings reveal that despite being trained on extensive datasets rich in compositional detail, current VLMs exhibit substantial deficiencies in these areas.

The ARO Benchmark

The ARO benchmark is designed to probe VLMs' understanding in three principal domains:

  • Visual Genome Attribution and Relation: These two tasks assess models' abilities to comprehend the attributes of objects and their relational dynamics within an image, respectively. The challenge lies in distinguishing correct from incorrect attributions or relations, such as identifying "the horse is eating the grass" as correct over "the grass is eating the horse."
  • COCO-Order & Flickr30k-Order: These tasks probe models' sensitivity to word order, presenting VLMs with a correctly ordered caption alongside scrambled permutations of it. Models must assign the highest score to the intact caption, as sketched below.
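To make the protocol concrete, here is a minimal sketch of how a single ARO-style test case can be scored with CLIP. It assumes the openai/CLIP package is installed; the image path and caption pair are illustrative, and the benchmark's own evaluation harness differs in detail.

```python
import clip  # openai/CLIP: pip install git+https://github.com/openai/CLIP.git
import torch
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

# One ARO-style test case: a true caption and a relation-swapped alternative.
# "horse.jpg" is an illustrative path, not part of the benchmark.
image = preprocess(Image.open("horse.jpg")).unsqueeze(0).to(device)
captions = [
    "the horse is eating the grass",  # correct
    "the grass is eating the horse",  # relation swapped
]
text = clip.tokenize(captions).to(device)

with torch.no_grad():
    image_feat = model.encode_image(image)
    text_feat = model.encode_text(text)

# Cosine similarity between the image and each caption.
image_feat = image_feat / image_feat.norm(dim=-1, keepdim=True)
text_feat = text_feat / text_feat.norm(dim=-1, keepdim=True)
sims = (image_feat @ text_feat.T).squeeze(0)

# The model passes this test case only if the true caption scores highest.
print("pass" if sims.argmax().item() == 0 else "fail", sims.tolist())
```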

The performance of several leading VLMs, including CLIP, BLIP, FLAVA, and X-VLM, was evaluated. The results indicate a pervasive struggle across models to accurately represent compositional information, with particular difficulty in relational understanding and order sensitivity.

Limitations of Current Evaluation Protocols

A closer look at standard evaluation metrics and training procedures offers insight into these limitations. Notably, VLMs can achieve high performance on image-text retrieval, a common evaluation task, without accurately comprehending order or composition. This raises questions about whether such tasks capture the depth of VLMs' understanding.

Further examination suggests that the prevailing contrastive pretraining objective aligns closely with retrieval. This alignment, combined with training datasets that rarely force models to distinguish compositional variations, may inadvertently encourage models to overlook richer compositional and sequential detail. In essence, models have little incentive to encode this information deeply, since it barely affects performance on the objectives they are optimized for. The probe sketched below makes this concrete.
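The following sketch probes order sensitivity directly by scoring an image against an intact caption and a word-shuffled version of it. It is an illustration, not the paper's exact protocol; the checkpoint, caption, and image path are assumptions for demonstration.

```python
import random

import clip  # openai/CLIP, as in the earlier sketch
import torch
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

caption = "a brown horse is eating the green grass"
words = caption.split()
random.Random(0).shuffle(words)  # destroy word order, keep the bag of words
shuffled = " ".join(words)

# "horse.jpg" is again an illustrative path.
image = preprocess(Image.open("horse.jpg")).unsqueeze(0).to(device)
text = clip.tokenize([caption, shuffled]).to(device)

with torch.no_grad():
    logits_per_image, _ = model(image, text)  # image-to-text similarity logits

# If the two scores come out close, word order contributed little to the match.
print(dict(zip([caption, shuffled], logits_per_image.squeeze(0).tolist())))
```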

Advancing Compositional Understanding

In response to these findings, this study proposes composition-aware hard negative mining, a modification to standard contrastive training. By incorporating challenging negatives that differ from the true caption only in composition or word order, VLMs are pushed to develop a more refined understanding of these aspects. Experimental results show that this straightforward modification notably improves performance on tasks requiring deep compositional understanding, without compromising capabilities on other benchmarks. A minimal sketch of the modified loss follows.
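The sketch below illustrates the idea in PyTorch: each caption in a batch is perturbed (here by swapping two random words; the paper's perturbations are more linguistically targeted, e.g. exchanging nouns or adjectives), and the perturbed-caption embeddings are appended as extra negatives in an InfoNCE-style loss. All function names here are illustrative, not the paper's code.

```python
import random

import torch
import torch.nn.functional as F


def make_hard_negative(caption: str, rng: random.Random) -> str:
    """Create a composition-aware negative by swapping two words.

    A random swap is the simplest stand-in for the paper's more
    targeted linguistic perturbations.
    """
    words = caption.split()
    if len(words) > 1:
        i, j = rng.sample(range(len(words)), 2)
        words[i], words[j] = words[j], words[i]
    return " ".join(words)


def contrastive_loss_with_hard_negatives(
    image_feats: torch.Tensor,     # (B, D) L2-normalized image embeddings
    text_feats: torch.Tensor,      # (B, D) embeddings of the true captions
    neg_text_feats: torch.Tensor,  # (B, D) embeddings of the perturbed captions
    temperature: float = 0.07,
) -> torch.Tensor:
    """InfoNCE where each image must pick its caption out of [true | perturbed]."""
    all_texts = torch.cat([text_feats, neg_text_feats], dim=0)  # (2B, D)
    logits = image_feats @ all_texts.T / temperature            # (B, 2B)
    # Target for image i is its own caption at column i; every perturbed
    # caption (columns B..2B-1) acts as an extra, harder negative.
    targets = torch.arange(image_feats.size(0), device=logits.device)
    return F.cross_entropy(logits, targets)
```

In practice the perturbed captions are encoded with the same text encoder as the originals before being passed in. The paper's full method additionally draws on nearest-neighbor images as hard negatives, which this sketch omits.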

Conclusion

Our research presents a critical evaluation of VLMs' ability to comprehend compositional relationships, attributes, and order information, using the ARO benchmark. The results expose significant shortcomings in current models and suggest a need to reevaluate existing training and evaluation practices. Incorporating composition-aware hard negatives into training offers a viable path toward improving VLMs' understanding of how objects, attributes, and relations compose in images and captions. As the field progresses, fostering deeper comprehension of composition and order in VLMs remains essential, promising gains in applicability and performance across a broader array of tasks.
