Are Vision Language Models Texture or Shape Biased and Can We Steer Them?

(2403.09193)
Published Mar 14, 2024 in cs.CV, cs.AI, cs.LG, and q-bio.NC

Abstract

Vision language models (VLMs) have drastically changed the computer vision model landscape in only a few years, opening an exciting array of new applications, from zero-shot image classification to image captioning and visual question answering. Unlike pure vision models, they offer an intuitive way to access visual content through language prompting. The wide applicability of such models encourages us to ask whether they also align with human vision - specifically, how far they adopt human-induced visual biases through multimodal fusion, or whether they simply inherit biases from pure vision models. One important visual bias is the texture vs. shape bias, or the dominance of local over global information. In this paper, we study this bias in a wide range of popular VLMs. Interestingly, we find that VLMs are often more shape-biased than their vision encoders, indicating that visual biases are modulated to some extent through text in multimodal models. If text does indeed influence visual biases, this suggests that we may be able to steer visual biases not just through visual input but also through language: a hypothesis that we confirm through extensive experiments. For instance, we are able to steer shape bias from as low as 49% to as high as 72% through prompting alone. For now, the strong human bias towards shape (96%) remains out of reach for all tested VLMs.

VLMs are more shape-biased than their vision encoders in object recognition; language prompts can steer this bias, though less strongly than changes to the visual input.

Overview

  • This paper investigates the texture vs. shape bias in Vision Language Models (VLMs) and explores whether these biases can be modulated through linguistic prompts.

  • Research reveals that, unlike vision-only models, which prefer texture, VLMs tend to exhibit a more pronounced shape bias, closer to human visual perception, though they still fall short of fully emulating it.

  • The study examines how task-specific prompting and visual input alterations can influence VLMs' texture and shape biases, finding that biases can be steered to a degree through linguistic and visual manipulations.

  • The findings suggest both theoretical and practical implications for the future of multimodal learning and the development of more nuanced visual understanding in technology.

Exploring the Texture vs. Shape Bias in Vision Language Models (VLMs)

Introduction

Vision Language Models (VLMs) have evolved to be a pivotal component in the intersection of computer vision and natural language processing, enabling a myriad of applications from zero-shot image classification to comprehensive image captioning. A fascinating question that arises in the context of VLMs is their alignment with human visual perception, particularly in how they navigate the balance between texture and shape bias. Historically, vision-only models displayed a pronounced preference for texture over shape, a pattern that diverges from human visual tendencies which favor shape. This paper explores the texture vs. shape bias within various VLMs and assesses whether the bias can be moderated or redirected through linguistic prompts, laying the groundwork for deeper inquiry into how these models perceive and interpret visual information.

Texture vs. Shape Bias in VLMs

An exhaustive analysis of popular VLMs reveals a nuanced landscape where, contrary to prior vision-only models, many VLMs exhibit a stronger inclination toward shape bias when processing visual information. This shift suggests that multimodal training involving both text and images does not merely transplant vision encoders' biases into VLMs but instead modulates these biases through linguistic integration. Crucially, while VLMs demonstrate a more shape-oriented approach than their vision-only counterparts, they still fall well short of the strong human preference for shape (96%). Notably, certain models demonstrate an ability to adjust their bias based on the task, displaying varying levels of shape preference in tasks like visual question answering (VQA) and image captioning.
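For reference, shape bias in this line of work is typically measured on cue-conflict images (in the style of Geirhos et al.), where the shape of one class is rendered with the texture of another. The sketch below shows the standard metric; the function name and data layout are illustrative, not taken from the paper's code.

```python
def shape_bias(decisions):
    """Compute shape bias on cue-conflict images.

    `decisions` is a list of (predicted_label, shape_label, texture_label)
    tuples, one per cue-conflict image. Only trials where the model picks
    either the shape class or the texture class count toward the bias.
    """
    shape_hits = sum(pred == shape for pred, shape, _ in decisions)
    texture_hits = sum(pred == texture for pred, _, texture in decisions)
    total = shape_hits + texture_hits
    return shape_hits / total if total else float("nan")

# Example: three cue-conflict trials; the model follows shape twice, texture once.
trials = [("cat", "cat", "elephant"),
          ("elephant", "cat", "elephant"),
          ("dog", "dog", "clock")]
print(shape_bias(trials))  # ~0.67, i.e. roughly a 67% shape bias
```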

Investigation of Bias Modulation

The central inquiry into whether and how the visual biases in VLMs can be influenced through language reveals compelling outcomes. By employing task-specific prompting and altering the visual input (through pre-processing techniques like patch shuffling and noise addition), the study explores the malleability of shape and texture biases. Intriguingly, text-based manipulations underscore the possibility of steering these biases to a considerable degree, albeit not as substantially as through visual alterations. This discovery opens intriguing avenues for research into the interplay between textual and visual information in guiding model perception.
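On the language side, the abstract reports that prompting alone can move shape bias from roughly 49% to 72%. The exact prompts are not quoted in this summary, so the variants below are purely hypothetical, illustrating the kind of shape- versus texture-oriented instruction such steering relies on.

```python
# Hypothetical prompt variants for steering a VLM on cue-conflict classification;
# the actual prompts used in the paper may differ.
NEUTRAL_PROMPT = "What object is shown in this image? Answer with a single word."
SHAPE_PROMPT = ("What object is shown in this image? Focus on the object's shape "
                "and ignore its texture. Answer with a single word.")
TEXTURE_PROMPT = ("What object is shown in this image? Focus on the object's texture "
                  "and ignore its shape. Answer with a single word.")
```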
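On the vision side, the paper's exact pre-processing code is not reproduced here, but the two named manipulations are standard: patch shuffling destroys global shape while preserving local texture, and additive noise degrades texture cues. Below is a minimal NumPy sketch under the assumption of an (H, W, C) uint8 image whose sides divide evenly by the patch size; the function names and noise scale are illustrative.

```python
import numpy as np

def shuffle_patches(image, patch_size, rng=None):
    """Destroy global shape cues by randomly permuting square patches."""
    rng = rng or np.random.default_rng()
    h, w, c = image.shape
    ph, pw = h // patch_size, w // patch_size
    # Cut the image into a grid of patches, shuffle the grid, reassemble.
    patches = (image.reshape(ph, patch_size, pw, patch_size, c)
                    .transpose(0, 2, 1, 3, 4)
                    .reshape(ph * pw, patch_size, patch_size, c))
    patches = patches[rng.permutation(len(patches))]
    return (patches.reshape(ph, pw, patch_size, patch_size, c)
                   .transpose(0, 2, 1, 3, 4)
                   .reshape(h, w, c))

def add_gaussian_noise(image, sigma=25.0, rng=None):
    """Degrade local texture cues with additive Gaussian pixel noise."""
    rng = rng or np.random.default_rng()
    noisy = image.astype(np.float32) + rng.normal(0.0, sigma, image.shape)
    return np.clip(noisy, 0, 255).astype(np.uint8)
```

Smaller patch sizes remove more of the object's global shape, so sweeping the patch size (or the noise level) gives a controllable knob for probing how strongly a model relies on shape versus texture.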

Implications and Future Directions

The findings of this study have broad implications, both theoretical and practical. On a theoretical level, the evidence that VLMs’ visual biases can be partially steered through linguistic inputs enriches our understanding of multimodal learning dynamics and the complex interplay between text and image processing. Practically, the ability to modulate visual biases in VLMs could enhance model performance across tasks that require nuanced visual understanding, from improved accessibility tools to more accurate visual search and annotation systems.

Looking ahead, this exploration sets the stage for further studies into the multimodal workings of VLMs, encouraging a deeper dive into the mechanisms that underpin bias modulation. Additionally, given the rapid evolution of VLM technologies, future work could extend beyond texture and shape bias to uncover other potential biases and the extent to which they can be shaped through multimodal interactions.

Conclusion

This paper provides a foundational exploration of the texture vs. shape bias in VLMs, revealing a marked departure from the tendencies observed in vision-only models. Through meticulous experimentation, it establishes that while VLMs naturally exhibit a stronger shape bias than their vision encoders, this bias can be further influenced, albeit to a limited degree, through linguistic prompts. These insights not only enrich our understanding of VLMs’ operational dynamics but also offer practical pathways to enhance their alignment with human visual perception, marking a significant step forward in the quest to create more intuitive and effective multimodal models.
