Are Vision Language Models Texture or Shape Biased and Can We Steer Them?

(2403.09193)
Published Mar 14, 2024 in cs.CV, cs.AI, cs.LG, and q-bio.NC

Abstract

Vision language models (VLMs) have drastically changed the computer vision model landscape in only a few years, opening an exciting array of new applications, from zero-shot image classification to image captioning and visual question answering. Unlike pure vision models, they offer an intuitive way to access visual content through language prompting. The wide applicability of such models encourages us to ask whether they also align with human vision - specifically, how far they adopt human-induced visual biases through multimodal fusion, or whether they simply inherit biases from pure vision models. One important visual bias is the texture vs. shape bias, or the dominance of local over global information. In this paper, we study this bias in a wide range of popular VLMs. Interestingly, we find that VLMs are often more shape-biased than their vision encoders, indicating that visual biases are modulated to some extent through text in multimodal models. If text does indeed influence visual biases, this suggests that we may be able to steer visual biases not just through visual input but also through language: a hypothesis that we confirm through extensive experiments. For instance, we are able to steer shape bias from as low as 49% to as high as 72% through prompting alone. For now, the strong human bias towards shape (96%) remains out of reach for all tested VLMs.

VLMs are more shape-biased than their vision encoders in object recognition; language prompts can steer this bias, though less strongly than changes to the visual input.

Overview

  • This paper investigates the texture vs. shape bias in Vision Language Models (VLMs) and explores whether these biases can be modulated through linguistic prompts.

  • Research reveals that, unlike vision-only models, which prefer texture, VLMs tend to exhibit a more pronounced shape bias, closer to human visual perception, though they still fall short of fully emulating it.

  • The study examines how task-specific prompting and visual input alterations can influence VLMs' texture and shape biases, finding that biases can be steered to a degree through linguistic and visual manipulations.

  • The findings suggest both theoretical and practical implications for the future of multimodal learning and the development of more nuanced visual understanding in technology.

Exploring the Texture vs. Shape Bias in Vision Language Models (VLMs)

Introduction

Vision Language Models (VLMs) have evolved to be a pivotal component in the intersection of computer vision and natural language processing, enabling a myriad of applications from zero-shot image classification to comprehensive image captioning. A fascinating question that arises in the context of VLMs is their alignment with human visual perception, particularly in how they navigate the balance between texture and shape bias. Historically, vision-only models displayed a pronounced preference for texture over shape, a pattern that diverges from human visual tendencies which favor shape. This paper explores the texture vs. shape bias within various VLMs and assesses whether the bias can be moderated or redirected through linguistic prompts, laying the groundwork for deeper inquiry into how these models perceive and interpret visual information.

Texture vs. Shape Bias in VLMs

An exhaustive analysis of popular VLMs reveals a nuanced landscape where, contrary to prior vision-only models, many VLMs exhibit a stronger inclination toward shape bias when processing visual information. This shift suggests that multimodal training involving both text and images does not merely transplant vision encoders' biases into VLMs but instead modulates these biases through linguistic integration. Crucially, while VLMs demonstrate a more shape-oriented approach than their vision-only counterparts, they still fall well short of the strong human preference for shape (96%). Notably, certain models demonstrate an ability to adjust their bias based on the task, displaying varying levels of shape preference in tasks like visual question answering (VQA) and image captioning.
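For reference, shape bias in this line of work is typically measured on cue-conflict images (in the style of Geirhos et al.), where the shape of one class is rendered with the texture of another. The sketch below shows the standard metric; the function name and data layout are illustrative, not taken from the paper's code.

```python
def shape_bias(decisions):
    """Compute shape bias on cue-conflict images.

    `decisions` is a list of (predicted_label, shape_label, texture_label)
    tuples, one per cue-conflict image. Only trials where the model picks
    either the shape class or the texture class count toward the bias.
    """
    shape_hits = sum(pred == shape for pred, shape, _ in decisions)
    texture_hits = sum(pred == texture for pred, _, texture in decisions)
    total = shape_hits + texture_hits
    return shape_hits / total if total else float("nan")

# Example: three cue-conflict trials; the model follows shape twice, texture once.
trials = [("cat", "cat", "elephant"),
          ("elephant", "cat", "elephant"),
          ("dog", "dog", "clock")]
print(shape_bias(trials))  # ~0.67, i.e. roughly a 67% shape bias
```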

Investigation of Bias Modulation

The central inquiry into whether and how the visual biases in VLMs can be influenced through language reveals compelling outcomes. By employing task-specific prompting and altering the visual input (through pre-processing techniques like patch shuffling and noise addition), the study explores the malleability of shape and texture biases. Intriguingly, text-based manipulations underscore the possibility of steering these biases to a considerable degree, albeit not as substantially as through visual alterations. This discovery opens intriguing avenues for research into the interplay between textual and visual information in guiding model perception.
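On the language side, the abstract reports that prompting alone can move shape bias from roughly 49% to 72%. The exact prompts are not quoted in this summary, so the variants below are purely hypothetical, illustrating the kind of shape- versus texture-oriented instruction such steering relies on.

```python
# Hypothetical prompt variants for steering a VLM on cue-conflict classification;
# the actual prompts used in the paper may differ.
NEUTRAL_PROMPT = "What object is shown in this image? Answer with a single word."
SHAPE_PROMPT = ("What object is shown in this image? Focus on the object's shape "
                "and ignore its texture. Answer with a single word.")
TEXTURE_PROMPT = ("What object is shown in this image? Focus on the object's texture "
                  "and ignore its shape. Answer with a single word.")
```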
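On the vision side, the paper's exact pre-processing code is not reproduced here, but the two named manipulations are standard: patch shuffling destroys global shape while preserving local texture, and additive noise degrades texture cues. Below is a minimal NumPy sketch under the assumption of an (H, W, C) uint8 image whose sides divide evenly by the patch size; the function names and noise scale are illustrative.

```python
import numpy as np

def shuffle_patches(image, patch_size, rng=None):
    """Destroy global shape cues by randomly permuting square patches."""
    rng = rng or np.random.default_rng()
    h, w, c = image.shape
    ph, pw = h // patch_size, w // patch_size
    # Cut the image into a grid of patches, shuffle the grid, reassemble.
    patches = (image.reshape(ph, patch_size, pw, patch_size, c)
                    .transpose(0, 2, 1, 3, 4)
                    .reshape(ph * pw, patch_size, patch_size, c))
    patches = patches[rng.permutation(len(patches))]
    return (patches.reshape(ph, pw, patch_size, patch_size, c)
                   .transpose(0, 2, 1, 3, 4)
                   .reshape(h, w, c))

def add_gaussian_noise(image, sigma=25.0, rng=None):
    """Degrade local texture cues with additive Gaussian pixel noise."""
    rng = rng or np.random.default_rng()
    noisy = image.astype(np.float32) + rng.normal(0.0, sigma, image.shape)
    return np.clip(noisy, 0, 255).astype(np.uint8)
```

Smaller patch sizes remove more of the object's global shape, so sweeping the patch size (or the noise level) gives a controllable knob for probing how strongly a model relies on shape versus texture.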

Implications and Future Directions

The findings of this study have broad implications, both theoretical and practical. On a theoretical level, the evidence that VLMs’ visual biases can be partially steered through linguistic inputs enriches our understanding of multimodal learning dynamics and the complex interplay between text and image processing. Practically, the ability to modulate visual biases in VLMs could enhance model performance across tasks that require nuanced visual understanding, from improved accessibility tools to more accurate visual search and annotation systems.

Looking ahead, this exploration sets the stage for further studies into the multimodal workings of VLMs, encouraging a deeper dive into the mechanisms that underpin bias modulation. Additionally, given the rapid evolution of VLM technologies, future work could extend beyond texture and shape bias to uncover other potential biases and the extent to which they can be shaped through multimodal interactions.

Conclusion

This paper provides a foundational exploration of the texture vs. shape bias in VLMs, revealing a marked departure from the tendencies observed in vision-only models. Through meticulous experimentation, it establishes that while VLMs naturally exhibit a stronger shape bias than their vision encoders, this bias can be further influenced, albeit to a limited degree, through linguistic prompts. These insights not only enrich our understanding of VLMs’ operational dynamics but also offer practical pathways to enhance their alignment with human visual perception, marking a significant step forward in the quest to create more intuitive and effective multimodal models.
