
Eyes Wide Shut? Exploring the Visual Shortcomings of Multimodal LLMs (2401.06209v2)

Published 11 Jan 2024 in cs.CV

Abstract: Is vision good enough for language? Recent advancements in multimodal models primarily stem from the powerful reasoning abilities of LLMs. However, the visual component typically depends only on the instance-level contrastive language-image pre-training (CLIP). Our research reveals that the visual capabilities in recent multimodal LLMs (MLLMs) still exhibit systematic shortcomings. To understand the roots of these errors, we explore the gap between the visual embedding space of CLIP and vision-only self-supervised learning. We identify "CLIP-blind pairs" - images that CLIP perceives as similar despite their clear visual differences. With these pairs, we construct the Multimodal Visual Patterns (MMVP) benchmark. MMVP exposes areas where state-of-the-art systems, including GPT-4V, struggle with straightforward questions across nine basic visual patterns, often providing incorrect answers and hallucinated explanations. We further evaluate various CLIP-based vision-and-language models and find a notable correlation between visual patterns that challenge CLIP models and those problematic for multimodal LLMs. As an initial effort to address these issues, we propose a Mixture of Features (MoF) approach, demonstrating that integrating vision self-supervised learning features with MLLMs can significantly enhance their visual grounding capabilities. Together, our research suggests that visual representation learning remains an open challenge, and accurate visual grounding is crucial for future successful multimodal systems.


Summary

  • The paper identifies significant visual grounding defects by leveraging CLIP-blind pairs and the MMVP benchmark.
  • The paper introduces the Mixture-of-Features (MoF) strategy, especially Interleaved-MoF, which combines features from vision-only self-supervised encoders with CLIP-style language-image features.
  • Benchmark results reveal that state-of-the-art MLLMs struggle with basic visual tasks, underscoring a critical gap in visual discrimination.

An Examination of Visual Shortcomings in Multimodal LLMs

Introduction

The paper "Eyes Wide Shut? Exploring the Visual Shortcomings of Multimodal LLMs" explores the limitations of current Multimodal LLMs (MLLMs) despite their impressive advancements in integrating visual information with textual reasoning. While models such as GPT-4V embody the cutting-edge in tasks like Visual Question Answering (VQA) and multimodal interactions, they still reveal notable defects, particularly in visual grounding. This paper systematically investigates these shortcomings and suggests potential directions for improvement.

Identifying Visual Limitations

A significant discovery in this research is the concept of "CLIP-blind pairs": image pairs that are visually distinct but encoded similarly by the CLIP model. These pairs form the basis of the Multimodal Visual Patterns (MMVP) benchmark, designed to test the visual processing capabilities of MLLMs using basic visual questions (Figure 1).

Figure 1: Constructing MMVP benchmark via CLIP-blind pairs. Left: Finding CLIP-blind pairs with similar CLIP embedding but different DINOv2 embedding; Center: Inspecting image differences; Right: Querying MLLMs with these images.
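To make the mining step concrete, the following is a minimal sketch of how CLIP-blind pairs could be selected from pre-computed embeddings: keep image pairs whose CLIP embeddings are nearly identical while their DINOv2 embeddings clearly differ. The function name, embedding format, and similarity thresholds are illustrative assumptions, not the authors' released code.

```python
# Minimal sketch (assumed interface): mine CLIP-blind pairs from pre-computed
# image embeddings. Thresholds are illustrative, not the paper's exact values.
import torch
import torch.nn.functional as F

def find_clip_blind_pairs(clip_emb: torch.Tensor,
                          dino_emb: torch.Tensor,
                          clip_thresh: float = 0.95,
                          dino_thresh: float = 0.60):
    """Return index pairs (i, j) whose CLIP embeddings are near-identical
    while their DINOv2 embeddings clearly differ.

    clip_emb: (N, D_clip) image embeddings from a CLIP vision encoder.
    dino_emb: (N, D_dino) embeddings of the same images from DINOv2.
    """
    # Cosine-similarity matrices over the whole image set.
    clip_sim = F.normalize(clip_emb, dim=-1) @ F.normalize(clip_emb, dim=-1).T
    dino_sim = F.normalize(dino_emb, dim=-1) @ F.normalize(dino_emb, dim=-1).T

    n = clip_emb.shape[0]
    upper = torch.triu(torch.ones(n, n), diagonal=1).bool()  # each pair once
    mask = (clip_sim > clip_thresh) & (dino_sim < dino_thresh) & upper
    return torch.nonzero(mask, as_tuple=False).tolist()      # list of [i, j]
```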

Results from the MMVP benchmark across state-of-the-art models, including GPT-4V, reveal substantial deficiencies. Remarkably, MLLMs struggle with questions that humans effortlessly resolve, indicating that advancements in language reasoning have not been matched by comparable improvements in visual discrimination.
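Because MMVP questions are built from CLIP-blind pairs, a natural pair-level scoring rule credits a pair only when both of its questions are answered correctly, so guessing on one image of a pair earns nothing. The sketch below assumes a simple result format with hypothetical field names and is illustrative rather than the official evaluation script.

```python
# Hedged sketch of pair-level MMVP-style scoring; "pair_id" and "correct"
# are assumed field names for illustration.
from collections import defaultdict

def paired_accuracy(results):
    """results: iterable of dicts like {"pair_id": int, "correct": bool},
    two entries per CLIP-blind pair (one per image/question)."""
    by_pair = defaultdict(list)
    for r in results:
        by_pair[r["pair_id"]].append(bool(r["correct"]))
    solved = sum(1 for answers in by_pair.values() if all(answers))
    return solved / max(len(by_pair), 1)

# Example: one fully correct pair and one half-correct pair -> 0.5 accuracy.
demo = [
    {"pair_id": 0, "correct": True},  {"pair_id": 0, "correct": True},
    {"pair_id": 1, "correct": True},  {"pair_id": 1, "correct": False},
]
print(paired_accuracy(demo))  # 0.5
```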

Systematic Failures and Visual Patterns

Beyond individual failures, the paper categorizes systematic visual patterns that MLLMs struggle with, identified through CLIP-blind pairs. These patterns include basic visual concepts such as object orientation, counting, and specific feature presence, which are crucial for detailed visual understanding.

Figure 2 shows examples of questions in the MMVP benchmark that expose these systematic failures across current models.

Figure 2: Examples of Questions in the MMVP benchmark. Incorrect answers are shaded in red.

Given these insights, the research suggests that MLLMs' reliance on CLIP-like vision encoders could bottleneck their performance in tasks requiring precise visual grounding, further affirmed by a detailed benchmark analysis (Figure 3).

Figure 3: Benchmark results of current SOTA MLLMs and humans.
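The reported per-pattern correlation between CLIP failures and MLLM failures can be checked with a few lines. The snippet below is not the authors' analysis script; the pattern names and scores are placeholders, not the paper's reported numbers.

```python
# Illustrative correlation check between per-pattern CLIP performance and
# per-pattern MLLM performance; all values below are hypothetical.
import numpy as np

patterns = ["orientation", "counting", "feature presence"]  # placeholder subset
clip_scores = np.array([0.25, 0.30, 0.40])   # hypothetical MMVP-VLM accuracies
mllm_scores = np.array([0.20, 0.35, 0.45])   # hypothetical MMVP accuracies

r = np.corrcoef(clip_scores, mllm_scores)[0, 1]
print(f"Pearson correlation across patterns: {r:.2f}")
```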

Mixture-of-Features Approach

The paper's key contribution towards overcoming these visual limitations is the Mixture-of-Features (MoF) strategy. By integrating features from both vision-specific models like DINOv2 and language-image models like CLIP, MLLMs can enhance visual grounding without sacrificing instruction-following capabilities. Several MoF strategies, such as Additive-MoF and Interleaved-MoF, are tested and yield significant improvements on visual tasks (Figure 4).

Figure 4: Different Mixture-of-Feature (MoF) Strategies in MLLM.

The Interleaved-MoF strategy in particular retains the strengths of both vision encoders and significantly improves MLLM performance on the evaluation benchmarks. This hybrid approach underlines the importance of blending vision-centric representation learning with traditional language-image pretraining.
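The sketch below illustrates the two MoF variants at a high level, assuming both encoders have already produced per-patch token sequences and each has its own adapter into the LLM embedding space: Additive-MoF blends the two projected token streams with a mixing ratio, while Interleaved-MoF alternates tokens from the two encoders while preserving spatial order. Adapter design, tensor shapes, and the ratio alpha are assumptions for illustration, not the authors' exact architecture.

```python
# Minimal sketch of Additive-MoF vs. Interleaved-MoF (assumed shapes/adapters).
import torch
import torch.nn as nn

class MixtureOfFeatures(nn.Module):
    def __init__(self, clip_dim: int, dino_dim: int, llm_dim: int, alpha: float = 0.5):
        super().__init__()
        self.clip_adapter = nn.Linear(clip_dim, llm_dim)  # projects CLIP tokens
        self.dino_adapter = nn.Linear(dino_dim, llm_dim)  # projects DINOv2 tokens
        self.alpha = alpha  # assumed weight on DINOv2 features in Additive-MoF

    def additive(self, clip_tokens, dino_tokens):
        # Element-wise blend of the two projected token streams (same length N).
        c = self.clip_adapter(clip_tokens)   # (B, N, llm_dim)
        d = self.dino_adapter(dino_tokens)   # (B, N, llm_dim)
        return self.alpha * d + (1.0 - self.alpha) * c

    def interleaved(self, clip_tokens, dino_tokens):
        # Alternate CLIP and DINOv2 tokens while keeping spatial order,
        # doubling the visual sequence length fed to the LLM.
        c = self.clip_adapter(clip_tokens)   # (B, N, llm_dim)
        d = self.dino_adapter(dino_tokens)   # (B, N, llm_dim)
        B, N, D = c.shape
        out = torch.stack([c, d], dim=2)     # (B, N, 2, llm_dim)
        return out.reshape(B, 2 * N, D)      # (B, 2N, llm_dim)
```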

Implications and Future Directions

The findings of this paper highlight that while LLMs have advanced considerably, their visual counterparts need targeted improvements. Returning to foundational visual understanding, by concentrating on rich, nuanced features beyond what current CLIP models capture, is vital for future progress. As CLIP-based models scale, integrating methods like MoF represents a promising direction for bridging the existing gaps in multimodal understanding.

Conclusion

The paper elucidates the visual deficiencies of current state-of-the-art MLLMs and introduces benchmarks and methods to address them. By reinforcing the visual components of these models, the research charts a course toward a more balanced and holistic integration of vision and language, essential for practical, real-world applications across diverse AI-driven tasks.
