Learning to Localize Objects Improves Spatial Reasoning in Visual-LLMs (2404.07449v1)

Published 11 Apr 2024 in cs.CV

Abstract: Integration of LLMs into visual domain tasks, resulting in visual-LLMs (V-LLMs), has enabled exceptional performance in vision-language tasks, particularly for visual question answering (VQA). However, existing V-LLMs (e.g. BLIP-2, LLaVA) demonstrate weak spatial reasoning and localization awareness. Despite generating highly descriptive and elaborate textual answers, these models fail at simple tasks like distinguishing a left vs right location. In this work, we explore how image-space coordinate based instruction fine-tuning objectives could inject spatial awareness into V-LLMs. We discover optimal coordinate representations, data-efficient instruction fine-tuning objectives, and pseudo-data generation strategies that lead to improved spatial awareness in V-LLMs. Additionally, our resulting model improves VQA across image and video domains, reduces undesired hallucination, and generates better contextual object descriptions. Experiments across 5 vision-language tasks involving 14 different datasets establish the clear performance improvements achieved by our proposed framework.

Summary

  • The paper introduces LocVLM, which integrates image-space coordinate objectives to improve spatial reasoning in visual-LLMs.
  • It proposes three instruction fine-tuning objectives, LocPred, NegPred, and RevLoc, that boost performance in both image and video VQA tasks.
  • The framework achieves state-of-the-art results on benchmarks such as GQA and VQAv2 while reducing object hallucination.

Learning to Localize Objects Improves Spatial Reasoning in Visual-LLMs

This paper explores improvements in spatial reasoning for visual-LLMs (V-LLMs), which integrate LLMs into visual tasks such as visual question answering (VQA). Although current V-LLMs like BLIP-2 and LLaVA perform well on vision-language tasks, they show a significant limitation in spatial reasoning, such as differentiating left from right. This work introduces methods to enhance spatial awareness using image-space coordinate-based objectives. The proposed framework improves performance on spatial awareness tasks and VQA, reduces hallucination, and produces better contextual object descriptions.

Introduction

The authors identify a crucial gap in existing V-LLMs regarding spatial reasoning. Despite answering complex queries about image content well, these models struggle with simple spatial distinctions. The researchers propose integrating spatial localization by embedding image-space coordinates within language prompts, aiming to enhance V-LLMs' spatial reasoning and their performance in tasks requiring spatial awareness (Figure 1).

Figure 1: The model's ability to use contextual region descriptions enhances spatial awareness in visual question answering.
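To make the idea of embedding image-space coordinates in language prompts concrete, the sketch below shows one plausible serialization: a pixel-space bounding box normalized to [0, 1] and written as plain text inside an instruction. The function name, the [x1, y1, x2, y2] format, and the two-decimal precision are illustrative assumptions, not the paper's exact representation (the paper itself compares several coordinate representations).

```python
# A minimal sketch, assuming a normalized [x1, y1, x2, y2] text format; the paper
# compares several coordinate representations and may use a different one.

def box_to_prompt_coords(box, image_w, image_h, precision=2):
    """Serialize a pixel-space box (x1, y1, x2, y2) as normalized text coordinates."""
    x1, y1, x2, y2 = box
    norm = [x1 / image_w, y1 / image_h, x2 / image_w, y2 / image_h]
    return "[" + ", ".join(f"{v:.{precision}f}" for v in norm) + "]"

# Embed the serialized coordinates directly in the instruction text.
box_text = box_to_prompt_coords((64, 120, 256, 300), image_w=640, image_h=480)
prompt = f"Describe the object located at {box_text} in the image."
print(prompt)  # Describe the object located at [0.10, 0.25, 0.40, 0.62] in the image.
```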

Methodology

The methodology involves instruction fine-tuning objectives that explicitly incorporate spatial coordinates. The authors detail three main objectives: Location Prediction (LocPred), Negative Prediction (NegPred), and Reverse-Location Prediction (RevLoc). These are designed to embed spatial reasoning into V-LLMs, allowing them to process and generate coordinate-based information effectively.
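The summary names the three objectives but not their prompt formats, so the following is a hedged sketch of how instruction-tuning samples for LocPred, RevLoc, and NegPred could be constructed. The wording, the answer formats, and the reading of NegPred as querying objects absent from the image are assumptions, not the authors' templates.

```python
# Hedged sketch of sample constructors for the three objectives; prompt wording
# and answer formats are assumptions, not the authors' exact templates.

def loc_pred_sample(obj_name, box_text):
    # LocPred: given an object description, predict its image-space coordinates.
    return {
        "instruction": f"Where is the {obj_name} in the image? "
                       f"Answer with normalized coordinates.",
        "answer": box_text,
    }

def rev_loc_sample(obj_name, box_text):
    # RevLoc: given coordinates, describe the object found at that location.
    return {
        "instruction": f"What object is located at {box_text}?",
        "answer": f"A {obj_name}.",
    }

def neg_pred_sample(absent_obj):
    # NegPred: ask about an object that is not in the image, so the model
    # learns to decline rather than hallucinate a location.
    return {
        "instruction": f"Where is the {absent_obj} in the image?",
        "answer": f"There is no {absent_obj} in the image.",
    }
```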

The paper proposes a framework named LocVLM, which applies visual instruction tuning similar to LLaVA. It couples a visual encoder, an adapter layer, and an LLM, enabling the model to extract meaningful coordinates and descriptions from images. The authors also present a data-efficient pseudo-data generation strategy that leverages pre-trained V-LLMs to augment the training process without additional human annotations (Figure 2).

Figure 2: Model architecture inspired by LLaVA, with modifications to incorporate spatial reasoning tasks.
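A minimal PyTorch-style sketch of the LLaVA-style pipeline described above: a frozen visual encoder, a linear adapter projecting visual tokens into the LLM embedding space, and a decoder-only LLM. The class name, dimensions, and the HuggingFace-style `inputs_embeds` call are placeholder assumptions rather than the authors' implementation.

```python
import torch
import torch.nn as nn

class LocVLMSketch(nn.Module):
    """Minimal sketch of the described pipeline: frozen visual encoder ->
    linear adapter -> decoder-only LLM. Dimensions and interfaces are
    illustrative placeholders, not the paper's implementation."""

    def __init__(self, vision_encoder, llm, vision_dim=1024, llm_dim=4096):
        super().__init__()
        self.vision_encoder = vision_encoder           # e.g. a frozen ViT returning patch features
        self.adapter = nn.Linear(vision_dim, llm_dim)  # projects visual tokens into LLM space
        self.llm = llm                                 # decoder-only language model

    def forward(self, images, text_embeds):
        # Encode the image into patch tokens without updating the encoder.
        with torch.no_grad():
            vis_tokens = self.vision_encoder(images)           # (B, N, vision_dim)
        vis_tokens = self.adapter(vis_tokens)                  # (B, N, llm_dim)
        # Prepend projected visual tokens to the text embeddings and decode.
        inputs = torch.cat([vis_tokens, text_embeds], dim=1)   # (B, N + T, llm_dim)
        return self.llm(inputs_embeds=inputs)                  # HF-style call; an assumption
```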

Experimental Results

The framework's effectiveness is tested across multiple benchmarks. Spatial reasoning evaluations reveal near random performance from existing V-LLMs, while LocVLM exhibits marked improvements. The framework achieves state-of-the-art results in image and video VQA tasks and successfully reduces object hallucination.

  1. Spatial Reasoning Results: LocVLM significantly outperforms BLIP-2 and LLaVA in distinguishing spatial relationships such as left vs. right, showcasing enhanced spatial reasoning; a minimal probe-scoring sketch follows this list.
  2. Image VQA: Tested on GQA and VQAv2, LocVLM shows clear improvements, highlighting the impact of integrated spatial localization on complex reasoning tasks.
  3. Video Domain Operation: Adapted for video analysis, LocVLM further demonstrates superior performance in video VQA tasks like ActivityNet-QA, confirming its robust applicability across dynamic inputs.
  4. Object Hallucination: LocVLM showcases reduced hallucination incidents, a prevalent issue in current V-LLMs, as evidenced by improved accuracy across proposed and existing datasets (Figure 3).

    Figure 3: Example images illustrating enhanced spatial reasoning capabilities in toy experiments.
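As referenced in result (1), a hypothetical scorer for a binary left-vs-right probe might look like the following; the question format and the string-matching rule are assumptions, not the paper's evaluation protocol.

```python
# Hypothetical scorer for a binary left/right spatial probe; the matching rule
# below is an assumption, not the paper's protocol.

def score_spatial_probe(model_answer: str, ground_truth: str) -> bool:
    """Correct only if the answer names the ground-truth side and not the other."""
    answer = model_answer.lower()
    other = "right" if ground_truth == "left" else "left"
    return ground_truth in answer and other not in answer

predictions = [("The cup is to the left of the laptop.", "left"),
               ("It appears on the right side.", "left")]
accuracy = sum(score_spatial_probe(a, gt) for a, gt in predictions) / len(predictions)
print(f"accuracy = {accuracy:.2f}")  # accuracy = 0.50
```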

Future Work and Implications

This research opens avenues for further exploration in spatial reasoning within V-LLMs, suggesting an integration framework that does not rely solely on increased training data but on strategic objective alignment. Future work could enhance temporal modeling for dynamic video understanding, embedding time-coordinate reasoning within the spatial framework.

The practical implications are significant: improved spatial reasoning benefits automated systems that require spatial decision-making, such as robotics and autonomous navigation, where contextual awareness and accurate spatial interpretation are crucial.

Conclusion

The proposed LocVLM framework substantially enhances spatial reasoning in V-LLMs, highlighting its potential in bolstering VQA tasks. By integrating location-specific fine-tuning objectives, the framework overcomes inherent limitations in spatial reasoning, setting a new benchmark for V-LLMs in handling complex spatial tasks. These findings suggest a viable path for evolving AI towards more holistic visual and cognitive functionalities (Figure 4).

Figure 4: Demonstrating the model's ability to provide representative descriptions for specific queried image regions.
