Learning to Localize Objects Improves Spatial Reasoning in Visual-LLMs (2404.07449v1)

Published 11 Apr 2024 in cs.CV

Abstract: Integration of LLMs into visual domain tasks, resulting in visual-LLMs (V-LLMs), has enabled exceptional performance in vision-language tasks, particularly for visual question answering (VQA). However, existing V-LLMs (e.g. BLIP-2, LLaVA) demonstrate weak spatial reasoning and localization awareness. Despite generating highly descriptive and elaborate textual answers, these models fail at simple tasks like distinguishing a left vs right location. In this work, we explore how image-space coordinate based instruction fine-tuning objectives could inject spatial awareness into V-LLMs. We discover optimal coordinate representations, data-efficient instruction fine-tuning objectives, and pseudo-data generation strategies that lead to improved spatial awareness in V-LLMs. Additionally, our resulting model improves VQA across image and video domains, reduces undesired hallucination, and generates better contextual object descriptions. Experiments across 5 vision-language tasks involving 14 different datasets establish the clear performance improvements achieved by our proposed framework.

Summary

  • The paper introduces LocVLM, which integrates image-space coordinate objectives to improve spatial reasoning in visual-LLMs.
  • It proposes three instruction fine-tuning objectives, LocPred, NegPred, and RevLoc, that boost performance in both image and video VQA tasks.
  • The framework achieves state-of-the-art results on benchmarks such as GQA and VQAv2 while reducing object hallucination.

Learning to Localize Objects Improves Spatial Reasoning in Visual-LLMs

This paper explores improvements in spatial reasoning for visual-LLMs (V-LLMs), which integrate LLMs into visual tasks such as visual question answering (VQA). Although current V-LLMs like BLIP-2 and LLaVA perform well on vision-language tasks, they show a significant limitation in spatial reasoning, such as differentiating left from right. This work introduces methods to enhance spatial awareness using image-space coordinate-based objectives. The proposed framework improves performance on spatial awareness tasks and VQA, reduces hallucination, and produces better contextual object descriptions.

Introduction

The authors identify a crucial gap in existing V-LLMs regarding spatial reasoning. Despite answering complex queries about image content well, these models struggle with simple spatial distinctions. The researchers propose integrating spatial localization by embedding image-space coordinates within language prompts, aiming to enhance V-LLMs' spatial reasoning and their performance in tasks requiring spatial awareness (Figure 1).

Figure 1: The model's ability to use contextual region descriptions enhances spatial awareness in visual question answering.
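To make the idea of embedding image-space coordinates in language prompts concrete, the sketch below shows one plausible serialization: a pixel-space bounding box normalized to [0, 1] and written as plain text inside an instruction. The function name, the [x1, y1, x2, y2] format, and the two-decimal precision are illustrative assumptions, not the paper's exact representation (the paper itself compares several coordinate representations).

```python
# A minimal sketch, assuming a normalized [x1, y1, x2, y2] text format; the paper
# compares several coordinate representations and may use a different one.

def box_to_prompt_coords(box, image_w, image_h, precision=2):
    """Serialize a pixel-space box (x1, y1, x2, y2) as normalized text coordinates."""
    x1, y1, x2, y2 = box
    norm = [x1 / image_w, y1 / image_h, x2 / image_w, y2 / image_h]
    return "[" + ", ".join(f"{v:.{precision}f}" for v in norm) + "]"

# Embed the serialized coordinates directly in the instruction text.
box_text = box_to_prompt_coords((64, 120, 256, 300), image_w=640, image_h=480)
prompt = f"Describe the object located at {box_text} in the image."
print(prompt)  # Describe the object located at [0.10, 0.25, 0.40, 0.62] in the image.
```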

Methodology

The methodology involves instruction fine-tuning objectives that explicitly incorporate spatial coordinates. The authors detail three main objectives: Location Prediction (LocPred), Negative Prediction (NegPred), and Reverse-Location Prediction (RevLoc). These are designed to embed spatial reasoning into V-LLMs, allowing them to process and generate coordinate-based information effectively.
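The summary names the three objectives but not their prompt formats, so the following is a hedged sketch of how instruction-tuning samples for LocPred, RevLoc, and NegPred could be constructed. The wording, the answer formats, and the reading of NegPred as querying objects absent from the image are assumptions, not the authors' templates.

```python
# Hedged sketch of sample constructors for the three objectives; prompt wording
# and answer formats are assumptions, not the authors' exact templates.

def loc_pred_sample(obj_name, box_text):
    # LocPred: given an object description, predict its image-space coordinates.
    return {
        "instruction": f"Where is the {obj_name} in the image? "
                       f"Answer with normalized coordinates.",
        "answer": box_text,
    }

def rev_loc_sample(obj_name, box_text):
    # RevLoc: given coordinates, describe the object found at that location.
    return {
        "instruction": f"What object is located at {box_text}?",
        "answer": f"A {obj_name}.",
    }

def neg_pred_sample(absent_obj):
    # NegPred: ask about an object that is not in the image, so the model
    # learns to decline rather than hallucinate a location.
    return {
        "instruction": f"Where is the {absent_obj} in the image?",
        "answer": f"There is no {absent_obj} in the image.",
    }
```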

The paper proposes a framework named LocVLM, which applies visual instruction tuning similar to LLaVA. It couples a visual encoder, an adapter layer, and an LLM, enabling the model to extract meaningful coordinates and descriptions from images. The authors also present a data-efficient pseudo-data generation strategy that leverages pre-trained V-LLMs to augment the training process without additional human annotations (Figure 2).

Figure 2: Model architecture inspired by LLaVA, with modifications to incorporate spatial reasoning tasks.
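A minimal PyTorch-style sketch of the LLaVA-style pipeline described above: a frozen visual encoder, a linear adapter projecting visual tokens into the LLM embedding space, and a decoder-only LLM. The class name, dimensions, and the HuggingFace-style `inputs_embeds` call are placeholder assumptions rather than the authors' implementation.

```python
import torch
import torch.nn as nn

class LocVLMSketch(nn.Module):
    """Minimal sketch of the described pipeline: frozen visual encoder ->
    linear adapter -> decoder-only LLM. Dimensions and interfaces are
    illustrative placeholders, not the paper's implementation."""

    def __init__(self, vision_encoder, llm, vision_dim=1024, llm_dim=4096):
        super().__init__()
        self.vision_encoder = vision_encoder           # e.g. a frozen ViT returning patch features
        self.adapter = nn.Linear(vision_dim, llm_dim)  # projects visual tokens into LLM space
        self.llm = llm                                 # decoder-only language model

    def forward(self, images, text_embeds):
        # Encode the image into patch tokens without updating the encoder.
        with torch.no_grad():
            vis_tokens = self.vision_encoder(images)           # (B, N, vision_dim)
        vis_tokens = self.adapter(vis_tokens)                  # (B, N, llm_dim)
        # Prepend projected visual tokens to the text embeddings and decode.
        inputs = torch.cat([vis_tokens, text_embeds], dim=1)   # (B, N + T, llm_dim)
        return self.llm(inputs_embeds=inputs)                  # HF-style call; an assumption
```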

Experimental Results

The framework's effectiveness is tested across multiple benchmarks. Spatial reasoning evaluations reveal near random performance from existing V-LLMs, while LocVLM exhibits marked improvements. The framework achieves state-of-the-art results in image and video VQA tasks and successfully reduces object hallucination.

  1. Spatial Reasoning Results: LocVLM significantly outperforms BLIP-2 and LLaVA in distinguishing spatial relationships such as left vs. right, showcasing enhanced spatial reasoning; a minimal probe-scoring sketch follows this list.
  2. Image VQA: Tested on GQA and VQAv2, LocVLM shows clear improvements, highlighting the impact of integrated spatial localization on complex reasoning tasks.
  3. Video Domain Operation: Adapted for video analysis, LocVLM further demonstrates superior performance in video VQA tasks like ActivityNet-QA, confirming its robust applicability across dynamic inputs.
  4. Object Hallucination: LocVLM showcases reduced hallucination incidents, a prevalent issue in current V-LLMs, as evidenced by improved accuracy across proposed and existing datasets (Figure 3).

    Figure 3: Example images illustrating enhanced spatial reasoning capabilities in toy experiments.
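As referenced in result (1), a hypothetical scorer for a binary left-vs-right probe might look like the following; the question format and the string-matching rule are assumptions, not the paper's evaluation protocol.

```python
# Hypothetical scorer for a binary left/right spatial probe; the matching rule
# below is an assumption, not the paper's protocol.

def score_spatial_probe(model_answer: str, ground_truth: str) -> bool:
    """Correct only if the answer names the ground-truth side and not the other."""
    answer = model_answer.lower()
    other = "right" if ground_truth == "left" else "left"
    return ground_truth in answer and other not in answer

predictions = [("The cup is to the left of the laptop.", "left"),
               ("It appears on the right side.", "left")]
accuracy = sum(score_spatial_probe(a, gt) for a, gt in predictions) / len(predictions)
print(f"accuracy = {accuracy:.2f}")  # accuracy = 0.50
```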

Future Work and Implications

This research opens avenues for further exploration in spatial reasoning within V-LLMs, suggesting an integration framework that does not rely solely on increased training data but on strategic objective alignment. Future work could enhance temporal modeling for dynamic video understanding, embedding time-coordinate reasoning within the spatial framework.

The practical implications are significant: improved spatial reasoning benefits automated systems that require spatial decision-making, such as robotics and autonomous navigation, where contextual awareness and accurate spatial interpretation are crucial.

Conclusion

The proposed LocVLM framework substantially enhances spatial reasoning in V-LLMs, highlighting its potential in bolstering VQA tasks. By integrating location-specific fine-tuning objectives, the framework overcomes inherent limitations in spatial reasoning, setting a new benchmark for V-LLMs in handling complex spatial tasks. These findings suggest a viable path for evolving AI towards more holistic visual and cognitive functionalities (Figure 4).

Figure 4: Demonstrating the model's ability to provide representative descriptions for specific queried image regions.
