Griffon v2: Advancing Multimodal Perception with High-Resolution Scaling and Visual-Language Co-Referring

Published 14 Mar 2024 in cs.CV and cs.AI | (2403.09333v1)

Abstract: Large Vision Language Models have achieved fine-grained object perception, but the limitation of image resolution remains a significant obstacle to surpassing the performance of task-specific experts in complex and dense scenarios. This limitation further restricts the model's potential to achieve nuanced visual and language referring in domains such as GUI agents and counting. To address this issue, we introduce a unified high-resolution generalist model, Griffon v2, enabling flexible object referring with visual and textual prompts. To efficiently scale up image resolution, we design a simple and lightweight down-sampling projector to overcome the input token constraint in LLMs. This design inherently preserves the complete contexts and fine details, and significantly improves multimodal perception ability, especially for small objects. Building upon this, we further equip the model with visual-language co-referring capabilities through a plug-and-play visual tokenizer. It enables user-friendly interaction with flexible target images, free-form texts and even coordinates. Experiments demonstrate that Griffon v2 can localize any objects of interest with visual and textual referring, achieve state-of-the-art performance on REC, phrase grounding, and REG tasks, and outperform expert models in object detection and object counting. Data, code and models will be released at https://github.com/jefferyZhan/Griffon.


Authors (6)


Summary

The paper demonstrates a novel high-resolution LVLM that overcomes token constraints to retain fine details for precise object perception.
It introduces a visual-language co-referring mechanism that integrates visual tokens with textual cues for flexible target identification.
The model achieves state-of-the-art results in REC, REG, object detection, and counting, unifying multiple domains under one robust framework.

Advancing Multimodal Perception with Griffon v2

The paper "Griffon v2: Advancing Multimodal Perception with High-Resolution Scaling and Visual-Language Co-Referring" presents Griffon v2, a large vision-language model (LVLM) aimed at two limitations of current LVLMs: restricted image resolution and weak fine-grained object perception in dense, complex scenes. Its core contribution is a high-resolution generalist model that supports flexible object referring through both visual and textual prompts.

Key Contributions

High-Resolution Perception: Griffon v2 addresses the constraints of standard image resolutions in LVLMs. By introducing a novel high-resolution structure and a lightweight down-sampling projector, the model bypasses the typical limitations posed by input token constraints in LLMs. This design inherently retains complete contexts and fine details, enhancing performance significantly in tasks requiring precise perception of small objects.
Visual-Language Co-Referring: The authors propose a co-referring mechanism built on a plug-and-play visual tokenizer, which turns a visual prompt into tokens the model can consume alongside text. A target can therefore be specified with a locally cropped image, free-form text, or coordinates. This flexibility is aimed at applications such as GUI agents and object counting.
Comprehensive Evaluation: Griffon v2 demonstrates state-of-the-art performance across a range of evaluation tasks, including Referring Expression Comprehension (REC), phrase grounding, Referring Expression Generation (REG), object detection, and object counting. Notably, in object detection and counting, Griffon v2 surpasses specialized expert models, highlighting its capability to unify multiple task domains under one framework.
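
To make the token-budget argument behind the down-sampling projector concrete, the sketch below counts visual tokens before and after a strided-convolution down-sampler. It is a minimal illustration under stated assumptions, not the paper's implementation: the summary only says the projector is "lightweight", so the square input, the 14-pixel patch size, and the 3x3 kernel with stride 2 and padding 1 are all assumed values, and the function names are hypothetical.

```python
def conv_output_size(n: int, kernel: int, stride: int, padding: int) -> int:
    """Spatial size after a strided convolution (standard output-size formula)."""
    return (n + 2 * padding - kernel) // stride + 1


def visual_token_counts(resolution: int, patch_size: int,
                        kernel: int = 3, stride: int = 2, padding: int = 1):
    """Return (tokens_before, tokens_after) for a square image.

    Assumptions (not taken from the summary): the vision encoder splits the
    image into non-overlapping patch_size x patch_size patches, and the
    projector is a single strided convolution over the resulting token grid.
    """
    if resolution % patch_size != 0:
        raise ValueError("resolution must be a multiple of patch_size")
    grid = resolution // patch_size              # tokens per side from the encoder
    reduced = conv_output_size(grid, kernel, stride, padding)
    return grid * grid, reduced * reduced


if __name__ == "__main__":
    # Illustrative resolutions only; each is a multiple of the assumed patch size.
    for res in (336, 448, 1022):
        before, after = visual_token_counts(res, patch_size=14)
        print(f"{res}px: {before} tokens -> {after} tokens")
```

Under these assumptions the token count falls roughly fourfold at every resolution, which is the kind of reduction that lets a whole high-resolution image fit in the LLM's input without being cut into separate crops.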

Experimental Results

The experiments evaluate Griffon v2 on established benchmarks for REC, REG, and phrase grounding. According to the paper, the model comprehends and localizes referred objects more precisely than the leading methods it is compared against.

REC and REG Tasks: Griffon v2 reports state-of-the-art results, with the most notable gains in scenes that require discriminating between similar, adjacent objects.
Object Detection and Counting: The paper reports that Griffon v2 outperforms expert models in object detection and object counting. Because the down-sampling projector keeps the token count manageable, the model processes the full high-resolution image at once instead of splitting it into smaller crops, which preserves global context and helps counting accuracy across domains.
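
The summary does not state how REC accuracy is scored. By common convention, a predicted box counts as correct when its intersection-over-union (IoU) with the ground-truth box is at least 0.5. The sketch below shows that convention; the function names and the `(x1, y1, x2, y2)` box format are assumptions for illustration, not taken from the paper.

```python
from typing import Sequence, Tuple

Box = Tuple[float, float, float, float]  # assumed format: (x1, y1, x2, y2)


def iou(a: Box, b: Box) -> float:
    """Intersection-over-union of two axis-aligned boxes."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def rec_accuracy(preds: Sequence[Box], gts: Sequence[Box],
                 threshold: float = 0.5) -> float:
    """Fraction of predictions whose IoU with the paired ground truth
    meets the threshold (the conventional REC criterion is 0.5)."""
    if len(preds) != len(gts):
        raise ValueError("preds and gts must have the same length")
    if not preds:
        return 0.0
    hits = sum(iou(p, g) >= threshold for p, g in zip(preds, gts))
    return hits / len(preds)


if __name__ == "__main__":
    predictions = [(0, 0, 2, 2), (0, 0, 2, 2)]
    ground_truth = [(0, 0, 2, 2), (1, 1, 3, 3)]
    print(rec_accuracy(predictions, ground_truth))
```

In the usage example the first prediction matches its target exactly while the second overlaps its target only slightly, so one of the two predictions clears the threshold.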

Implications and Future Directions

Griffon v2 has direct implications for how LVLMs are built and applied in real-world settings. By narrowing the gap between low-resolution perception and the fine-grained object and language understanding that many tasks require, it provides a basis for further work on multimodal AI systems.

Practically, Griffon v2 promises to enhance AI-driven solutions where detailed image understanding is crucial. Theoretically, its hybrid architecture and co-referring capabilities offer insightful directions for ongoing and future research in optimizing multimodal interactions.

Future work may focus on scaling the model to even higher resolutions and larger datasets, and on adapting it to a broader range of interactive applications. The authors state that data, code, and models will be released, which would let the community build on this work.

In conclusion, Griffon v2 is a notable advance in large vision-language models: it scales image resolution efficiently while supporting flexible visual and textual referring, and it offers a starting point for later multimodal systems.
