
Groma: Localized Visual Tokenization for Grounding Multimodal Large Language Models

(2404.13013)
Published Apr 19, 2024 in cs.CV, cs.AI, cs.CL, and cs.LG

Abstract

We introduce Groma, a Multimodal Large Language Model (MLLM) with grounded and fine-grained visual perception ability. Beyond holistic image understanding, Groma is adept at region-level tasks such as region captioning and visual grounding. Such capabilities are built upon a localized visual tokenization mechanism, where an image input is decomposed into regions of interest and subsequently encoded into region tokens. By integrating region tokens into user instructions and model responses, we seamlessly enable Groma to understand user-specified region inputs and ground its textual output to images. Besides, to enhance the grounded chat ability of Groma, we curate a visually grounded instruction dataset by leveraging the powerful GPT-4V and visual prompting techniques. Compared with MLLMs that rely on the language model or external module for localization, Groma consistently demonstrates superior performances in standard referring and grounding benchmarks, highlighting the advantages of embedding localization into image tokenization. Project page: https://groma-mllm.github.io/.

Groma encodes images into global tokens and local region tokens, enhancing MLLMs' referring and grounding abilities.
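
To make the mechanism concrete, the snippet below illustrates how region tokens might be interleaved with text in prompts and responses. The special-token names and coordinates are hypothetical placeholders (this summary does not specify Groma's actual token vocabulary); only the interleaving pattern reflects the mechanism described above.

```python
# Hypothetical illustration of interleaving region tokens with text.
# The token names (<r0>, <r3>, <p>, </p>) and coordinates are placeholders,
# not Groma's actual vocabulary or format.
boxes = {"<r0>": (120, 45, 310, 400), "<r3>": (330, 200, 470, 380)}  # illustrative xyxy boxes from a region proposer

user_turn = "What is the object in <r0> doing?"                  # refers to a user-specified region
model_turn = "<p>The person</p> <r0> is throwing <p>a ball</p> <r3>."

# Because each <rk> token is tied to a proposed bounding box, the grounded
# phrases in the reply can be mapped back to image coordinates.
```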

Overview

  • Ma et al. introduced the Groma model, an MLLM that enhances grounding by utilizing localized visual tokenization to link textual content specifically to image regions.

  • Groma employs DINOv2 for image encoding, uses a Deformable DETR for region proposals, and leverages Vicuna-7B for integrating visual and textual data.

  • The model outperformed comparable MLLMs in localization tasks on datasets such as RefCOCO and LVIS, demonstrating a superior ability to produce accurate, contextually grounded textual descriptions.

  • Groma is trained in three stages (detection pretraining, alignment pretraining, and instruction finetuning) to boost its instruction-following capabilities and support nuanced multimodal interaction.

Exploring Localization in Multimodal LLMs through Groma: Localized Visual Tokenization for Enhanced Grounded Understanding

Introduction

Recent advancements in Multimodal LLMs (MLLMs) have significantly broadened their applicability across AI domains by merging visual perception with linguistic reasoning. Despite this progress, a central challenge for MLLMs remains the precise localization and grounding of textual content to specific image regions. To address this, Ma et al. introduce Groma, an MLLM that uses localized visual tokenization to strengthen grounding in multimodal contexts.

Model Architecture

Groma comprises several key components designed to integrate the visual and textual modalities (a minimal sketch of the pipeline follows the list):

  • Image Encoder: A DINOv2 backbone processes high-resolution images, and the resulting visual tokens are downsampled for efficient processing without compromising localization accuracy.
  • Region Proposer: A Deformable DETR transformer is employed to generate region proposals efficiently, ensuring robust detection across a range of objects and scenes.
  • Region Encoder: This module encodes the proposed regions into region tokens that serve as direct anchors for textual descriptions within the image, enabling a precise grounding mechanism.
  • Language Model: Leveraging Vicuna-7B, Groma integrates visual tokens into textual responses, enabling comprehensive and context-aware multimodal understanding.
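
To connect these components, here is a minimal, hedged sketch of a Groma-style forward pass in PyTorch. The class name and constructor arguments are assumptions made for illustration; the actual implementation wires DINOv2, Deformable DETR, a region encoder, and Vicuna-7B together with additional projection and token-merging logic not shown here.

```python
# Minimal sketch of a Groma-style forward pass, assuming PyTorch.
# GromaStyleModel and its constructor arguments are hypothetical names.
import torch
import torch.nn as nn

class GromaStyleModel(nn.Module):
    def __init__(self, image_encoder, region_proposer, region_encoder, llm):
        super().__init__()
        self.image_encoder = image_encoder      # e.g. a DINOv2 ViT backbone
        self.region_proposer = region_proposer  # e.g. a Deformable-DETR-style head
        self.region_encoder = region_encoder    # pools box features into region tokens
        self.llm = llm                          # e.g. Vicuna-7B accepting input embeddings

    def forward(self, image, text_embeds):
        # 1. Encode the image into patch features and keep a downsampled
        #    subset as global image tokens (placeholder striding here).
        feat_map = self.image_encoder(image)                   # (B, N_patches, C)
        global_tokens = feat_map[:, ::4, :]

        # 2. Propose regions of interest, then encode each box into a region token.
        boxes = self.region_proposer(feat_map)                 # (B, N_regions, 4)
        region_tokens = self.region_encoder(feat_map, boxes)   # (B, N_regions, C)

        # 3. Interleave global tokens, region tokens, and text embeddings and
        #    let the language model produce a (possibly grounded) response.
        inputs = torch.cat([global_tokens, region_tokens, text_embeds], dim=1)
        return self.llm(inputs_embeds=inputs)
```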

Empirical Evaluation

Groma shows clear gains in localization tasks, especially when evaluated on referring expression benchmarks such as RefCOCO, RefCOCO+, and RefCOCOg. The model not only excels at generating contextually and spatially accurate textual descriptions but also demonstrates robust performance on complex datasets like LVIS, which tests object recognition and localization across diverse and cluttered scenes.

Training Strategy

The authors propose a three-stage training strategy (an illustrative sketch follows the list):

  1. Detection Pretraining: Focuses on leveraging extensive detection datasets to refine the capabilities of the region proposer.
  2. Alignment Pretraining: Aims at aligning the vision-language model by utilizing a mixture of image-caption, grounded caption, and region caption datasets.
  3. Instruction Finetuning: Utilizes high-quality datasets and instruction-based tasks to enhance the model's performance in instruction-following contexts.
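
As a rough illustration of how these stages could be organized, the sketch below freezes and unfreezes submodules of the hypothetical GromaStyleModel from the architecture section. The specific frozen/trainable assignments and dataset labels are assumptions for clarity, not the paper's exact recipe.

```python
# Illustrative staging only: which modules train in each stage is an
# assumption, not the paper's exact recipe. Module names refer to the
# hypothetical GromaStyleModel sketched earlier.
STAGES = [
    ("detection_pretraining",  {"train": ["region_proposer"],
                                "data": "large-scale detection datasets"}),
    ("alignment_pretraining",  {"train": ["region_encoder", "llm"],
                                "data": "image-caption, grounded-caption, and region-caption data"}),
    ("instruction_finetuning", {"train": ["llm"],
                                "data": "high-quality grounded instruction data"}),
]

def train_staged(model, stages=STAGES):
    for name, cfg in stages:
        # Freeze everything, then unfreeze only this stage's modules.
        for p in model.parameters():
            p.requires_grad = False
        for module_name in cfg["train"]:
            for p in getattr(model, module_name).parameters():
                p.requires_grad = True
        print(f"{name}: training {cfg['train']} on {cfg['data']}")
        # ... run the stage's objective (detection loss or next-token
        #     prediction) over the corresponding data mixture here.
```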

Implications and Future Work

The introduction of Groma pushes the boundaries of what MLLMs can achieve in terms of region-specific understanding and grounding. The model sets a precedent for how localized visual tokenization can enable more nuanced interactions between visual and textual data, leading to better performance in applications that require detailed visual understanding.

Furthermore, the methodological innovations in Groma provide a pathway for future research, potentially leading to more sophisticated architectures that could seamlessly integrate even more diverse modalities beyond vision and text, such as audio or sensory data. The training strategies outlined could also serve as a blueprint for developing more robust and context-aware AI systems, particularly useful in domains like autonomous driving, robotic navigation, and interactive educational technologies.

Conclusion

Groma represents a significant step forward in the development of MLLMs with fine-grained visual perception abilities. By embedding localization directly into the tokenization process, Groma enhances not only the model's efficiency but also its efficacy in understanding and interacting with the visual world in a contextually relevant manner. The strides made in this research illuminate promising directions for future advancements in AI, fueling further exploration into the capabilities of multimodal systems.
