GROUNDHOG: Grounding Large Language Models to Holistic Segmentation

Published 26 Feb 2024 in cs.CV, cs.AI, and cs.CL | (2402.16846v2)

Abstract: Most multimodal LLMs (MLLMs) learn language-to-object grounding through causal language modeling where grounded objects are captured by bounding boxes as sequences of location tokens. This paradigm lacks pixel-level representations that are important for fine-grained visual understanding and diagnosis. In this work, we introduce GROUNDHOG, an MLLM developed by grounding LLMs to holistic segmentation. GROUNDHOG incorporates a masked feature extractor and converts extracted features into visual entity tokens for the MLLM backbone, which then connects groundable phrases to unified grounding masks by retrieving and merging the entity masks. To train GROUNDHOG, we carefully curated M3G2, a grounded visual instruction tuning dataset with Multi-Modal Multi-Grained Grounding, by harvesting a collection of segmentation-grounded datasets with rich annotations. Our experimental results show that GROUNDHOG achieves superior performance on various language grounding tasks without task-specific fine-tuning, and significantly reduces object hallucination. GROUNDHOG also demonstrates better grounding towards complex forms of visual input and provides easy-to-understand diagnosis in failure cases.

Authors (6)

Summary

The paper presents a multimodal LLM that uses a masked feature extractor to achieve pixel-level phrase grounding, moving beyond traditional bounding box methods.
It integrates holistic segmentation and entity tokens to provide interpretable grounding masks with associated confidence scores, effectively reducing object hallucination.
Experiments on benchmarks such as Flickr30K Entities and TextVQA-X show significant improvements in language quality and grounding accuracy without task-specific fine-tuning.

Grounding LLMs to Holistic Segmentation

The paper introduces "Groundhog," a multimodal LLM (MLLM) that grounds language at the pixel level through holistic segmentation. This approach addresses the limitations of conventional language-to-object grounding, which represents objects as bounding boxes, by offering more precise and interpretable segmentation masks. Groundhog uses a masked feature extractor that converts image features into visual entity tokens, which the MLLM then connects to grounding masks. Because the model operates on mask proposals rather than box coordinates, it can be paired with various mask proposal networks, including the Segment Anything Model (SAM). This lets Groundhog ground language at different visual granularities.
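To make the idea of turning mask proposals into entity features concrete, here is a minimal sketch using masked average pooling. This is an assumption for illustration: the summary does not specify how Groundhog's masked feature extractor pools features, and the function name `masked_average_pool` and the array shapes are hypothetical.

```python
import numpy as np


def masked_average_pool(features: np.ndarray, masks: np.ndarray) -> np.ndarray:
    """Pool one feature vector per mask proposal (illustrative, not the paper's code).

    features: (H, W, C) image feature map.
    masks:    (N, H, W) binary mask proposals.
    Returns:  (N, C) array with one pooled feature per mask.
    """
    masks = masks.astype(features.dtype)
    # Sum the features over the pixels covered by each mask: (N, H, W) x (H, W, C) -> (N, C).
    summed = np.einsum("nhw,hwc->nc", masks, features)
    # Divide by the mask area; clamp to 1 so an empty mask yields a zero vector.
    area = np.maximum(masks.sum(axis=(1, 2)), 1.0)[:, None]
    return summed / area


# Toy 2x2 feature map with 3 channels and two mask proposals.
features = np.arange(2 * 2 * 3, dtype=np.float64).reshape(2, 2, 3)
masks = np.array([
    [[1, 0], [0, 0]],  # proposal 0 covers the top-left pixel
    [[0, 0], [1, 1]],  # proposal 1 covers the bottom row
])
entity_features = masked_average_pool(features, masks)
print(entity_features)
```

In a full model, each pooled vector would presumably be projected into the LLM's embedding space to form a visual entity token; that projection is omitted here.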

The paper's method constructs entity features from binary masks, then retrieves and merges entity masks in response to grounding queries. Because each grounding is expressed as a selection over proposed masks, the process is interpretable and transparent: users can visualize the confidence score associated with each proposed mask. This approach not only improves grounding accuracy but also reduces object hallucination, a notable challenge in multimodal models.
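The retrieve-and-merge step can be sketched as follows. This is a simplified illustration under stated assumptions, not the paper's implementation: scoring by a sigmoid over dot products, merging by a pixel-wise maximum of score-weighted masks, the 0.5 threshold, and the name `merge_entity_masks` are all hypothetical choices.

```python
import numpy as np


def merge_entity_masks(query, entity_features, entity_masks, threshold=0.5):
    """Score each mask proposal against a grounding query and merge the matches.

    query:           (C,) embedding of the groundable phrase (assumed form).
    entity_features: (N, C) one feature vector per mask proposal.
    entity_masks:    (N, H, W) binary mask proposals.
    Returns per-proposal confidence scores (N,) and a merged boolean mask (H, W).
    """
    # Retrieval: a confidence score in (0, 1) for every proposal.
    logits = entity_features @ query
    scores = 1.0 / (1.0 + np.exp(-logits))
    # Merging: weight each mask by its score and take the pixel-wise maximum.
    soft_mask = (scores[:, None, None] * entity_masks).max(axis=0)
    return scores, soft_mask >= threshold


entity_features = np.array([[4.0, 0.0], [2.0, 0.0], [-4.0, 0.0]])
entity_masks = np.array([
    [[1, 0], [0, 0]],
    [[0, 1], [0, 0]],
    [[0, 0], [1, 1]],
])
query = np.array([1.0, 0.0])
scores, merged = merge_entity_masks(query, entity_features, entity_masks)
print(scores.round(3))
print(merged)
```

The per-proposal `scores` are what make failure cases easy to inspect: a wrong grounding can be traced to a specific proposal that received a high or low score.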

For training data, the paper introduces M3G2, a dataset built from existing segmentation-grounded datasets. M3G2 comprises 2.5 million text-image pairs, organized into four main task types: Grounded Image Captioning (GIC), Referential Expression Segmentation (RES), Grounded Visual Question Answering (GVQA), and Referential Dialogue (RD). Groundhog is instruction-tuned on this dataset, which exposes it to diverse grounding scenarios.

In the experiments, Groundhog demonstrates superior performance on various benchmarks covering the RES, GIC, GVQA, and RD tasks, without task-specific fine-tuning. For instance, on the Flickr30K Entities dataset, Groundhog improves both language quality and grounding accuracy. On the TextVQA-X benchmark for visual text QA, Groundhog surpasses specialist models by a substantial margin. Groundhog also achieves competitive performance on the RIO and ReasonSeg datasets, which require deep reasoning and contextual understanding.

This research suggests that integrating holistic segmentation with LLMs can substantially improve the grounding capability of MLLMs, paving the way for more precise vision-language interactions. Pixel-level alignment also improves the model's diagnostic capability, making it easier for users to identify and understand failure cases. The paper acknowledges limitations, however, and suggests exploring extensions to video and 3D modalities for broader applicability.

Future developments could involve scaling the dataset to web data to capture a wider range of visual semantics and entities. Additionally, exploring language-guided segmentation models that leverage holistic segmentation could lead to significant advancements in the field of AI-driven image understanding. Overall, Groundhog establishes an important step toward developing efficient, transparent, and versatile multimodal models.
