Referring Transformer: A One-step Approach to Multi-task Visual Grounding

Published 6 Jun 2021 in cs.CV | (2106.03089v2)

Abstract: As an important step towards visual reasoning, visual grounding (e.g., phrase localization, referring expression comprehension/segmentation) has been widely explored. Previous approaches to referring expression comprehension (REC) or segmentation (RES) either suffer from limited performance, due to a two-stage setup, or require the designing of complex task-specific one-stage architectures. In this paper, we propose a simple one-stage multi-task framework for visual grounding tasks. Specifically, we leverage a transformer architecture, where two modalities are fused in a visual-lingual encoder. In the decoder, the model learns to generate contextualized lingual queries which are then decoded and used to directly regress the bounding box and produce a segmentation mask for the corresponding referred regions. With this simple but highly contextualized model, we outperform state-of-the-art methods by a large margin on both REC and RES tasks. We also show that a simple pre-training schedule (on an external dataset) further improves the performance. Extensive experiments and ablations illustrate that our model benefits greatly from contextualized information and multi-task training.


Authors (2)

Citations (171)


Summary

The paper introduces the Referring Transformer, a one-step approach using a transformer architecture for multi-task visual grounding, combining referring expression comprehension and segmentation.
Empirical results show substantial improvements over state-of-the-art methods across multiple datasets, with performance gains up to 19.4% on the RefCOCO dataset for RES.
This unified framework reduces the design complexity of visual grounding by removing the need for dense anchor definitions, and it can streamline applications such as image captioning and visual question answering.


The paper introduces a framework for multi-task visual grounding, specifically referring expression comprehension (REC) and referring expression segmentation (RES), built as a unified one-stage transformer architecture. The model is a step towards reducing the design complexity traditionally associated with visual grounding tasks.

The proposed "Referring Transformer" combines visual and linguistic modalities in a single-stage transformer-based architecture. A visual-lingual encoder fuses the two modalities, and a decoder turns contextualized lingual queries into a bounding box and a segmentation mask for each referred region. The main innovation is this highly contextualized fusion, which improves upon both previous two-stage methods and task-specific one-stage architectures. A further strength is that REC and RES benefit from each other when the model is trained on both tasks jointly.
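The fusion step can be illustrated with a toy example. The sketch below assumes that fusion means running self-attention over one joint sequence of visual and word tokens; the summary does not spell out the exact mechanism, so this is an illustration of the general idea rather than the paper's encoder. The feature values, the 4-dimensional token size, and the identity projections are all hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(tokens):
    """Single-head scaled dot-product self-attention.

    Queries, keys and values are the tokens themselves (identity
    projections), which keeps the toy example free of learned weights.
    """
    dim = len(tokens[0])
    outputs, weights = [], []
    for query in tokens:
        scores = [
            sum(q * k for q, k in zip(query, key)) / math.sqrt(dim)
            for key in tokens
        ]
        row = softmax(scores)
        weights.append(row)
        outputs.append([
            sum(w * tok[i] for w, tok in zip(row, tokens))
            for i in range(dim)
        ])
    return outputs, weights


# Hypothetical 4-d features: three image patches and two words.
visual_tokens = [[1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0], [0.0, 0.0, 1.0, 0.0]]
word_tokens = [[0.9, 0.1, 0.0, 0.0], [0.0, 0.0, 0.8, 0.2]]

# Assumed fusion: one joint sequence, so every token attends across modalities.
joint_sequence = visual_tokens + word_tokens
fused, attention = self_attention(joint_sequence)

# Attention that the first word places on each of the three image patches.
num_visual = len(visual_tokens)
first_word_over_patches = attention[num_visual][:num_visual]
print(first_word_over_patches)
```

In this toy setup the first word's feature vector is closest to the first patch, so that patch receives the largest share of the word's attention among the visual tokens. A real encoder would add learned projections, multiple heads, and positional information.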

The empirical results reveal substantial improvements over state-of-the-art methods for both REC and RES tasks across several datasets, such as RefCOCO, RefCOCO+, and RefCOCOg, with performance gains ranging from 8.5% for REC to 19.4% for RES on the RefCOCO dataset. These results underscore the potential of end-to-end architectures to optimize feature representation and learning.
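The summary does not say which metrics these percentages refer to. On RefCOCO-style benchmarks, REC is conventionally scored as the fraction of predictions whose box IoU with the ground truth exceeds 0.5, and RES with an IoU-based mask score. Assuming those conventions, a minimal sketch of both measures looks like this; the function names and sample data are hypothetical.

```python
def box_iou(a, b):
    """IoU of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def rec_accuracy(pred_boxes, gt_boxes, threshold=0.5):
    """Fraction of expressions whose predicted box has IoU above the threshold."""
    hits = sum(box_iou(p, g) > threshold for p, g in zip(pred_boxes, gt_boxes))
    return hits / len(gt_boxes)


def mask_iou(pred_mask, gt_mask):
    """IoU of two binary masks given as equally sized nested lists of 0/1."""
    inter = union = 0
    for pred_row, gt_row in zip(pred_mask, gt_mask):
        for p, g in zip(pred_row, gt_row):
            inter += 1 if (p and g) else 0
            union += 1 if (p or g) else 0
    return inter / union if union else 0.0


# Hypothetical predictions: the first box is exact, the second overlaps by a third.
pred_boxes = [(0, 0, 10, 10), (0, 0, 10, 10)]
gt_boxes = [(0, 0, 10, 10), (5, 0, 15, 10)]
print(rec_accuracy(pred_boxes, gt_boxes))

# Hypothetical 2x2 masks: one pixel overlaps out of two covered pixels.
print(mask_iou([[1, 1], [0, 0]], [[1, 0], [0, 0]]))
```

Under these definitions, a gain on REC means more expressions cross the IoU threshold, while a gain on RES means the predicted masks overlap the ground truth more closely.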

One of the model’s critical advantages is its simplicity: it removes the need for dense anchor definitions or Hungarian matching, which improves robustness and convergence speed. The approach also scales, since pre-training on external datasets further improves performance. This highlights the importance of well-aligned cross-modal representations in pre-training.
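To see why no matching step is needed, note that each lingual query corresponds to exactly one referring expression, so prediction i can be compared directly with target i. The sketch below assumes a box loss made of an L1 term plus a generalized IoU (GIoU) term, a common choice for transformer-based box regression; the summary does not state the paper's actual loss, and the weights 5.0 and 2.0 are hypothetical.

```python
def giou(a, b):
    """Generalized IoU of two (x1, y1, x2, y2) boxes with positive area."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    # Smallest axis-aligned box that encloses both inputs.
    hull = (max(a[2], b[2]) - min(a[0], b[0])) * (max(a[3], b[3]) - min(a[1], b[1]))
    return inter / union - (hull - union) / hull


def box_loss(pred, target, l1_weight=5.0, giou_weight=2.0):
    """Weighted L1 + GIoU loss for one box pair (weights are hypothetical)."""
    l1 = sum(abs(p - t) for p, t in zip(pred, target))
    return l1_weight * l1 + giou_weight * (1.0 - giou(pred, target))


def batch_loss(pred_boxes, target_boxes):
    """Mean loss over expressions; pairing is by index, so no assignment step."""
    losses = [box_loss(p, t) for p, t in zip(pred_boxes, target_boxes)]
    return sum(losses) / len(losses)


# Hypothetical normalized boxes, one prediction per referring expression.
pred_boxes = [(0.10, 0.10, 0.50, 0.50), (0.20, 0.30, 0.60, 0.90)]
target_boxes = [(0.10, 0.10, 0.50, 0.50), (0.25, 0.30, 0.65, 0.85)]
print(batch_loss(pred_boxes, target_boxes))
```

By contrast, a set-prediction detector that emits many candidate boxes per image has to decide which prediction answers for which object before a loss can be computed, which is the role Hungarian matching plays there.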

The implications of this research are multifaceted. Practically, this unified framework can streamline visual comprehension systems in applications such as image captioning and visual question-answering, where grounding tasks are central. Theoretically, the integration of strong contextual reasoning in model design offers insights into improving vision-language co-processing.

Looking ahead, exploring adaptive pre-training strategies and handling complex queries that refer to multiple image regions are promising avenues for further enhancing the model's capabilities. Given the rapid evolution of multi-modal transformers, such advancements could lead to even broader applications in AI-driven visual understanding systems.

Overall, the "Referring Transformer" represents a significant contribution towards simplifying complex visual grounding tasks while achieving substantial performance gains. It sets a precedent for future research to build more efficient and scalable models that can handle multi-modal tasks within a unified framework.
