LAVT: Language-Aware Vision Transformer for Referring Image Segmentation

Published 4 Dec 2021 in cs.CV and cs.CL | (2112.02244v2)

Abstract: Referring image segmentation is a fundamental vision-language task that aims to segment out an object referred to by a natural language expression from an image. One of the key challenges behind this task is leveraging the referring expression for highlighting relevant positions in the image. A paradigm for tackling this problem is to leverage a powerful vision-language ("cross-modal") decoder to fuse features independently extracted from a vision encoder and a language encoder. Recent methods have made remarkable advancements in this paradigm by exploiting Transformers as cross-modal decoders, concurrent to the Transformer's overwhelming success in many other vision-language tasks. Adopting a different approach in this work, we show that significantly better cross-modal alignments can be achieved through the early fusion of linguistic and visual features in intermediate layers of a vision Transformer encoder network. By conducting cross-modal feature fusion in the visual feature encoding stage, we can leverage the well-proven correlation modeling power of a Transformer encoder for excavating helpful multi-modal context. This way, accurate segmentation results are readily harvested with a light-weight mask predictor. Without bells and whistles, our method surpasses the previous state-of-the-art methods on RefCOCO, RefCOCO+, and G-Ref by large margins.

Authors (6)

Summary

The paper introduces an early fusion method that integrates language cues during visual encoding using a hierarchical Transformer and pixel-word attention module.
It achieves substantial overall IoU improvements on RefCOCO, RefCOCO+, and G-Ref, including a gain of 7.08 percentage points (absolute) on RefCOCO over previous methods.
The study demonstrates the benefits of fusing language into the visual encoder for cross-modal tasks, with possible relevance to dialogue systems and human-robot interaction.

This paper introduces the Language-Aware Vision Transformer (LAVT), a framework for referring image segmentation, where the goal is to produce a mask for the image region described by a given natural language expression. LAVT departs from the prevailing paradigm, in which separate vision and language encoders feed a cross-modal decoder. Instead, it fuses linguistic features into the visual features during the visual encoding stage, inside the Transformer encoder, which the authors show yields better cross-modal alignment.

Method Overview

LAVT adopts a hierarchical vision Transformer backbone and fuses linguistic and visual features throughout the encoding process. The backbone's Transformer encoding layers are organized into four stages, and at each stage the visual features are enriched with relevant linguistic context. The framework includes a pixel-word attention module (PWAM) that aligns language features with the visual features at each spatial position in the image. The resulting language information is integrated through a language pathway, whose learnable gating mechanism regulates how much of it flows into the visual features.

A lightweight mask predictor then uses these language-aware visual features to produce the segmentation mask.
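
The pixel-word attention and gated fusion described above can be illustrated with a minimal sketch in plain Python. It is not the authors' implementation: the function names are hypothetical, the learned projections of the real module are omitted, and the scalar `gate_weight` is an assumed stand-in for the learnable gate. It only shows the general idea: each pixel attends over the words, and a tanh gate scales how much of the attended language feature is added back to the visual feature.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def pixel_word_attention(pixels, words):
    """For every pixel vector, return an attention-weighted mix of word vectors.

    pixels: list of pixel feature vectors (one per spatial position)
    words:  list of word feature vectors (same dimensionality as pixels)
    """
    dim = len(words[0])
    scale = math.sqrt(dim)
    attended = []
    for pixel in pixels:
        # Scaled dot-product score between this pixel and each word.
        scores = [sum(a * b for a, b in zip(pixel, word)) / scale
                  for word in words]
        weights = softmax(scores)
        # Weighted sum of word vectors, one value per feature dimension.
        attended.append([
            sum(wt * word[d] for wt, word in zip(weights, words))
            for d in range(dim)
        ])
    return attended


def gated_fusion(pixels, attended, gate_weight=1.0):
    """Add gated language features to the visual features (residual form).

    gate_weight is a hypothetical scalar standing in for a learned gate.
    """
    fused = []
    for pixel, lang in zip(pixels, attended):
        fused.append([
            pv + math.tanh(gate_weight * lv) * lv
            for pv, lv in zip(pixel, lang)
        ])
    return fused


pixels = [[1.0, 0.0], [0.0, 1.0]]   # two spatial positions
words = [[1.0, 0.0], [0.0, 1.0]]    # two word embeddings
attended = pixel_word_attention(pixels, words)
fused = gated_fusion(pixels, attended)
```

With `gate_weight=0.0` the gate closes and the visual features pass through unchanged, which is the behaviour a gating mechanism is meant to allow when the language signal is unhelpful.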

Performance and Results

LAVT is evaluated on three benchmark datasets: RefCOCO, RefCOCO+, and G-Ref. The results show a substantial improvement over previous state-of-the-art methods. LAVT achieves an overall IoU of 72.73%, 62.14%, and 61.24% on the validation sets of RefCOCO, RefCOCO+, and G-Ref, respectively. On RefCOCO, for example, this is a gain of 7.08 percentage points (absolute) over competing methods. Ablation studies evaluate individual components such as the language pathway and the pixel-word attention mechanism, and confirm that each contributes to the framework's performance.
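
For readers unfamiliar with the metric, the sketch below shows how an overall IoU figure can be computed, assuming the usual definition for this task: total intersection over total union, accumulated across all test samples rather than averaged per sample. The function name and the toy masks are illustrative and do not come from the paper.

```python
def overall_iou(predictions, targets):
    """Cumulative intersection over cumulative union across all samples.

    predictions, targets: lists of binary masks, each mask a list of rows
    containing 0/1 values.
    """
    total_inter = 0
    total_union = 0
    for pred, gt in zip(predictions, targets):
        for pred_row, gt_row in zip(pred, gt):
            for p, g in zip(pred_row, gt_row):
                total_inter += int(bool(p) and bool(g))
                total_union += int(bool(p) or bool(g))
    # Guard against the degenerate case of all-empty masks.
    return total_inter / total_union if total_union else 0.0


# Toy example: two 2x2 masks.
preds = [[[1, 1], [0, 0]], [[0, 0], [1, 1]]]
gts = [[[1, 0], [0, 0]], [[0, 0], [1, 1]]]
score = overall_iou(preds, gts)  # (1 + 2) / (2 + 2) = 0.75
```

Because intersections and unions are summed before dividing, samples with large target regions weigh more heavily in this metric than they would in a per-sample mean IoU.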

Theoretical and Practical Implications

The proposed early feature fusion within the Transformer encoder aligns visual and linguistic cues more effectively than decoder-side fusion, demonstrating the potential of Transformers for cross-modal tasks beyond classification and detection in vision. This approach could encourage a shift toward architectures that fuse modalities inside the encoder in other multimodal tasks, with possible applications in cross-modal dialogue systems and human-robot interaction.

The implementation details and ablation studies clarify LAVT's contributions and its underlying methodology, and point to areas for future exploration. Further research might optimize such fusion strategies for other application contexts or investigate architectures that fuse other modalities, with the aim of improving the generality and robustness of systems that work with heterogeneous data.
