Fully Transformer Networks for Semantic Image Segmentation (2106.04108v3)

Published 8 Jun 2021 in cs.CV

Abstract: Transformers have shown impressive performance in various natural language processing and computer vision tasks, due to the capability of modeling long-range dependencies. Recent progress has demonstrated that combining such Transformers with CNN-based semantic image segmentation models is very promising. However, it is not well studied yet on how well a pure Transformer based approach can achieve for image segmentation. In this work, we explore a novel framework for semantic image segmentation, which is encoder-decoder based Fully Transformer Networks (FTN). Specifically, we first propose a Pyramid Group Transformer (PGT) as the encoder for progressively learning hierarchical features, meanwhile reducing the computation complexity of the standard Visual Transformer (ViT). Then, we propose a Feature Pyramid Transformer (FPT) to fuse semantic-level and spatial-level information from multiple levels of the PGT encoder for semantic image segmentation. Surprisingly, this simple baseline can achieve better results on multiple challenging semantic segmentation and face parsing benchmarks, including PASCAL Context, ADE20K, COCOStuff, and CelebAMask-HQ. The source code will be released on https://github.com/BR-IDL/PaddleViT.

Summary

The paper introduces a novel fully Transformer architecture that replaces traditional CNN components with a Pyramid Group Transformer encoder and Feature Pyramid Transformer decoder.
It achieves improved segmentation accuracy with notable mIoU gains on benchmarks such as PASCAL Context, ADE20K, and COCO-Stuff.
The study challenges conventional CNN-Transformer hybrids and paves the way for future research in efficient, pure Transformer approaches for dense prediction tasks.

Fully Transformer Networks for Semantic Image Segmentation

This paper studies the effectiveness of Fully Transformer Networks (FTN) for semantic image segmentation, departing from traditional CNN-Transformer hybrid models. The authors propose an encoder-decoder architecture with two novel components: the Pyramid Group Transformer (PGT) as the encoder and the Feature Pyramid Transformer (FPT) as the decoder.

Contributions

Pyramid Group Transformer (PGT): PGT is designed to efficiently learn hierarchical features by progressively increasing the receptive fields in a pyramid pattern. This approach contrasts with the global receptive field maintained in standard Visual Transformers (ViT). By grouping feature maps and applying self-attention within each group, it controls complexity while retaining the ability to model intricate patterns at various stages.
Feature Pyramid Transformer (FPT): Acting as the decoder, FPT effectively fuses multi-level semantic and spatial information from the PGT encoder. The architecture leverages Transformers' long-range dependency modeling to enhance contextual information capture, crucial for pixel-level accuracy in segmentation tasks.
Benchmark Evaluation: FTN is reported to achieve better results than prior approaches on several benchmarks, including PASCAL Context, ADE20K, COCO-Stuff, and CelebAMask-HQ. The reported mean Intersection over Union (mIoU) scores are 56.05% on PASCAL Context, 51.36% on ADE20K, and 45.89% on COCO-Stuff.
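
For readers unfamiliar with the metric, the following is a minimal sketch of how mean Intersection over Union can be computed from flattened label maps. It is not taken from the paper or from PaddleViT. The function name `mean_iou`, the `ignore_index` value of 255, and the choice to skip classes absent from both prediction and ground truth are assumptions made for illustration. Benchmark toolkits usually accumulate counts over the whole dataset rather than per image.

```python
from typing import List, Sequence


def mean_iou(pred: Sequence[int], target: Sequence[int],
             num_classes: int, ignore_index: int = 255) -> float:
    """Mean IoU over the classes that occur in pred or target.

    pred and target are flattened per-pixel class labels. Pixels whose
    ground-truth label equals ignore_index (an assumed convention) are skipped.
    """
    if len(pred) != len(target):
        raise ValueError("pred and target must have the same length")

    intersection: List[int] = [0] * num_classes
    pred_count: List[int] = [0] * num_classes
    target_count: List[int] = [0] * num_classes

    for p, t in zip(pred, target):
        if t == ignore_index:
            continue
        pred_count[p] += 1
        target_count[t] += 1
        if p == t:
            intersection[p] += 1

    ious: List[float] = []
    for c in range(num_classes):
        # |A union B| = |A| + |B| - |A intersect B|
        union = pred_count[c] + target_count[c] - intersection[c]
        if union > 0:  # skip classes absent from both maps
            ious.append(intersection[c] / union)

    if not ious:
        raise ValueError("no valid pixels to evaluate")
    return sum(ious) / len(ious)


if __name__ == "__main__":
    # Class 0: IoU = 1/2, class 1: IoU = 2/3, so the mean is 7/12.
    print(mean_iou([0, 0, 1, 1], [0, 1, 1, 1], num_classes=2))
```

A score such as 56.05% corresponds to this quantity multiplied by 100 and averaged over all classes of the benchmark.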

Theoretical Implications

The proposition of a pure Transformer-based approach for image segmentation challenges the prevailing notion of integrating CNN layers for spatial information recovery. The PGT's ability to manage computational costs while enhancing feature representation indicates potential shifts in model design for dense prediction tasks.
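
The cost argument behind applying self-attention within groups can be illustrated with a small counting sketch. It assumes, purely for illustration, that groups are non-overlapping square windows on the token grid. The summary only states that feature maps are grouped and attention is applied within each group, so the exact grouping scheme, the helper names `partition_into_groups` and `attention_pairs`, and the grid sizes below are hypothetical.

```python
from typing import List


def partition_into_groups(height: int, width: int, group_size: int) -> List[List[int]]:
    """Split an H x W token grid into non-overlapping square groups.

    Returns one list of flat token indices (row * width + col) per group.
    """
    if height % group_size or width % group_size:
        raise ValueError("grid dimensions must be divisible by group_size")
    groups: List[List[int]] = []
    for top in range(0, height, group_size):
        for left in range(0, width, group_size):
            groups.append([
                row * width + col
                for row in range(top, top + group_size)
                for col in range(left, left + group_size)
            ])
    return groups


def attention_pairs(groups: List[List[int]]) -> int:
    """Count query-key pairs scored when attention is restricted to each group."""
    return sum(len(group) ** 2 for group in groups)


if __name__ == "__main__":
    height = width = 8  # 64 tokens in total
    # A single group covering the whole grid is equivalent to global attention.
    global_cost = attention_pairs(partition_into_groups(height, width, 8))
    for group_size in (2, 4, 8):
        cost = attention_pairs(partition_into_groups(height, width, group_size))
        print(f"group_size={group_size}: {cost} pairs "
              f"({global_cost // cost}x fewer than global)")
```

With N tokens split into groups of g tokens each, the number of scored pairs falls from N^2 to N * g. Larger groups in later stages widen the receptive field at a higher cost, which is one way to read the pyramid pattern described for PGT.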

Practical Implications and Future Directions

The framework sets a precedent for deploying Transformers in tasks traditionally dominated by CNN architectures. Future developments may consider expanding FTN for real-time applications, optimizing training for various hardware architectures, or exploring hybrid strategies that leverage the strengths of both Transformer and CNN models in specific task components.

Conclusion

The FTN approach underscores the versatility and potency of Transformer architectures in semantic image segmentation. By innovatively structuring the encoder and decoder, the paper contributes valuable insights to the evolving discussion on Transformer viability in computer vision tasks. The paper's findings suggest substantial opportunities for further research in refined Transformer designs and cross-disciplinary applications.

GitHub

GitHub - BR-IDL/PaddleViT: State-of-the-art Visual Transformer and MLP Models for PaddlePaddle 2.0+