Semantic Segmentation using Vision Transformers: A survey

Published 5 May 2023 in cs.CV, cs.AI, and cs.LG | (2305.03273v1)

Abstract: Semantic segmentation has a broad range of applications in a variety of domains including land coverage analysis, autonomous driving, and medical image analysis. Convolutional neural networks (CNN) and Vision Transformers (ViTs) provide the architecture models for semantic segmentation. Even though ViTs have proven success in image classification, they cannot be directly applied to dense prediction tasks such as image segmentation and object detection since ViT is not a general purpose backbone due to its patch partitioning scheme. In this survey, we discuss some of the different ViT architectures that can be used for semantic segmentation and how their evolution managed the above-stated challenge. The rise of ViT and its performance with a high success rate motivated the community to slowly replace the traditional convolutional neural networks in various computer vision tasks. This survey aims to review and compare the performances of ViT architectures designed for semantic segmentation using benchmarking datasets. This will be worthwhile for the community to yield knowledge regarding the implementations carried out in semantic segmentation and to discover more efficient methodologies using ViTs.




Summary

The survey reports that ViT-based models, including SETR and SegFormer, achieve competitive segmentation performance on benchmarks such as ADE20K and Cityscapes.
It reviews architectural innovations that address challenges such as the patch partitioning scheme and the high computational cost of self-attention in Transformers.
The survey outlines promising future directions for integrating Transformer strategies into diverse, high-resolution semantic segmentation applications.

Semantic Segmentation using Vision Transformers: A Survey

Semantic segmentation, a crucial component of computer vision, entails assigning a label to each pixel in an image, thereby identifying different objects or regions. Its applications are diverse, spanning areas such as land cover analysis, autonomous driving, and medical imaging. This survey focuses specifically on employing Vision Transformers (ViTs) rather than traditional convolutional neural networks (CNNs) for semantic segmentation tasks.

The paper examines a range of ViT architectures tailored for semantic segmentation and the challenges they face in dense prediction tasks. Its central point is that a plain ViT is hard to apply directly to such tasks: its patch partitioning scheme was designed for image classification and produces a single coarse grid of tokens rather than per-pixel features. Given the established efficacy of ViTs in classification, the survey investigates how architectural changes and hybrid designs have been used to adapt ViTs for segmentation.
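
To make the patch partitioning issue concrete, the following is a minimal NumPy sketch of how a ViT-style front end turns an image into tokens. The function name `patchify`, the 224x224 input, and the 16x16 patch size are illustrative assumptions, not details taken from the survey; a real model would also apply a learned linear projection and position information to each patch.

```python
import numpy as np


def patchify(image: np.ndarray, patch_size: int) -> np.ndarray:
    """Split an (H, W, C) image into flattened, non-overlapping patches.

    Hypothetical helper for illustration; returns an array of shape
    (num_patches, patch_size * patch_size * C).
    """
    h, w, c = image.shape
    if h % patch_size or w % patch_size:
        raise ValueError("height and width must be divisible by patch_size")
    grid_h, grid_w = h // patch_size, w // patch_size
    # (grid_h, p, grid_w, p, C) -> (grid_h, grid_w, p, p, C)
    patches = image.reshape(grid_h, patch_size, grid_w, patch_size, c)
    patches = patches.transpose(0, 2, 1, 3, 4)
    # One row per patch: each patch becomes a single token.
    return patches.reshape(grid_h * grid_w, patch_size * patch_size * c)


# Assumed example sizes: a 224x224 RGB image with 16x16 patches.
image = np.zeros((224, 224, 3), dtype=np.float32)
tokens = patchify(image, patch_size=16)
print(tokens.shape)  # (196, 768)
```

Under these assumed sizes the encoder sees a 14x14 grid of tokens, while a segmentation output needs a label for each of the 224x224 pixels. Closing that resolution gap is what the decoders and hierarchical encoders discussed below are designed for.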

Key Contributions and Architectural Innovations

SETR: The SEgmentation TRansformer (SETR) replaces convolutions with a pure Transformer framework, introducing a sequence-to-sequence prediction paradigm for segmentation. SETR variants differ based on their decoding strategies, which include progressive up-sampling and multi-level feature aggregation, achieving compelling results on datasets like ADE20K and Cityscapes.
Swin Transformer: To address the quadratic cost of global self-attention, the Swin Transformer builds a hierarchical feature representation and computes self-attention within local windows that are shifted between consecutive layers. This makes the attention cost linear in image size while still allowing information to flow across windows, and yields strong accuracy in both segmentation and detection tasks.
Segmenter: This model replaces CNN backbones with a ViT and includes a mask transformer for decoding, which enhances its ability to incorporate global context, a known limitation of CNN-based models.
SegFormer: Known for its simplicity and efficiency, SegFormer utilizes a hierarchical Transformer encoder and a lightweight MLP decoder, achieving excellent results through its positional encoding-free design, critical for handling images of varying resolutions.
Pyramid Vision Transformer (PVT): The PVT architecture tackles the computational inefficiencies of ViTs by employing a pyramid structure and spatial reduction attention, maintaining precision while easing computational demands.
Twins: Twins incorporates spatially separable self-attention to address both global and local feature interactions without the computational heft associated with high-resolution inputs.
Dense Prediction Transformer (DPT): The DPT demonstrates superior dense predictions through ViT encoders, offering fine-grained outputs crucial for applications like depth estimation and medical imaging.
HRFormer: Operating at high resolution throughout its layers, HRFormer applies depth-wise convolutions for efficient feature extraction, ensuring that fine details are retained in the segmentation outputs.
Mask2Former: A universal segmentation framework that leverages masked attention, improving cross-attention efficiency by focusing on the regions of interest rather than the entire image.
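
Several of the designs above (Swin's windowed attention, PVT's spatial reduction attention) are motivated by the cost of global self-attention on large token grids. The sketch below gives a rough, back-of-the-envelope comparison by counting query-key pairs only. It ignores heads, channel width, and projection layers, and the function names, the 128x128 token grid, the window size of 8, and the reduction ratio of 8 are all illustrative assumptions rather than figures from the survey.

```python
def global_attention_pairs(grid_h: int, grid_w: int) -> int:
    """Every token attends to every token: N * N pairs."""
    n = grid_h * grid_w
    return n * n


def windowed_attention_pairs(grid_h: int, grid_w: int, window: int) -> int:
    """Each token attends only to the tokens in its own window."""
    if grid_h % window or grid_w % window:
        raise ValueError("grid must be divisible by the window size")
    n = grid_h * grid_w
    return n * (window * window)


def spatial_reduction_pairs(grid_h: int, grid_w: int, reduction: int) -> int:
    """Queries stay at full resolution; keys/values are downsampled
    by `reduction` along each spatial dimension."""
    n = grid_h * grid_w
    reduced = (grid_h // reduction) * (grid_w // reduction)
    return n * reduced


# Assumed example: a 128x128 token grid (e.g. a 512x512 image, 4x4 patches).
print(global_attention_pairs(128, 128))        # 268435456
print(windowed_attention_pairs(128, 128, 8))   # 1048576
print(spatial_reduction_pairs(128, 128, 8))    # 4194304
```

In this simplified count, the windowed variant grows linearly with the number of tokens for a fixed window size, whereas global attention grows quadratically. Spatial reduction keeps a global receptive field but shrinks the key/value set by the square of the reduction ratio.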

Comparative Performance

The survey compares benchmark results on standard datasets such as ADE20K, Cityscapes, and PASCAL-Context. The main metric is mean Intersection over Union (mIoU): for each class, the overlap between predicted and ground-truth pixels divided by their union, averaged over classes. On this metric the ViT-based architectures are competitive with traditional CNN-based methods and often surpass them under comparable computational budgets.
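
As a reference for how the metric is computed, here is a minimal NumPy sketch of mIoU for a single pair of label maps. The function name `mean_iou` and the `ignore_index=255` convention are assumptions made for illustration. Benchmark toolkits typically accumulate the confusion matrix over the whole validation set before taking the ratio, which this sketch does not do.

```python
import numpy as np


def mean_iou(pred, target, num_classes, ignore_index=255):
    """Mean Intersection over Union for integer label maps.

    Hypothetical helper: classes absent from both prediction and
    ground truth are skipped rather than counted as zero.
    """
    pred = np.asarray(pred, dtype=np.int64)
    target = np.asarray(target, dtype=np.int64)
    valid = target != ignore_index  # drop unlabeled pixels
    pred, target = pred[valid], target[valid]
    # Confusion matrix: rows are ground truth, columns are predictions.
    confusion = np.bincount(
        target * num_classes + pred, minlength=num_classes ** 2
    ).reshape(num_classes, num_classes)
    intersection = np.diag(confusion)
    union = confusion.sum(axis=0) + confusion.sum(axis=1) - intersection
    present = union > 0
    return float((intersection[present] / union[present]).mean())


target = [[0, 0, 1],
          [1, 2, 2]]
pred = [[0, 1, 1],
        [1, 2, 0]]
# Per-class IoU: class 0 = 1/3, class 1 = 2/3, class 2 = 1/2.
print(mean_iou(pred, target, num_classes=3))
```

For this toy example the per-class values average to 0.5 (up to floating-point rounding).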

Implications and Future Directions

The advancements outlined in the paper suggest a promising pathway for integrating ViTs into more areas of semantic segmentation, promoting the transition from CNNs to a Transformer-based approach across diverse applications. The highlighted architecture adaptations not only alleviate the computational burdens typically associated with Transformers but also promise enhanced segmentation accuracy, essential for real-world applications.

Looking forward, further innovations in ViT designs are anticipated to enable more efficient training regimes, greater scalability, and a broader range of applications, particularly in domains requiring precise, high-resolution outputs. Exploring Transformers in less conventional areas of semantic analysis could prompt new methodologies and efficiency gains within AI and computer vision.
