Rethinking Semantic Segmentation from a Sequence-to-Sequence Perspective with Transformers

Published 31 Dec 2020 in cs.CV | (2012.15840v3)

Abstract: Most recent semantic segmentation methods adopt a fully-convolutional network (FCN) with an encoder-decoder architecture. The encoder progressively reduces the spatial resolution and learns more abstract/semantic visual concepts with larger receptive fields. Since context modeling is critical for segmentation, the latest efforts have been focused on increasing the receptive field, through either dilated/atrous convolutions or inserting attention modules. However, the encoder-decoder based FCN architecture remains unchanged. In this paper, we aim to provide an alternative perspective by treating semantic segmentation as a sequence-to-sequence prediction task. Specifically, we deploy a pure transformer (i.e., without convolution and resolution reduction) to encode an image as a sequence of patches. With the global context modeled in every layer of the transformer, this encoder can be combined with a simple decoder to provide a powerful segmentation model, termed SEgmentation TRansformer (SETR). Extensive experiments show that SETR achieves new state of the art on ADE20K (50.28% mIoU), Pascal Context (55.83% mIoU) and competitive results on Cityscapes. Particularly, we achieve the first position in the highly competitive ADE20K test server leaderboard on the day of submission.

Authors (11)

Citations (2,589)

Summary

The paper introduces SETR to replace conventional FCNs with a pure transformer encoder, overcoming receptive field limitations.
It reshapes images into sequences of patches processed with multi-head self-attention, capturing global context effectively.
Experiments show state-of-the-art results on ADE20K (50.28% mIoU) and Pascal Context (55.83% mIoU), and competitive results on Cityscapes.

The paper "Rethinking Semantic Segmentation from a Sequence-to-Sequence Perspective with Transformers" presents a novel approach to semantic segmentation by utilizing transformers in place of traditional fully convolutional networks (FCNs). The authors assert that the reliance on convolutional encoders inherently limits the receptive field, which is problematic for effective contextual modeling. Instead, they propose treating semantic segmentation as a sequence-to-sequence prediction task, hypothesizing that transformers can address receptive field limitations more effectively.

Model Architecture

In traditional FCN-based approaches to semantic segmentation, the encoder progressively reduces spatial resolution through convolutional layers, which can hinder long-range dependency learning. The paper challenges this architecture by introducing the SEgmentation TRansformer (SETR). SETR treats the input image as a sequence of patches and processes that sequence with a transformer encoder. Because self-attention relates every patch to every other patch, global context is available at each layer, and the encoder no longer needs to reduce spatial resolution progressively to enlarge its receptive field.

Image Sequentialization

The first step decomposes an image into a grid of fixed-size patches and flattens each patch into a vector, which is then mapped to an embedding space by a linear projection. A position embedding specific to each patch is added so that spatial information survives the conversion to a sequence. The resulting sequence of vectors is the input to the transformer encoder, which models global dependencies through self-attention.
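The sequentialization step can be sketched in a few lines of NumPy. This is only an illustration under stated assumptions, not the authors' implementation: the function names, the patch size of 16, the embedding width of 32, and the randomly initialized projection and position embeddings are all hypothetical stand-ins for learned parameters.

```python
import numpy as np

def image_to_patch_sequence(image, patch_size):
    """Split an (H, W, C) image into flattened, non-overlapping patches.

    Returns an array of shape (num_patches, patch_size * patch_size * C),
    ordered row by row over the patch grid.
    """
    h, w, c = image.shape
    if h % patch_size or w % patch_size:
        raise ValueError("image height and width must be divisible by patch_size")
    grid_h, grid_w = h // patch_size, w // patch_size
    # (grid_h, p, grid_w, p, C) -> (grid_h, grid_w, p, p, C)
    patches = image.reshape(grid_h, patch_size, grid_w, patch_size, c)
    patches = patches.transpose(0, 2, 1, 3, 4)
    return patches.reshape(grid_h * grid_w, patch_size * patch_size * c)

def embed_patches(patches, projection, position_embeddings):
    """Linearly project each flattened patch and add its position embedding."""
    return patches @ projection + position_embeddings

rng = np.random.default_rng(0)
image = rng.standard_normal((64, 96, 3))   # toy image, H=64, W=96
patch_size, embed_dim = 16, 32             # assumed values, for illustration only

patches = image_to_patch_sequence(image, patch_size)
# Random stand-ins for parameters that would be learned during training.
projection = rng.standard_normal((patches.shape[1], embed_dim)) * 0.02
position_embeddings = rng.standard_normal((patches.shape[0], embed_dim)) * 0.02

tokens = embed_patches(patches, projection, position_embeddings)
print(patches.shape, tokens.shape)
```

With these toy sizes the 64×96 image becomes a 4×6 grid, so the encoder would receive a sequence of 24 tokens. The sequence length grows with image area divided by patch area, which is why the patch size directly controls the cost of self-attention.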

Transformer Encoder

The core of SETR is a pure transformer encoder that operates on the sequence of patch embeddings. Each layer of the transformer consists of multi-head self-attention (MSA) and multilayer perceptron (MLP) blocks. The use of MSA allows the model to attend to different parts of the sequence with different heads, thereby capturing diverse contextual interactions. The output features from the transformer encoder can then be reshaped back into a spatial feature map suitable for segmentation.
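A single encoder layer and the final reshape can be sketched as follows. This is a simplified NumPy illustration, not the paper's code: it assumes a pre-norm layout with residual connections, omits the learnable scale and shift of layer normalization, uses ReLU as a stand-in for the MLP activation, and draws all weights at random. Every name and size here is hypothetical.

```python
import numpy as np

def layer_norm(x, eps=1e-6):
    # Normalizes each token; learnable scale and shift are omitted for brevity.
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_self_attention(x, w_qkv, w_out, num_heads):
    """x: (L, D). In every head, each token attends to all L tokens."""
    seq_len, dim = x.shape
    head_dim = dim // num_heads
    q, k, v = np.split(x @ w_qkv, 3, axis=-1)

    def split_heads(t):
        # (L, D) -> (heads, L, head_dim)
        return t.reshape(seq_len, num_heads, head_dim).transpose(1, 0, 2)

    q, k, v = split_heads(q), split_heads(k), split_heads(v)
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(head_dim)   # (heads, L, L)
    weights = softmax(scores, axis=-1)
    context = weights @ v                                    # (heads, L, head_dim)
    merged = context.transpose(1, 0, 2).reshape(seq_len, dim)
    return merged @ w_out, weights

def encoder_layer(x, params, num_heads):
    """One layer: MSA block then MLP block, each with a residual connection."""
    attn_out, weights = multi_head_self_attention(
        layer_norm(x), params["w_qkv"], params["w_out"], num_heads
    )
    x = x + attn_out
    hidden = np.maximum(layer_norm(x) @ params["w1"], 0.0)   # ReLU stand-in
    return x + hidden @ params["w2"], weights

rng = np.random.default_rng(0)
grid_h, grid_w, dim, heads = 4, 6, 32, 4   # assumed toy sizes
params = {
    "w_qkv": rng.standard_normal((dim, 3 * dim)) * 0.02,
    "w_out": rng.standard_normal((dim, dim)) * 0.02,
    "w1": rng.standard_normal((dim, 4 * dim)) * 0.02,
    "w2": rng.standard_normal((4 * dim, dim)) * 0.02,
}
tokens = rng.standard_normal((grid_h * grid_w, dim))

encoded, attn = encoder_layer(tokens, params, heads)
# The sequence keeps its length, so it folds back into the patch grid.
feature_map = encoded.reshape(grid_h, grid_w, dim)
print(encoded.shape, attn.shape, feature_map.shape)
```

The attention weights have shape (heads, L, L): each row is a distribution over all patches, which is the sense in which every layer sees global context. Because the layer preserves sequence length, the reshape to a grid is lossless as long as the patches were flattened in row-major order.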

Decoder Designs

The authors introduce three decoder architectures to evaluate the performance of their proposed transformer-based encoder:

Naive Upsampling (Naive): This approach directly projects the transformer's output to the number of target classes and performs bilinear upsampling to generate the final segmentation map.
Progressive Upsampling (PUP): This decoder alternates convolutional layers with bilinear upsampling steps, so the feature map reaches full resolution through several small increments rather than one large jump. The motivation is that one-step upscaling may introduce noisy predictions.
Multi-Level Feature Aggregation (MLA): This decoder aggregates features from multiple layers of the transformer encoder. Unlike feature pyramid networks, which fuse feature maps of different resolutions, all aggregated features here share the same spatial resolution, because the transformer never downsamples; they differ only in the depth at which they were produced.
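The three decoders differ mainly in how they get from the patch grid back to pixel resolution, which a shape-level sketch can make concrete. The NumPy code below is a loose illustration under several assumptions that are not from the summary above: a patch grid 16 times smaller than the image, four 2× steps in the progressive variant, per-position linear maps standing in for real convolutions, a deep-to-shallow additive fusion in the multi-level variant, and random weights throughout. All function names are hypothetical.

```python
import numpy as np

def bilinear_upsample(x, scale):
    """Bilinear resize of an (H, W, C) array by an integer factor (corners aligned)."""
    h, w, _ = x.shape
    ys = np.linspace(0, h - 1, h * scale)
    xs = np.linspace(0, w - 1, w * scale)
    y0 = np.floor(ys).astype(int)
    x0 = np.floor(xs).astype(int)
    y1 = np.minimum(y0 + 1, h - 1)
    x1 = np.minimum(x0 + 1, w - 1)
    wy = (ys - y0)[:, None, None]
    wx = (xs - x0)[None, :, None]
    top = x[y0][:, x0] * (1 - wx) + x[y0][:, x1] * wx
    bottom = x[y1][:, x0] * (1 - wx) + x[y1][:, x1] * wx
    return top * (1 - wy) + bottom * wy

def naive_decoder(feature_map, w_cls, scale):
    # Project every position to class scores, then upsample in a single step.
    return bilinear_upsample(feature_map @ w_cls, scale)

def pup_decoder(feature_map, step_weights, w_cls):
    # Alternate a (simplified) projection + ReLU with 2x upsampling.
    x = feature_map
    for w in step_weights:
        x = bilinear_upsample(np.maximum(x @ w, 0.0), 2)
    return x @ w_cls

def mla_decoder(layer_features, stream_weights, w_cls, scale):
    # One stream per selected encoder layer; all share the same grid size.
    streams = [f @ w for f, w in zip(layer_features, stream_weights)]
    # Assumed fusion: each stream adds the stream from the next deeper layer.
    for i in range(len(streams) - 2, -1, -1):
        streams[i] = streams[i] + streams[i + 1]
    fused = np.concatenate(streams, axis=-1)
    return bilinear_upsample(fused @ w_cls, scale)

rng = np.random.default_rng(0)
grid_h, grid_w, dim, num_classes = 2, 3, 8, 5   # toy sizes
feature_map = rng.standard_normal((grid_h, grid_w, dim))

naive_logits = naive_decoder(
    feature_map, rng.standard_normal((dim, num_classes)), scale=16
)
pup_logits = pup_decoder(
    feature_map,
    [rng.standard_normal((dim, dim)) for _ in range(4)],   # 2**4 = 16x in total
    rng.standard_normal((dim, num_classes)),
)
layer_features = [rng.standard_normal((grid_h, grid_w, dim)) for _ in range(4)]
mla_logits = mla_decoder(
    layer_features,
    [rng.standard_normal((dim, dim // 2)) for _ in range(4)],
    rng.standard_normal((4 * (dim // 2), num_classes)),
    scale=16,
)
label_map = naive_logits.argmax(axis=-1)   # per-pixel class prediction
print(naive_logits.shape, pup_logits.shape, mla_logits.shape, label_map.shape)
```

All three variants end at the same output shape. What changes is where the learned capacity sits: in one projection before a large resize, in a chain of small resizes, or in the fusion of features taken from several encoder depths.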

Experimental Results

The authors conduct extensive experiments on several benchmark datasets including ADE20K, Pascal Context, and Cityscapes. The results demonstrate that SETR establishes new state-of-the-art performance on ADE20K and Pascal Context, and competitive results on Cityscapes. Notably, the SETR model with MLA decoder achieves 50.28% mIoU on ADE20K with multi-scale inference, a substantial improvement over previous state-of-the-art results. These findings suggest that SETR's ability to model global context at each layer provides a significant performance advantage over traditional FCN-based models.
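The reported figures are mean intersection-over-union (mIoU) scores. As a reminder of what the metric measures, here is a minimal NumPy sketch that computes it from a confusion matrix. It is illustrative only: the function name is hypothetical, the tiny label maps are made up, and real benchmark evaluation typically accumulates the confusion matrix over the whole dataset and skips an ignore label, both of which are omitted here.

```python
import numpy as np

def mean_iou(pred, target, num_classes):
    """Mean of per-class IoU over classes that appear in pred or target."""
    pred = np.asarray(pred).ravel()
    target = np.asarray(target).ravel()
    # Confusion matrix: rows are ground-truth classes, columns are predictions.
    conf = np.bincount(
        target * num_classes + pred, minlength=num_classes ** 2
    ).reshape(num_classes, num_classes)
    intersection = np.diag(conf)
    union = conf.sum(axis=0) + conf.sum(axis=1) - intersection
    present = union > 0
    return float((intersection[present] / union[present]).mean())

# Made-up 2x3 label maps with three classes.
target = np.array([[0, 0, 1],
                   [1, 2, 2]])
pred = np.array([[0, 1, 1],
                 [1, 2, 0]])

score = mean_iou(pred, target, num_classes=3)
print(score)  # per-class IoU is 1/3, 2/3 and 1/2, so the mean is 0.5
```

Because every class contributes equally to the mean, mIoU penalizes a model that does well on frequent classes but poorly on rare ones, which matters on a dataset with many categories such as ADE20K.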

Implications and Future Directions

The work has both theoretical and practical implications. By applying a pure transformer to a dense prediction task, it narrows the gap between NLP and computer vision and suggests that self-attention can replace convolution in certain contexts. This opens the way to rethinking other vision tasks currently dominated by convolutional architectures.

Furthermore, the ability to model long-range dependencies without progressively reducing spatial resolution can inspire novel architectures beyond segmentation tasks. Future research could optimize transformer models for better computational efficiency, or investigate hybrid models that combine the strengths of convolutional and transformer-based approaches.

Overall, the findings underscore the transformative potential of sequence-to-sequence models like transformers in the field of computer vision, paving the way for continued innovation in learning complex visual representations.
