Emergent Mind

Vision Transformers for Dense Prediction

(2103.13413)
Published Mar 24, 2021 in cs.CV

Abstract

We introduce dense vision transformers, an architecture that leverages vision transformers in place of convolutional networks as a backbone for dense prediction tasks. We assemble tokens from various stages of the vision transformer into image-like representations at various resolutions and progressively combine them into full-resolution predictions using a convolutional decoder. The transformer backbone processes representations at a constant and relatively high resolution and has a global receptive field at every stage. These properties allow the dense vision transformer to provide finer-grained and more globally coherent predictions when compared to fully-convolutional networks. Our experiments show that this architecture yields substantial improvements on dense prediction tasks, especially when a large amount of training data is available. For monocular depth estimation, we observe an improvement of up to 28% in relative performance when compared to a state-of-the-art fully-convolutional network. When applied to semantic segmentation, dense vision transformers set a new state of the art on ADE20K with 49.02% mIoU. We further show that the architecture can be fine-tuned on smaller datasets such as NYUv2, KITTI, and Pascal Context where it also sets the new state of the art. Our models are available at https://github.com/intel-isl/DPT.

Overview

  • The Vision Transformers for Dense Prediction paper introduces a novel Dense Vision Transformer (DPT) architecture that departs from traditional CNN frameworks for dense prediction tasks.

  • DPT incorporates a transformer-based encoder and a convolutional decoder, maintaining high-resolution representations throughout the process, unlike conventional methods that downsample images.

  • The architecture results in finer and more globally coherent predictions due to the transformer's global receptive field, making it advantageous for dense prediction tasks.

  • In empirical tests, DPT achieves significant improvements over existing methods, setting new benchmarks on datasets such as ADE20K and performing well on smaller datasets.

  • This work suggests that DPT's transformer-based approach could set a new standard for visual comprehension tasks in deep learning, inviting further research and application.

Vision Transformers for Dense Prediction Tasks

Introduction to Dense Vision Transformers

The dense vision transformer (DPT) architecture marks a significant departure from the convolutional neural network (CNN) frameworks that have long dominated dense prediction tasks. Leveraging the vision transformer (ViT) as a backbone, DPT replaces the conventional convolutional encoder with a transformer encoder, followed by a convolutional decoder. This diverges from the canonical approach in which an encoder progressively downsamples the image to extract features at multiple scales, a strategy inherent to convolutional backbones. Instead, DPT has a global receptive field from the first stage and maintains a constant, relatively high-resolution representation throughout processing.
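
As a rough illustration of this design, the sketch below builds a toy transformer backbone in PyTorch that embeds the image into patch tokens once and then keeps the token grid at the same resolution through every stage. It is a minimal sketch under assumed sizes and module names (e.g. `MiniViTBackbone`, a 384x384 input, 16x16 patches), not the authors' implementation, and it omits details such as the readout (class) token.

```python
import torch
import torch.nn as nn

class MiniViTBackbone(nn.Module):
    """Toy transformer backbone: the token grid keeps one constant resolution at
    every stage. Sizes and names are illustrative assumptions, not DPT's code;
    the real ViT backbone also carries a readout (class) token, omitted here."""

    def __init__(self, img_size=384, patch_size=16, dim=256, depth=8, heads=8):
        super().__init__()
        self.grid = img_size // patch_size          # 24x24 token grid, never downsampled
        self.patch_embed = nn.Conv2d(3, dim, kernel_size=patch_size, stride=patch_size)
        self.pos_embed = nn.Parameter(torch.zeros(1, self.grid * self.grid, dim))
        self.blocks = nn.ModuleList(
            nn.TransformerEncoderLayer(d_model=dim, nhead=heads, batch_first=True)
            for _ in range(depth)
        )

    def forward(self, x):
        tokens = self.patch_embed(x).flatten(2).transpose(1, 2) + self.pos_embed
        taps = []                                   # intermediate features for the decoder
        for i, blk in enumerate(self.blocks):
            tokens = blk(tokens)
            if i in (1, 3, 5, 7):                   # tap a few stages, all at full token resolution
                taps.append(tokens)
        return taps

feats = MiniViTBackbone()(torch.randn(1, 3, 384, 384))
print([tuple(f.shape) for f in feats])              # every stage: (1, 576, 256)
```

Every tapped stage has the same 576-token spatial extent, which is what the decoder later reshapes back into image-like feature maps.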

Redefining Dense Predictions

DPT produces finer-grained and more globally coherent predictions than traditional fully convolutional networks (FCNs). This quality stems from the transformer's global receptive field, which is available at every processing stage. Whereas an FCN's receptive field starts small and only grows through successive layers, DPT begins with a global view of the image, a compelling advantage for dense prediction tasks.
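
The contrast can be made concrete with a toy experiment (illustrative PyTorch, not the paper's code): after a single self-attention layer, every output token is already influenced by every input token, while a single 3x3 convolution only propagates information locally.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
tokens = torch.randn(1, 576, 64)                  # a flattened 24x24 token grid
perturbed = tokens.clone()
perturbed[0, 575] += 10.0                         # change only the bottom-right token

# One self-attention layer: the change is visible at every output position.
attn = nn.MultiheadAttention(embed_dim=64, num_heads=4, batch_first=True)
out_a, _ = attn(tokens, tokens, tokens)
out_b, _ = attn(perturbed, perturbed, perturbed)
print("self-attention, top-left token shift:",
      (out_a[0, 0] - out_b[0, 0]).abs().max().item())        # non-zero

# One 3x3 convolution: the top-left position cannot see the far-away change.
conv = nn.Conv2d(64, 64, kernel_size=3, padding=1)
img_a = tokens.transpose(1, 2).reshape(1, 64, 24, 24)
img_b = perturbed.transpose(1, 2).reshape(1, 64, 24, 24)
print("3x3 conv, top-left pixel shift:",
      (conv(img_a)[0, :, 0, 0] - conv(img_b)[0, :, 0, 0]).abs().max().item())  # exactly zero
```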

Structural Overview

The DPT architecture adheres to an encoder-decoder blueprint. The encoder, built from transformer blocks, forgoes explicit downsampling after the initial image embedding and retains a uniform, high-resolution representation. Tokens drawn from several stages of the transformer encoder are then reassembled into image-like feature maps and progressively combined into higher-resolution predictions by the decoder's fusion blocks. This contrasts with common practice, in which repeated downsampling discards spatial detail, and thereby mitigates a key drawback of traditional convolutional approaches. DPT shows significant improvements across various benchmarks, making it well suited to tasks that benefit from an extensive and detailed contextual understanding of the visual scene.
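
A minimal sketch of this decoder path, assuming simplified `Reassemble` and `FusionBlock` modules (names, channel counts, and scales are illustrative, and readout-token handling is omitted), might look as follows:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Reassemble(nn.Module):
    """Turn a (B, N, C) token sequence back into an image-like map and rescale it.
    Simplified from the paper's description; readout-token handling is omitted."""
    def __init__(self, dim, out_ch, grid, scale):
        super().__init__()
        self.grid = grid
        self.project = nn.Conv2d(dim, out_ch, kernel_size=1)
        self.scale = scale                      # >1 upsamples, <1 downsamples the 1/16 token grid

    def forward(self, tokens):
        b, n, c = tokens.shape
        fmap = tokens.transpose(1, 2).reshape(b, c, self.grid, self.grid)
        return F.interpolate(self.project(fmap), scale_factor=self.scale,
                             mode="bilinear", align_corners=False)

class FusionBlock(nn.Module):
    """Merge the running decoder map with the next (shallower) reassembled map, then upsample x2."""
    def __init__(self, ch):
        super().__init__()
        self.refine = nn.Sequential(nn.Conv2d(ch, ch, 3, padding=1), nn.ReLU(inplace=True))

    def forward(self, x, skip=None):
        if skip is not None:
            x = x + skip
        return F.interpolate(self.refine(x), scale_factor=2,
                             mode="bilinear", align_corners=False)

# Tokens tapped from four transformer stages (all 24x24) are reassembled at four
# resolutions (deepest stage -> coarsest map) and fused bottom-up by the decoder.
grid, dim, ch = 24, 256, 128
stage_tokens = [torch.randn(1, grid * grid, dim) for _ in range(4)]
scales = [4, 2, 1, 0.5]                         # relative to the 1/16-resolution token grid
maps = [Reassemble(dim, ch, grid, s)(t) for t, s in zip(stage_tokens, scales)]

x = FusionBlock(ch)(maps[3])                    # start from the deepest, coarsest map
for skip in reversed(maps[:3]):
    x = FusionBlock(ch)(x, skip=skip)
print(tuple(x.shape))                           # (1, 128, 192, 192): half of a 384x384 input
```

In DPT itself, a small task-specific output head then produces the full-resolution depth or segmentation map; the sketch stops at the fused feature map.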

Empirical Advancements and Implications

In empirical testing, DPT demonstrates substantial gains, which are most pronounced when large amounts of training data are available. The gains are measured on tasks such as monocular depth estimation and semantic segmentation. For monocular depth estimation, DPT improves relative performance by up to 28% over a state-of-the-art fully convolutional network, and for semantic segmentation it sets a new state of the art on ADE20K with 49.02% mean Intersection over Union (mIoU). Moreover, when fine-tuned on smaller datasets such as NYUv2, KITTI, and Pascal Context, DPT continues to surpass existing results, suggesting that the architecture retains the benefits of the transformer backbone even at smaller data scales.
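
For reference, the mIoU figure quoted above is the per-class intersection over union averaged across classes; a minimal, generic computation (not tied to the ADE20K evaluation tooling) is sketched below with random label maps.

```python
import numpy as np

def mean_iou(pred, target, num_classes):
    """Mean Intersection over Union: per-class IoU averaged over the classes
    that appear in either the prediction or the ground truth."""
    ious = []
    for c in range(num_classes):
        inter = np.logical_and(pred == c, target == c).sum()
        union = np.logical_or(pred == c, target == c).sum()
        if union > 0:
            ious.append(inter / union)
    return float(np.mean(ious))

# Random label maps just to exercise the function; ADE20K uses 150 classes.
pred = np.random.randint(0, 150, size=(512, 512))
target = np.random.randint(0, 150, size=(512, 512))
print(f"mIoU: {mean_iou(pred, target, num_classes=150):.4f}")
```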

In closing, this work brings to the fore a transformative architecture, DPT, that capitalizes on the strengths of vision transformers and offers a compelling alternative to the convolution-dominated paradigm in dense prediction. The approach raises the bar for tasks that demand both fine detail and broad scene context, pointing toward further advances in visual understanding with deep learning models. The public release of the DPT models further encourages exploration and adaptation across a diverse range of tasks.
