Emergent Mind

Vision Transformers for Dense Prediction

(2103.13413)
Published Mar 24, 2021 in cs.CV

Abstract

We introduce dense vision transformers, an architecture that leverages vision transformers in place of convolutional networks as a backbone for dense prediction tasks. We assemble tokens from various stages of the vision transformer into image-like representations at various resolutions and progressively combine them into full-resolution predictions using a convolutional decoder. The transformer backbone processes representations at a constant and relatively high resolution and has a global receptive field at every stage. These properties allow the dense vision transformer to provide finer-grained and more globally coherent predictions when compared to fully-convolutional networks. Our experiments show that this architecture yields substantial improvements on dense prediction tasks, especially when a large amount of training data is available. For monocular depth estimation, we observe an improvement of up to 28% in relative performance when compared to a state-of-the-art fully-convolutional network. When applied to semantic segmentation, dense vision transformers set a new state of the art on ADE20K with 49.02% mIoU. We further show that the architecture can be fine-tuned on smaller datasets such as NYUv2, KITTI, and Pascal Context where it also sets the new state of the art. Our models are available at https://github.com/intel-isl/DPT.

Overview

  • The Vision Transformers for Dense Prediction paper introduces a novel Dense Vision Transformer (DPT) architecture that departs from traditional CNN frameworks for dense prediction tasks.

  • DPT incorporates a transformer-based encoder and a convolutional decoder, maintaining high-resolution representations throughout the process, unlike conventional methods that downsample images.

  • The architecture results in finer and more globally coherent predictions due to the transformer's global receptive field, making it advantageous for dense prediction tasks.

  • In empirical tests, DPT achieves significant improvements over existing methods, setting new benchmarks on datasets such as ADE20K and performing well on smaller datasets.

  • This work suggests that DPT's transformer-based approach could set a new standard for visual comprehension tasks in deep learning, inviting further research and application.

Vision Transformers for Dense Prediction Tasks

Introduction to Dense Vision Transformers

The dense vision transformer (DPT) architecture marks a significant departure from the convolutional neural network (CNN) frameworks that have long dominated dense prediction tasks. Leveraging the vision transformer (ViT) as a backbone, DPT replaces the conventional convolutional encoder with a transformer encoder, followed by a convolutional decoder. This diverges from the canonical approach in which an encoder progressively downsamples the image to extract features at multiple scales, a strategy inherent to convolutional backbones. Instead, DPT has a global receptive field from the first stage and maintains a constant, relatively high-resolution representation throughout processing.
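
As a rough illustration of this design, the sketch below builds a toy transformer backbone in PyTorch that embeds the image into patch tokens once and then keeps the token grid at the same resolution through every stage. It is a minimal sketch under assumed sizes and module names (e.g. `MiniViTBackbone`, a 384x384 input, 16x16 patches), not the authors' implementation, and it omits details such as the readout (class) token.

```python
import torch
import torch.nn as nn

class MiniViTBackbone(nn.Module):
    """Toy transformer backbone: the token grid keeps one constant resolution at
    every stage. Sizes and names are illustrative assumptions, not DPT's code;
    the real ViT backbone also carries a readout (class) token, omitted here."""

    def __init__(self, img_size=384, patch_size=16, dim=256, depth=8, heads=8):
        super().__init__()
        self.grid = img_size // patch_size          # 24x24 token grid, never downsampled
        self.patch_embed = nn.Conv2d(3, dim, kernel_size=patch_size, stride=patch_size)
        self.pos_embed = nn.Parameter(torch.zeros(1, self.grid * self.grid, dim))
        self.blocks = nn.ModuleList(
            nn.TransformerEncoderLayer(d_model=dim, nhead=heads, batch_first=True)
            for _ in range(depth)
        )

    def forward(self, x):
        tokens = self.patch_embed(x).flatten(2).transpose(1, 2) + self.pos_embed
        taps = []                                   # intermediate features for the decoder
        for i, blk in enumerate(self.blocks):
            tokens = blk(tokens)
            if i in (1, 3, 5, 7):                   # tap a few stages, all at full token resolution
                taps.append(tokens)
        return taps

feats = MiniViTBackbone()(torch.randn(1, 3, 384, 384))
print([tuple(f.shape) for f in feats])              # every stage: (1, 576, 256)
```

Every tapped stage has the same 576-token spatial extent, which is what the decoder later reshapes back into image-like feature maps.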

Redefining Dense Predictions

DPT produces finer-grained and more globally coherent predictions than traditional fully convolutional networks (FCNs). This quality stems from the transformer's global receptive field, which is available at every processing stage. Whereas an FCN's receptive field starts small and only grows through successive layers, DPT begins with a global view of the image, a compelling advantage for dense prediction tasks.
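
The contrast can be made concrete with a toy experiment (illustrative PyTorch, not the paper's code): after a single self-attention layer, every output token is already influenced by every input token, while a single 3x3 convolution only propagates information locally.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
tokens = torch.randn(1, 576, 64)                  # a flattened 24x24 token grid
perturbed = tokens.clone()
perturbed[0, 575] += 10.0                         # change only the bottom-right token

# One self-attention layer: the change is visible at every output position.
attn = nn.MultiheadAttention(embed_dim=64, num_heads=4, batch_first=True)
out_a, _ = attn(tokens, tokens, tokens)
out_b, _ = attn(perturbed, perturbed, perturbed)
print("self-attention, top-left token shift:",
      (out_a[0, 0] - out_b[0, 0]).abs().max().item())        # non-zero

# One 3x3 convolution: the top-left position cannot see the far-away change.
conv = nn.Conv2d(64, 64, kernel_size=3, padding=1)
img_a = tokens.transpose(1, 2).reshape(1, 64, 24, 24)
img_b = perturbed.transpose(1, 2).reshape(1, 64, 24, 24)
print("3x3 conv, top-left pixel shift:",
      (conv(img_a)[0, :, 0, 0] - conv(img_b)[0, :, 0, 0]).abs().max().item())  # exactly zero
```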

Structural Overview

The DPT architecture adheres to an encoder-decoder blueprint. The encoder, built from transformer blocks, forgoes explicit downsampling after the initial image embedding and retains a uniform, high-resolution representation. Tokens drawn from several stages of the transformer encoder are then reassembled into image-like feature maps and progressively combined into higher-resolution predictions by the decoder's fusion blocks. This contrasts with common practice, in which repeated downsampling discards spatial detail, and thereby mitigates a key drawback of traditional convolutional approaches. DPT shows significant improvements across various benchmarks, making it well suited to tasks that benefit from an extensive and detailed contextual understanding of the visual scene.
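
A minimal sketch of this decoder path, assuming simplified `Reassemble` and `FusionBlock` modules (names, channel counts, and scales are illustrative, and readout-token handling is omitted), might look as follows:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Reassemble(nn.Module):
    """Turn a (B, N, C) token sequence back into an image-like map and rescale it.
    Simplified from the paper's description; readout-token handling is omitted."""
    def __init__(self, dim, out_ch, grid, scale):
        super().__init__()
        self.grid = grid
        self.project = nn.Conv2d(dim, out_ch, kernel_size=1)
        self.scale = scale                      # >1 upsamples, <1 downsamples the 1/16 token grid

    def forward(self, tokens):
        b, n, c = tokens.shape
        fmap = tokens.transpose(1, 2).reshape(b, c, self.grid, self.grid)
        return F.interpolate(self.project(fmap), scale_factor=self.scale,
                             mode="bilinear", align_corners=False)

class FusionBlock(nn.Module):
    """Merge the running decoder map with the next (shallower) reassembled map, then upsample x2."""
    def __init__(self, ch):
        super().__init__()
        self.refine = nn.Sequential(nn.Conv2d(ch, ch, 3, padding=1), nn.ReLU(inplace=True))

    def forward(self, x, skip=None):
        if skip is not None:
            x = x + skip
        return F.interpolate(self.refine(x), scale_factor=2,
                             mode="bilinear", align_corners=False)

# Tokens tapped from four transformer stages (all 24x24) are reassembled at four
# resolutions (deepest stage -> coarsest map) and fused bottom-up by the decoder.
grid, dim, ch = 24, 256, 128
stage_tokens = [torch.randn(1, grid * grid, dim) for _ in range(4)]
scales = [4, 2, 1, 0.5]                         # relative to the 1/16-resolution token grid
maps = [Reassemble(dim, ch, grid, s)(t) for t, s in zip(stage_tokens, scales)]

x = FusionBlock(ch)(maps[3])                    # start from the deepest, coarsest map
for skip in reversed(maps[:3]):
    x = FusionBlock(ch)(x, skip=skip)
print(tuple(x.shape))                           # (1, 128, 192, 192): half of a 384x384 input
```

In DPT itself, a small task-specific output head then produces the full-resolution depth or segmentation map; the sketch stops at the fused feature map.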

Empirical Advancements and Implications

In empirical testing, DPT demonstrates substantial gains, which are most pronounced when large amounts of training data are available. The gains are measured on tasks such as monocular depth estimation and semantic segmentation. For monocular depth estimation, DPT improves relative performance by up to 28% over a state-of-the-art fully convolutional network, and for semantic segmentation it sets a new state of the art on ADE20K with 49.02% mean Intersection over Union (mIoU). Moreover, when fine-tuned on smaller datasets such as NYUv2, KITTI, and Pascal Context, DPT continues to surpass existing results, suggesting that the architecture retains the benefits of the transformer backbone even at smaller data scales.
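
For reference, the mIoU figure quoted above is the per-class intersection over union averaged across classes; a minimal, generic computation (not tied to the ADE20K evaluation tooling) is sketched below with random label maps.

```python
import numpy as np

def mean_iou(pred, target, num_classes):
    """Mean Intersection over Union: per-class IoU averaged over the classes
    that appear in either the prediction or the ground truth."""
    ious = []
    for c in range(num_classes):
        inter = np.logical_and(pred == c, target == c).sum()
        union = np.logical_or(pred == c, target == c).sum()
        if union > 0:
            ious.append(inter / union)
    return float(np.mean(ious))

# Random label maps just to exercise the function; ADE20K uses 150 classes.
pred = np.random.randint(0, 150, size=(512, 512))
target = np.random.randint(0, 150, size=(512, 512))
print(f"mIoU: {mean_iou(pred, target, num_classes=150):.4f}")
```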

In closing, this work brings to the fore a transformative architecture, DPT, that capitalizes on the strengths of vision transformers and offers a compelling alternative to the convolution-dominated paradigm in dense prediction. The approach raises the bar for tasks that demand both fine detail and broad scene context, pointing toward further advances in visual understanding with deep learning models. The public release of the DPT models further encourages exploration and adaptation across a diverse range of tasks.
