PVT v2: Improved Baselines with Pyramid Vision Transformer

Published 25 Jun 2021 in cs.CV | (2106.13797v7)

Abstract: Transformer recently has presented encouraging progress in computer vision. In this work, we present new baselines by improving the original Pyramid Vision Transformer (PVT v1) by adding three designs, including (1) linear complexity attention layer, (2) overlapping patch embedding, and (3) convolutional feed-forward network. With these modifications, PVT v2 reduces the computational complexity of PVT v1 to linear and achieves significant improvements on fundamental vision tasks such as classification, detection, and segmentation. Notably, the proposed PVT v2 achieves comparable or better performances than recent works such as Swin Transformer. We hope this work will facilitate state-of-the-art Transformer researches in computer vision. Code is available at https://github.com/whai362/PVT.


Authors (9)

Citations (1,338)


Summary

The paper introduces three key enhancements—linear spatial reduction attention, overlapping patch embedding, and a convolutional feed-forward network—to improve computational efficiency and feature extraction.
Experimental results show that PVT v2 achieves 83.8% top-1 accuracy on ImageNet and notable improvements in object detection performance on the COCO benchmark.
The architecture effectively reduces complexity while enhancing local continuity and adaptability across classification, detection, and segmentation tasks.

Pyramid Vision Transformer v2: Enhancements and Performance Benchmarking

The paper "PVT v2: Improved Baselines with Pyramid Vision Transformer" advances the use of Transformer architectures for computer vision tasks, a field traditionally dominated by Convolutional Neural Networks (CNNs). The authors present an improved version of the Pyramid Vision Transformer (PVT v1) built on three specific enhancements: a linear complexity attention layer, overlapping patch embedding, and a convolutional feed-forward network. These upgrades reduce computational complexity and also improve performance on fundamental vision tasks such as classification, detection, and segmentation.

The work follows the ongoing trend of applying Transformer architectures to vision problems in place of CNN-centric approaches. Vision Transformer (ViT) first demonstrated the effectiveness of pure Transformer models for image classification. PVT v1 then extended this success to dense prediction tasks such as object detection and segmentation, surpassing traditional CNN-based methods. The paper situates PVT v2 among recent developments including Swin Transformer, CoaT, LeViT, and Twins, each of which introduces innovations aimed at refining vision Transformers.

Methodology

Limitations in PVT v1

The authors enumerate three primary limitations associated with PVT v1:

High computational complexity when processing high-resolution images.
Loss of local continuity due to non-overlapping patch treatment.
Inflexibility due to fixed-size positional encoding.
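
The first limitation can be made concrete with a little arithmetic. The sketch below is not taken from the paper: it simply counts query-key pairs for a single attention layer, assuming a patch stride of 4 and, for the reduced case, a per-side spatial reduction of 8 applied to the keys. Both values and the function name are illustrative assumptions.

```python
def attention_pairs(height, width, patch_stride=4, reduction=1):
    """Count query-key pairs for one attention layer.

    patch_stride: downsampling factor of the patch embedding (assumed value).
    reduction: per-side spatial reduction applied to the keys
               (1 means full attention; 8 is an assumed example ratio).
    """
    h, w = height // patch_stride, width // patch_stride
    queries = h * w
    keys = (h // reduction) * (w // reduction)
    return queries * keys


# Compare full attention with a fixed reduction ratio as the image grows.
for side in (224, 448, 896):
    full = attention_pairs(side, side)
    reduced = attention_pairs(side, side, reduction=8)
    print(f"{side}x{side}: full={full:,} reduced={reduced:,}")
```

Doubling the image side multiplies the pair count by 16 in both cases, so a fixed reduction ratio lowers the constant but does not remove the quadratic growth with the number of tokens.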

Enhancements Introduced in PVT v2

To address these issues, the authors propose:

Linear Spatial Reduction Attention (LSRA): LSRA uses average pooling to reduce the keys and values to a fixed spatial size before the attention operation, so the attention cost grows linearly rather than quadratically with the number of input tokens. This keeps high-resolution inputs efficient to process.
Overlapping Patch Embedding (OPE): This technique employs overlapping windows for patch embedding, preserving local continuity and spatial relationships in the image. The embedding is implemented as a convolution with zero padding, which keeps the output resolution unchanged while aggregating features from neighboring patches.
Convolutional Feed-Forward Network (CFFN): A 3x3 depth-wise convolution with zero padding is inserted between the first fully-connected layer and the GELU activation in the feed-forward network. Because the zero-padded convolution carries positional information, the fixed-size positional encoding can be removed, allowing the model to handle variable-resolution inputs flexibly.
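
The three enhancements can be sanity-checked with plain Python. This is a minimal sketch under stated assumptions, not the authors' implementation: the pooling size of 7, the patch stride of 4, and the kernel 7 / padding 3 embedding are illustrative settings, the function names are hypothetical, and the all-ones kernel stands in for learned depth-wise weights.

```python
def conv_output_size(size, kernel, stride, padding):
    """Standard convolution output length along one axis."""
    return (size + 2 * padding - kernel) // stride + 1


def linear_sra_pairs(height, width, patch_stride=4, pool_size=7):
    """Query-key pairs when keys are average-pooled to pool_size x pool_size.

    patch_stride and pool_size are assumed example values.
    """
    queries = (height // patch_stride) * (width // patch_stride)
    keys = pool_size * pool_size  # fixed, independent of image size
    return queries * keys


def depthwise_conv3x3_zero_pad(channel):
    """Apply an all-ones 3x3 kernel to one channel with zero padding of 1.

    Out-of-bounds neighbors contribute zero, which is what zero padding does.
    """
    rows, cols = len(channel), len(channel[0])
    out = [[0.0] * cols for _ in range(rows)]
    for r in range(rows):
        for c in range(cols):
            total = 0.0
            for dr in (-1, 0, 1):
                for dc in (-1, 0, 1):
                    rr, cc = r + dr, c + dc
                    if 0 <= rr < rows and 0 <= cc < cols:
                        total += channel[rr][cc]
            out[r][c] = total
    return out


# Overlapping (kernel 7, stride 4, padding 3) vs. non-overlapping (kernel 4, stride 4).
print(conv_output_size(224, kernel=7, stride=4, padding=3))
print(conv_output_size(224, kernel=4, stride=4, padding=0))

# Pooled attention cost at two resolutions.
print(linear_sra_pairs(224, 224), linear_sra_pairs(448, 448))

# A constant feature map: corner, edge, and interior responses differ.
ones = [[1.0] * 4 for _ in range(4)]
response = depthwise_conv3x3_zero_pad(ones)
print(response[0][0], response[0][1], response[1][1])
```

With these settings, the overlapping embedding yields the same 56-token side as the non-overlapping one for a 224-pixel input, and the pooled attention cost quadruples rather than growing 16-fold when the image side doubles. The zero-padded convolution returns 4 at the corners, 6 on the edges, and 9 in the interior of a constant map, so border positions become distinguishable without an explicit positional encoding.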

Comparative Analysis and Numerical Results

The paper provides robust experimental analysis across several benchmarks:

Image Classification: On ImageNet-1K, PVT v2 models consistently outperform PVT v1 and other Transformer variants. For instance, PVT v2-B5 achieves 83.8% top-1 accuracy, surpassing Swin-B and Twins-SVT-L by 0.5 percentage points while using fewer parameters and GFLOPs.
Object Detection: On the COCO dataset, PVT v2 is integrated into several prominent detectors, including RetinaNet, Mask R-CNN, and ATSS, and raises the Average Precision (AP) of each. For instance, PVT v2-B4 achieves 47.5 $\text{AP}^{\text{b}}$ with Mask R-CNN, a 4.6-point improvement over the equivalent PVT v1-based configuration.
Semantic Segmentation: On the ADE20K benchmark, PVT v2-B5 records an mIoU of 48.7%, outperforming PVT v1 and competing backbones. The overlapping patch embedding and convolutional feed-forward network improve feature extraction and the modeling of spatial relationships.

Implications and Future Directions

PVT v2's advancements position it as a competitive and efficient backbone for various vision tasks. Its success suggests that hybrid architectures combining the strengths of CNNs and Transformers merit further exploration. Continued optimization of attention mechanisms and embedding methods may also lead to next-generation vision models that scale across diverse applications, from real-time image processing to complex scene understanding.

In summary, the Pyramid Vision Transformer v2 introduces substantial enhancements in attention layer complexity, patch embedding strategy, and feed-forward processing, resulting in significant performance improvements across multiple computer vision tasks. These insights establish PVT v2 as a robust baseline for future Transformer-based research in vision domains.
