MViTv2: Improved Multiscale Vision Transformers for Classification and Detection (2112.01526v2)

Published 2 Dec 2021 in cs.CV

Abstract: In this paper, we study Multiscale Vision Transformers (MViTv2) as a unified architecture for image and video classification, as well as object detection. We present an improved version of MViT that incorporates decomposed relative positional embeddings and residual pooling connections. We instantiate this architecture in five sizes and evaluate it for ImageNet classification, COCO detection and Kinetics video recognition where it outperforms prior work. We further compare MViTv2s' pooling attention to window attention mechanisms where it outperforms the latter in accuracy/compute. Without bells-and-whistles, MViTv2 has state-of-the-art performance in 3 domains: 88.8% accuracy on ImageNet classification, 58.7 boxAP on COCO object detection as well as 86.1% on Kinetics-400 video classification. Code and models are available at https://github.com/facebookresearch/mvit.

Citations (589)

Summary

  • The paper introduces MViTv2 with decomposed relative positional embeddings and residual pooling connections to enhance spatial modeling and information flow.
  • Empirical results demonstrate high performance, including 88.8% accuracy on ImageNet and 58.7 box AP on COCO, showcasing its effectiveness.
  • The unified Transformer architecture simplifies model selection, offering scalable solutions for image classification, object detection, and video recognition.

Multiscale Vision Transformers for Image Recognition and Detection: A Technical Summary

The paper "Multiscale Vision Transformers for Image Classification and Object Detection" investigates the efficacy of Multiscale Vision Transformers (MViT) as a unified architecture addressing tasks across image and video classification, as well as object detection. This summary provides an expert overview of the proposed improvements, empirical results, and the potential implications for future AI research.

Core Contributions

The paper introduces MViTv2, an improved version of the Multiscale Vision Transformer architecture, which advances upon the original MViT with two primary enhancements: decomposed relative positional embeddings and residual pooling connections.

  1. Decomposed Relative Positional Embeddings: Instead of absolute positional embeddings, MViTv2 injects relative position information directly into the attention computation, restoring shift-invariance. Decomposing the relative embedding along the height and width (and, for video, temporal) axes keeps the number of learned embeddings linear in the token-grid resolution rather than quadratic, improving spatial modeling at negligible added cost (see the code sketch after this list).
  2. Residual Pooling Connections: The pooled query tensor is added back to the output of the pooling-attention block, compensating for the resolution reduction introduced by pooling strides and improving information flow and training.
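
To make these two mechanisms concrete, the following is a minimal, single-head PyTorch sketch of pooling attention that combines a decomposed relative positional bias with a residual pooling connection. It is an illustration written for this summary, not the authors' released code: the names (PoolingAttention, rel_pos_h, q_stride, ...) and the choice of strided depthwise convolutions as the pooling operators are assumptions.

```python
# Minimal, single-head sketch of MViTv2-style pooling attention combining
# (i) decomposed relative positional embeddings and (ii) a residual pooling
# connection. Names and the use of strided depthwise convolutions as the
# pooling operators are illustrative assumptions, not the authors' release.
import torch
import torch.nn as nn


def get_rel_pos(q_size, k_size, rel_pos):
    """Look up per-axis relative positional embeddings, rescaling coordinates
    so query and key grids of different (pooled) sizes align."""
    q_coords = torch.arange(q_size)[:, None] * max(k_size / q_size, 1.0)
    k_coords = torch.arange(k_size)[None, :] * max(q_size / k_size, 1.0)
    rel = (q_coords - k_coords) + (k_size - 1) * max(q_size / k_size, 1.0)
    return rel_pos[rel.long()]  # (q_size, k_size, dim)


class PoolingAttention(nn.Module):
    def __init__(self, dim, in_size=(14, 14), q_stride=1, kv_stride=2):
        super().__init__()
        self.scale = dim ** -0.5
        self.qkv = nn.Linear(dim, 3 * dim)
        self.proj = nn.Linear(dim, dim)
        # Strided depthwise convolutions play the role of the pooling
        # operators applied to Q, K and V before attention.
        self.pool_q = nn.Conv2d(dim, dim, 3, q_stride, 1, groups=dim)
        self.pool_k = nn.Conv2d(dim, dim, 3, kv_stride, 1, groups=dim)
        self.pool_v = nn.Conv2d(dim, dim, 3, kv_stride, 1, groups=dim)
        H, W = in_size
        q_hw = (H // q_stride, W // q_stride)
        kv_hw = (H // kv_stride, W // kv_stride)
        # Decomposed relative positions: one small table per axis (height,
        # width) instead of a joint (H*W) x (H*W) table, so the number of
        # learned embeddings grows linearly with resolution, not quadratically.
        self.rel_pos_h = nn.Parameter(torch.zeros(2 * max(q_hw[0], kv_hw[0]) - 1, dim))
        self.rel_pos_w = nn.Parameter(torch.zeros(2 * max(q_hw[1], kv_hw[1]) - 1, dim))

    def forward(self, x, hw):
        B, N, C = x.shape  # x: (batch, H*W tokens, dim); hw must match in_size
        H, W = hw

        def pool(t, conv):
            t = conv(t.transpose(1, 2).reshape(B, C, H, W))
            return t.flatten(2).transpose(1, 2), t.shape[-2:]

        q, k, v = self.qkv(x).chunk(3, dim=-1)
        q, (qh, qw) = pool(q, self.pool_q)
        k, _ = pool(k, self.pool_k)
        v, (kh, kw) = pool(v, self.pool_v)

        attn = (q * self.scale) @ k.transpose(-2, -1)  # (B, qh*qw, kh*kw)

        # Decomposed relative positional bias: q . R_h(dh) + q . R_w(dw),
        # added to the attention logits before the softmax.
        Rh = get_rel_pos(qh, kh, self.rel_pos_h)
        Rw = get_rel_pos(qw, kw, self.rel_pos_w)
        r_q = q.reshape(B, qh, qw, C)
        rel_h = torch.einsum("bhwc,hkc->bhwk", r_q, Rh)
        rel_w = torch.einsum("bhwc,wkc->bhwk", r_q, Rw)
        attn = (attn.view(B, qh, qw, kh, kw)
                + rel_h[:, :, :, :, None]
                + rel_w[:, :, :, None, :]).view(B, qh * qw, kh * kw)

        out = attn.softmax(dim=-1) @ v
        # Residual pooling connection: add the pooled query tensor back to the
        # attention output, preserving information flow despite pooling strides.
        out = out + q
        return self.proj(out), (qh, qw)
```

For a 14x14 feature map with q_stride=1 and kv_stride=2, attention is computed between 196 queries and 49 pooled keys/values, which is where pooling attention saves compute relative to full attention; the residual term out + q keeps a direct path from the pooled queries to the output.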

Empirical Evaluation

The researchers conducted extensive experiments across ImageNet-1K, COCO, and Kinetics datasets, evaluating the MViTv2 variants on image classification, object detection, instance segmentation, and video classification. The results demonstrate notable gains over existing architectures:

  • ImageNet Classification: MViTv2 achieves a top-1 accuracy of 88.8% when pre-trained on ImageNet-21K, outperforming prior models at lower computational cost (a loading sketch follows this list).
  • COCO Object Detection: The architecture reaches 58.7 box AP, demonstrating its effectiveness as a backbone for detection tasks.
  • Kinetics Video Classification: The architecture achieves 86.1% accuracy on Kinetics-400, a state-of-the-art result for video recognition at the time of publication.
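
As a usage illustration for the classification results above, the sketch below loads a pretrained MViTv2 through the third-party timm library and runs a single forward pass. The model name "mvitv2_base" and the availability of pretrained weights in the installed timm version are assumptions, not something specified by the paper; the official implementation is at https://github.com/facebookresearch/mvit.

```python
# Hypothetical usage sketch: classify one image with a timm re-implementation
# of MViTv2. The model name "mvitv2_base" and weight availability are
# assumptions about the installed timm version, not the authors' release.
import torch
import timm
from PIL import Image

model = timm.create_model("mvitv2_base", pretrained=True)
model.eval()

# Build the preprocessing pipeline that matches the pretrained weights.
cfg = timm.data.resolve_data_config({}, model=model)
transform = timm.data.create_transform(**cfg)

img = Image.open("example.jpg").convert("RGB")   # any local image
with torch.no_grad():
    logits = model(transform(img).unsqueeze(0))  # (1, 1000) ImageNet logits
print(logits.softmax(-1).topk(5))
```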

Strong Numerical Results

The MViTv2 models show consistent improvements across tasks. For instance, MViTv2-L reaches 86.0% accuracy on ImageNet-1K with a standard training protocol, without reliance on additional data or distillation techniques. In object detection, MViTv2-L achieves 55.8 box AP using improved training strategies, highlighting its performance and scalability.

Implications and Future Directions

The paper positions MViTv2 as a versatile and scalable backbone for diverse visual recognition tasks. The unified framework and architectural enhancements make it a strong candidate for future research in both academia and industry, potentially simplifying model selection across tasks.

Future avenues for exploration include further scalability of MViTv2, both for smaller models targeting mobile applications and larger models leveraging more extensive datasets. Additionally, the paper opens new possibilities for integrating advanced self-attention mechanisms in computer vision tasks, likely influencing the design of future architectures in the domain.

Conclusion

The introduction of MViTv2 demonstrates a significant advancement in the development of unified Transformer architectures for vision tasks. By combining multiscale processing with novel architectural components, this paper offers a compelling direction for achieving state-of-the-art performance across multiple challenging benchmarks, laying the groundwork for future exploration in this rapidly evolving field.
