MViTv2: Improved Multiscale Vision Transformers for Classification and Detection (2112.01526v2)

Published 2 Dec 2021 in cs.CV

Abstract: In this paper, we study Multiscale Vision Transformers (MViTv2) as a unified architecture for image and video classification, as well as object detection. We present an improved version of MViT that incorporates decomposed relative positional embeddings and residual pooling connections. We instantiate this architecture in five sizes and evaluate it for ImageNet classification, COCO detection and Kinetics video recognition where it outperforms prior work. We further compare MViTv2s' pooling attention to window attention mechanisms where it outperforms the latter in accuracy/compute. Without bells-and-whistles, MViTv2 has state-of-the-art performance in 3 domains: 88.8% accuracy on ImageNet classification, 58.7 boxAP on COCO object detection as well as 86.1% on Kinetics-400 video classification. Code and models are available at https://github.com/facebookresearch/mvit.

Citations (589)

Summary

  • The paper introduces MViTv2 with decomposed relative positional embeddings and residual pooling connections to enhance spatial modeling and information flow.
  • Empirical results demonstrate high performance, including 88.8% accuracy on ImageNet and 58.7 box AP on COCO, showcasing its effectiveness.
  • The unified Transformer architecture simplifies model selection, offering scalable solutions for image classification, object detection, and video recognition.

Multiscale Vision Transformers for Image Recognition and Detection: A Technical Summary

The paper "Multiscale Vision Transformers for Image Classification and Object Detection" investigates the efficacy of Multiscale Vision Transformers (MViT) as a unified architecture addressing tasks across image and video classification, as well as object detection. This summary provides an expert overview of the proposed improvements, empirical results, and the potential implications for future AI research.

Core Contributions

The paper introduces MViTv2, an improved version of the Multiscale Vision Transformer architecture, which advances upon the original MViT with two primary enhancements: decomposed relative positional embeddings and residual pooling connections.

  1. Decomposed Relative Positional Embeddings: Instead of absolute positional embeddings, MViTv2 injects relative position information directly into the attention computation, restoring shift-invariance. Decomposing the relative embedding along the height and width (and, for video, temporal) axes keeps the number of learned embeddings linear in the token-grid resolution rather than quadratic, improving spatial modeling at negligible added cost (see the code sketch after this list).
  2. Residual Pooling Connections: The pooled query tensor is added back to the output of the pooling-attention block, compensating for the resolution reduction introduced by pooling strides and improving information flow and training.
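
To make these two mechanisms concrete, the following is a minimal, single-head PyTorch sketch of pooling attention that combines a decomposed relative positional bias with a residual pooling connection. It is an illustration written for this summary, not the authors' released code: the names (PoolingAttention, rel_pos_h, q_stride, ...) and the choice of strided depthwise convolutions as the pooling operators are assumptions.

```python
# Minimal, single-head sketch of MViTv2-style pooling attention combining
# (i) decomposed relative positional embeddings and (ii) a residual pooling
# connection. Names and the use of strided depthwise convolutions as the
# pooling operators are illustrative assumptions, not the authors' release.
import torch
import torch.nn as nn


def get_rel_pos(q_size, k_size, rel_pos):
    """Look up per-axis relative positional embeddings, rescaling coordinates
    so query and key grids of different (pooled) sizes align."""
    q_coords = torch.arange(q_size)[:, None] * max(k_size / q_size, 1.0)
    k_coords = torch.arange(k_size)[None, :] * max(q_size / k_size, 1.0)
    rel = (q_coords - k_coords) + (k_size - 1) * max(q_size / k_size, 1.0)
    return rel_pos[rel.long()]  # (q_size, k_size, dim)


class PoolingAttention(nn.Module):
    def __init__(self, dim, in_size=(14, 14), q_stride=1, kv_stride=2):
        super().__init__()
        self.scale = dim ** -0.5
        self.qkv = nn.Linear(dim, 3 * dim)
        self.proj = nn.Linear(dim, dim)
        # Strided depthwise convolutions play the role of the pooling
        # operators applied to Q, K and V before attention.
        self.pool_q = nn.Conv2d(dim, dim, 3, q_stride, 1, groups=dim)
        self.pool_k = nn.Conv2d(dim, dim, 3, kv_stride, 1, groups=dim)
        self.pool_v = nn.Conv2d(dim, dim, 3, kv_stride, 1, groups=dim)
        H, W = in_size
        q_hw = (H // q_stride, W // q_stride)
        kv_hw = (H // kv_stride, W // kv_stride)
        # Decomposed relative positions: one small table per axis (height,
        # width) instead of a joint (H*W) x (H*W) table, so the number of
        # learned embeddings grows linearly with resolution, not quadratically.
        self.rel_pos_h = nn.Parameter(torch.zeros(2 * max(q_hw[0], kv_hw[0]) - 1, dim))
        self.rel_pos_w = nn.Parameter(torch.zeros(2 * max(q_hw[1], kv_hw[1]) - 1, dim))

    def forward(self, x, hw):
        B, N, C = x.shape  # x: (batch, H*W tokens, dim); hw must match in_size
        H, W = hw

        def pool(t, conv):
            t = conv(t.transpose(1, 2).reshape(B, C, H, W))
            return t.flatten(2).transpose(1, 2), t.shape[-2:]

        q, k, v = self.qkv(x).chunk(3, dim=-1)
        q, (qh, qw) = pool(q, self.pool_q)
        k, _ = pool(k, self.pool_k)
        v, (kh, kw) = pool(v, self.pool_v)

        attn = (q * self.scale) @ k.transpose(-2, -1)  # (B, qh*qw, kh*kw)

        # Decomposed relative positional bias: q . R_h(dh) + q . R_w(dw),
        # added to the attention logits before the softmax.
        Rh = get_rel_pos(qh, kh, self.rel_pos_h)
        Rw = get_rel_pos(qw, kw, self.rel_pos_w)
        r_q = q.reshape(B, qh, qw, C)
        rel_h = torch.einsum("bhwc,hkc->bhwk", r_q, Rh)
        rel_w = torch.einsum("bhwc,wkc->bhwk", r_q, Rw)
        attn = (attn.view(B, qh, qw, kh, kw)
                + rel_h[:, :, :, :, None]
                + rel_w[:, :, :, None, :]).view(B, qh * qw, kh * kw)

        out = attn.softmax(dim=-1) @ v
        # Residual pooling connection: add the pooled query tensor back to the
        # attention output, preserving information flow despite pooling strides.
        out = out + q
        return self.proj(out), (qh, qw)
```

For a 14x14 feature map with q_stride=1 and kv_stride=2, attention is computed between 196 queries and 49 pooled keys/values, which is where pooling attention saves compute relative to full attention; the residual term out + q keeps a direct path from the pooled queries to the output.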

Empirical Evaluation

The researchers conducted extensive experiments across ImageNet-1K, COCO, and Kinetics datasets, evaluating the MViTv2 variants on image classification, object detection, instance segmentation, and video classification. The results demonstrate notable gains over existing architectures:

  • ImageNet Classification: MViTv2 achieves a top-1 accuracy of 88.8% when pre-trained on ImageNet-21K, outperforming prior models at lower computational cost (a loading sketch follows this list).
  • COCO Object Detection: The architecture reaches 58.7 box AP, demonstrating its effectiveness as a backbone for detection tasks.
  • Kinetics Video Classification: The architecture achieves 86.1% accuracy on Kinetics-400, a state-of-the-art result for video recognition at the time of publication.
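
As a usage illustration for the classification results above, the sketch below loads a pretrained MViTv2 through the third-party timm library and runs a single forward pass. The model name "mvitv2_base" and the availability of pretrained weights in the installed timm version are assumptions, not something specified by the paper; the official implementation is at https://github.com/facebookresearch/mvit.

```python
# Hypothetical usage sketch: classify one image with a timm re-implementation
# of MViTv2. The model name "mvitv2_base" and weight availability are
# assumptions about the installed timm version, not the authors' release.
import torch
import timm
from PIL import Image

model = timm.create_model("mvitv2_base", pretrained=True)
model.eval()

# Build the preprocessing pipeline that matches the pretrained weights.
cfg = timm.data.resolve_data_config({}, model=model)
transform = timm.data.create_transform(**cfg)

img = Image.open("example.jpg").convert("RGB")   # any local image
with torch.no_grad():
    logits = model(transform(img).unsqueeze(0))  # (1, 1000) ImageNet logits
print(logits.softmax(-1).topk(5))
```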

Strong Numerical Results

The MViTv2 models show consistent improvements across tasks. For instance, MViTv2-L reaches 86.0% accuracy on ImageNet-1K with a standard training protocol, without reliance on additional data or distillation techniques. In object detection, MViTv2-L achieves 55.8 box AP using improved training strategies, highlighting its performance and scalability.

Implications and Future Directions

The paper positions MViTv2 as a versatile and scalable backbone for diverse visual recognition tasks. The unified framework and architectural enhancements make it a strong candidate for future research in both academia and industry, potentially simplifying model selection across tasks.

Future avenues for exploration include further scalability of MViTv2, both for smaller models targeting mobile applications and larger models leveraging more extensive datasets. Additionally, the paper opens new possibilities for integrating advanced self-attention mechanisms in computer vision tasks, likely influencing the design of future architectures in the domain.

Conclusion

The introduction of MViTv2 demonstrates a significant advancement in the development of unified Transformer architectures for vision tasks. By combining multiscale processing with novel architectural components, this paper offers a compelling direction for achieving state-of-the-art performance across multiple challenging benchmarks, laying the groundwork for future exploration in this rapidly evolving field.
