Unleashing Vanilla Vision Transformer with Masked Image Modeling for Object Detection

Published 6 Apr 2022 in cs.CV | (2204.02964v2)

Abstract: We present an approach to efficiently and effectively adapt a masked image modeling (MIM) pre-trained vanilla Vision Transformer (ViT) for object detection, which is based on our two novel observations: (i) A MIM pre-trained vanilla ViT encoder can work surprisingly well in the challenging object-level recognition scenario even with randomly sampled partial observations, e.g., only 25% $\sim$ 50% of the input embeddings. (ii) In order to construct multi-scale representations for object detection from single-scale ViT, a randomly initialized compact convolutional stem supplants the pre-trained large kernel patchify stem, and its intermediate features can naturally serve as the higher resolution inputs of a feature pyramid network without further upsampling or other manipulations. While the pre-trained ViT is only regarded as the 3$^{rd}$-stage of our detector's backbone instead of the whole feature extractor. This results in a ConvNet-ViT hybrid feature extractor. The proposed detector, named MIMDet, enables a MIM pre-trained vanilla ViT to outperform hierarchical Swin Transformer by 2.5 box AP and 2.6 mask AP on COCO, and achieves better results compared with the previous best adapted vanilla ViT detector using a more modest fine-tuning recipe while converging 2.8$\times$ faster. Code and pre-trained models are available at https://github.com/hustvl/MIMDet.

Authors (6)

Citations (52)

Summary

The paper demonstrates that a MIM pre-trained ViT can effectively detect objects using only 25%-50% of input embeddings, reducing computational cost.
The authors propose a hybrid ConvNet-ViT architecture that replaces the patchify stem with a compact convolutional stem to generate multi-scale features.
The MIMDet approach converges faster and improves performance on COCO, outperforming the previous best adapted vanilla ViT detector.

Overview of "Unleashing Vanilla Vision Transformer with Masked Image Modeling for Object Detection"

This paper presents MIMDet, an approach for adapting masked image modeling (MIM) pre-trained vanilla Vision Transformers (ViTs) to object detection. The authors build on two observations. First, a MIM pre-trained ViT encoder performs surprisingly well on object-level recognition even when it sees only a randomly sampled 25% to 50% of the input embeddings. Second, multi-scale representations can be built from the single-scale ViT by replacing the pre-trained large-kernel patchify stem with a randomly initialized compact convolutional stem, whose intermediate features serve as the higher-resolution inputs of a feature pyramid network.
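The first observation can be illustrated with a minimal sketch of random token sampling. This is not the authors' implementation: the function names, the 1024×1024 input size, and the stride-16 token grid are assumptions chosen for illustration. The cost helper only reflects the general fact that pairwise attention scores grow with the square of the token count.

```python
import random


def sample_visible_tokens(num_tokens, keep_ratio, seed=None):
    """Return sorted indices of a uniformly random subset of token positions.

    Hypothetical helper: keep_ratio in the 0.25 to 0.5 range mirrors the
    "25% to 50% of the input embeddings" setting described in the abstract.
    """
    if not 0.0 < keep_ratio <= 1.0:
        raise ValueError("keep_ratio must be in (0, 1]")
    rng = random.Random(seed)
    num_keep = max(1, round(num_tokens * keep_ratio))
    return sorted(rng.sample(range(num_tokens), num_keep))


def relative_attention_cost(keep_ratio):
    """Fraction of pairwise attention scores left after sampling.

    Attention compares every kept token with every other kept token,
    so this term scales quadratically with the number of tokens.
    """
    return keep_ratio ** 2


# Assumed setup: a 1024x1024 image embedded at stride 16 gives a 64x64 grid.
grid = 1024 // 16
num_tokens = grid * grid
visible = sample_visible_tokens(num_tokens, keep_ratio=0.25, seed=0)

print(num_tokens, len(visible))       # 4096 1024
print(relative_attention_cost(0.25))  # 0.0625
```

Under these assumptions, feeding the encoder a quarter of the tokens leaves roughly one sixteenth of the pairwise attention work, which is one way to read the reduced computational cost mentioned in the summary.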

Key Contributions

Partial Input Utilization: The study reveals that a MIM pre-trained ViT encoder can handle object-level recognition when it sees only a randomly sampled fraction of the input embeddings, specifically 25% to 50%. This reduces computational requirements while maintaining competitive accuracy.
Hybrid Architecture Design: By introducing a convolutional stem to replace the patchify stem and utilizing the ViT encoder as part of a hierarchical feature extractor, the authors develop a ConvNet-ViT hybrid. This architecture is more efficient in creating multi-scale pyramid representations essential for object detection.
Efficient Training and Convergence: MIMDet converges 2.8 times faster than the previous best adapted vanilla ViT detector while using a more modest fine-tuning recipe.
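The hybrid backbone described above can be sketched as a table of feature-map sizes per stage. This is a shape calculation only, not the authors' code. The abstract states that the convolutional stem supplies the higher-resolution pyramid inputs and that the ViT acts as the 3rd stage. The strides of 4, 8, 16, and 32, the source of the 4th stage, and the function name are assumptions that follow common feature pyramid conventions.

```python
def hybrid_pyramid_shapes(height, width, strides=(4, 8, 16, 32)):
    """List the spatial size of each backbone stage for a given input size.

    Assumed layout (hypothetical, for illustration):
      stages 1-2: intermediate features of the convolutional stem
      stage 3:    the pre-trained ViT encoder
      stage 4:    a downsampled copy of the ViT output
    """
    sources = (
        "conv stem",
        "conv stem",
        "ViT encoder",
        "downsampled ViT output (assumed)",
    )
    levels = []
    for stage, (stride, source) in enumerate(zip(strides, sources), start=1):
        levels.append(
            {
                "stage": stage,
                "stride": stride,
                "source": source,
                "size": (height // stride, width // stride),
            }
        )
    return levels


levels = hybrid_pyramid_shapes(1024, 1024)
for level in levels:
    print(level["stage"], level["stride"], level["size"], level["source"])
```

The point of the layout is that the two higher-resolution levels come directly from the stem, so the single-scale ViT output does not need to be upsampled to fill them.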

Numerical Results

On COCO, MIMDet enables a MIM pre-trained vanilla ViT to outperform the hierarchical Swin Transformer by 2.5 box AP and 2.6 mask AP. It also achieves better results than the previous best adapted vanilla ViT detector while using a more modest fine-tuning recipe and converging 2.8 times faster.

Implications and Future Directions

Scalability and Adaptation: The research suggests that developing specific architectures for object detection might become less crucial as more general representations are leveraged effectively. This might lead to broader implications for model design across visual understanding tasks.
Potential for Other Domains: The work paves the way for similar methods in domains that can benefit from transformer models, including video analysis, multi-task learning, and multi-modal integration.
Exploration of MIM in ViTs: The use of MAE (Masked Autoencoder) pre-trained decoders points to further adaptation and optimization of MIM pre-trained components as a research direction.

Conclusion

This paper contributes to the ongoing work on adapting Vision Transformers to practical tasks such as object detection. By feeding a MIM pre-trained ViT partial inputs and pairing it with a compact convolutional stem, the authors demonstrate improved performance and efficiency. The findings encourage reusing pre-trained representations through straightforward architectural adjustments rather than task-specific backbone designs.
