An Extendable, Efficient and Effective Transformer-based Object Detector

Published 17 Apr 2022 in cs.CV, cs.AI, and cs.LG | (2204.07962v1)

Abstract: Transformers have been widely used in numerous vision problems, especially for visual recognition and detection. Detection transformers are the first fully end-to-end learning systems for object detection, while vision transformers are the first fully transformer-based architecture for image classification. In this paper, we integrate Vision and Detection Transformers (ViDT) to construct an effective and efficient object detector. ViDT introduces a reconfigured attention module to extend the recent Swin Transformer to be a standalone object detector, followed by a computationally efficient transformer decoder that exploits multi-scale features and auxiliary techniques essential to boost the detection performance without much increase in computational load. In addition, we extend it to ViDT+ to support joint-task learning for object detection and instance segmentation. Specifically, we attach an efficient multi-scale feature fusion layer and utilize two more auxiliary training losses, IoU-aware loss and token labeling loss. Extensive evaluation results on the Microsoft COCO benchmark dataset demonstrate that ViDT obtains the best AP and latency trade-off among existing fully transformer-based object detectors, and its extended ViDT+ achieves 53.2 AP owing to its high scalability for large models. The source code and trained models are available at https://github.com/naver-ai/vidt.

Summary

The paper introduces the ViDT model that uses a reconfigured attention module to lower computational complexity while maintaining high accuracy.
It proposes an encoder-free neck with a lightweight transformer decoder to efficiently fuse multi-scale features.
ViDT+ extends the model for joint object detection and instance segmentation, achieving a noteworthy 53.2 AP on the COCO dataset.

This paper integrates vision transformers and detection transformers into a single object detection architecture, the Vision and Detection Transformers (ViDT) model. The design aims to improve the efficiency and scalability of transformer-based object detectors while maintaining high accuracy.

Main Contributions

Reconfigured Attention Module (RAM): The paper introduces the RAM to extend the Swin Transformer so that it can function as a standalone object detector. The RAM decomposes a single global attention over all tokens into patch-related and detection-related components, which avoids the cost of full global attention. As a result, the attention cost grows linearly with the number of patch tokens, whereas the global attention used in YOLOS grows quadratically.
Encoder-Free Neck Structure: The paper removes the transformer encoder from the detector's neck, which reduces computational overhead. The Swin Transformer serves as the body of the detector, and ViDT keeps only a lightweight transformer decoder as its neck, which fuses multi-scale features efficiently.
Extension to ViDT+: The researchers extend ViDT into ViDT+, which supports joint-task learning for object detection and instance segmentation. This extension adds the Efficient Pyramid Feature Fusion (EPFF) module and the Unified Query Representation (UQR) module to enable multi-task learning.
Auxiliary Losses for Improved Training: ViDT+ adds two auxiliary training losses, an IoU-aware loss and a token labeling loss, which improve its detection and segmentation results.
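
The cost argument behind the reconfigured attention module can be illustrated with a small NumPy sketch. This is not the authors' implementation: the function names, the single-head attention without learned projections, and the use of contiguous 1D chunks as "windows" (Swin uses 2D shifted windows) are all simplifying assumptions made for illustration. It only shows why restricting patch-to-patch attention to local windows, while letting a small set of detection tokens attend to themselves and to the patches, gives a cost that is linear in the number of patch tokens.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attend(q, k, v):
    """Plain scaled dot-product attention (single head, no projections)."""
    scores = q @ k.T / np.sqrt(q.shape[-1])
    return softmax(scores) @ v

def decomposed_attention(patch, det, window):
    """Hypothetical sketch of a decomposed attention step.

    patch: (P, d) patch tokens; det: (D, d) detection tokens.
    window: patch tokens per local window (assumed to divide P).
    """
    P, d = patch.shape
    # Patch x patch: attention is restricted to non-overlapping windows,
    # so it needs P * window scores instead of P * P.
    windows = patch.reshape(P // window, window, d)
    patch_out = np.concatenate([attend(w, w, w) for w in windows], axis=0)
    # Detection tokens attend to themselves and to all patch tokens,
    # which needs D * (D + P) scores.
    kv = np.concatenate([det, patch], axis=0)
    det_out = attend(det, kv, kv)
    return patch_out, det_out

def score_counts(P, D, window):
    """Query-key score counts for global vs. decomposed attention."""
    global_cost = (P + D) ** 2
    decomposed_cost = P * window + D * (D + P)
    return global_cost, decomposed_cost
```

With 64 patch tokens, 4 detection tokens, and windows of 16 tokens, `score_counts(64, 4, 16)` gives 4624 scores for global attention and 1296 for the decomposed form. Doubling the number of patch tokens roughly doubles the decomposed count but roughly quadruples the global one.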

Numerical Results and Implications

Evaluations on the Microsoft COCO benchmark show that ViDT obtains the best trade-off between average precision (AP) and latency among existing fully transformer-based object detectors. The extended ViDT+ reaches 53.2 AP, which the authors attribute to its scalability to large models such as those built on a Swin-base backbone.

Practical and Theoretical Implications

Practically, ViDT shows that a transformer-based object detector can scale to large backbones while reducing computational overhead and preserving accuracy. On the conceptual side, its split of attention into patch-related and detection-related components offers a template that future work in computer vision could reuse when designing more efficient architectures.

Future Directions

Future research could explore combining ViDT with other transformer backbones to improve performance further. Other open directions include testing ViDT's adaptability to additional dense prediction tasks and adapting the model for real-time applications that require high-speed processing without sacrificing accuracy.

In summary, the research presented in this paper provides compelling evidence of the effectiveness of transformer-based methods in object detection and instance segmentation tasks, offering insights into both architectural improvements and practical deployment strategies in AI-enhanced visual computing.