Abstract

In this report, we present RT-DETRv2, an improved Real-Time DEtection TRansformer (RT-DETR). RT-DETRv2 builds upon the previous state-of-the-art real-time detector, RT-DETR, and introduces a set of bag-of-freebies for flexibility and practicality, along with an optimized training strategy for enhanced performance. To improve flexibility, we suggest setting a distinct number of sampling points for features at different scales in the deformable attention module, enabling selective multi-scale feature extraction by the decoder. To enhance practicality, we propose an optional discrete sampling operator to replace the grid_sample operator that is specific to RT-DETR compared to YOLOs, thereby removing the deployment constraints typically associated with DETRs. For the training strategy, we propose dynamic data augmentation and scale-adaptive hyperparameter customization to improve performance without loss of speed. Source code and pre-trained models will be available at https://github.com/lyuwenyu/RT-DETR.

Overview

  • RT-DETRv2 enhances real-time object detection transformers by incorporating a bag-of-freebies and optimized training strategies to improve performance without sacrificing speed.

  • Key innovations include selective multi-scale feature extraction, deployment-friendly sampling operators, dynamic data augmentation, and scale-adaptive hyperparameters.

  • Extensive evaluations on the COCO dataset demonstrate significant improvements in detection metrics over RT-DETR, with ablation studies confirming the efficacy of the proposed enhancements.

RT-DETRv2: Enhanced Real-Time DEtection TRansformer with Bag-of-Freebies

The paper "RT-DETRv2: Improved Baseline with Bag-of-Freebies for Real-Time Detection Transformer" by Wenyu Lv et al. introduces RT-DETRv2, an evolution of the original RT-DETR model. This work focuses on enhancing real-time object detection transformers by leveraging a set of design improvements referred to as bag-of-freebies alongside optimized training strategies. The improvements aim to increase the flexibility, practicality, and performance of the detection transformer without compromising speed, which is critical for real-time applications.

Key Contributions

The primary contributions of RT-DETRv2 are as follows:

Selective Multi-Scale Feature Extraction:

  • RT-DETRv2 introduces a customization where distinct numbers of sampling points are set for features at different scales within the deformable attention module. This selective extraction mechanism enhances the adaptability and performance of the decoder by aligning feature extraction with the intrinsic properties of multi-scale features.
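
To make the mechanism concrete, the sketch below shows a simplified, single-head version of deformable sampling in which each feature level gets its own number of sampling points. The class name, the `points_per_level` values, and the omission of per-level offset scaling are illustrative assumptions, not the authors' exact module.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PerScaleDeformableSampling(nn.Module):
    """Simplified single-head deformable sampling where each feature
    level is assigned its own number of sampling points.
    (Hypothetical sketch, not the authors' exact implementation.)"""
    def __init__(self, embed_dim=256, points_per_level=(4, 4, 2)):
        super().__init__()
        self.points_per_level = points_per_level
        total = sum(points_per_level)
        # One (x, y) offset and one attention weight per sampling point.
        self.offset_head = nn.Linear(embed_dim, total * 2)
        self.weight_head = nn.Linear(embed_dim, total)

    def forward(self, queries, ref_points, feature_levels):
        # queries: (B, Q, C); ref_points: (B, Q, 2), normalized to [0, 1]
        B, Q, _ = queries.shape
        offsets = self.offset_head(queries).view(B, Q, -1, 2)
        weights = self.weight_head(queries).softmax(-1)
        out, start = 0.0, 0
        for n_pts, feat in zip(self.points_per_level, feature_levels):
            # feat: (B, C, H, W); sample n_pts locations from this level.
            loc = ref_points[:, :, None, :] + offsets[:, :, start:start + n_pts, :]
            grid = 2.0 * loc - 1.0  # map [0, 1] -> [-1, 1] for grid_sample
            sampled = F.grid_sample(feat, grid, align_corners=False)  # (B, C, Q, n_pts)
            w = weights[:, :, start:start + n_pts]
            out = out + (sampled * w[:, None]).sum(-1).permute(0, 2, 1)
            start += n_pts
        return out  # (B, Q, C)
```

Varying `points_per_level` lets the decoder match sampling density to the characteristics of each scale rather than treating all levels uniformly.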

Deployment-Friendly Sampling:

  • An optional discrete sampling operator is proposed to replace the grid_sample operator. This alteration eliminates the deployment constraints typically associated with detection transformers such as RT-DETR, making RT-DETRv2 more versatile and easier to integrate into various production environments.
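
As a rough illustration of the idea, the function below replaces bilinear `grid_sample` with rounding to the nearest pixel followed by a plain `gather`, which maps to simple indexing operations that most inference runtimes support. The shapes and the round-and-clamp scheme are assumptions for this sketch; the paper's operator may differ in detail, and since rounding is non-differentiable, one would typically swap it in at deployment time rather than train through it.

```python
import torch

def discrete_sample(feat, loc):
    """Sample features by rounding locations to the nearest pixel and
    gathering, instead of bilinear interpolation via grid_sample.
    feat: (B, C, H, W); loc: (B, Q, P, 2), normalized to [0, 1].
    (Illustrative sketch; not necessarily the paper's exact operator.)"""
    B, C, H, W = feat.shape
    # Convert normalized coordinates to integer pixel indices within bounds.
    x = (loc[..., 0] * W).round().long().clamp(0, W - 1)  # (B, Q, P)
    y = (loc[..., 1] * H).round().long().clamp(0, H - 1)
    flat = feat.flatten(2)                    # (B, C, H*W)
    idx = (y * W + x).flatten(1)              # (B, Q*P)
    idx = idx.unsqueeze(1).expand(-1, C, -1)  # (B, C, Q*P)
    out = flat.gather(2, idx)                 # (B, C, Q*P)
    return out.view(B, C, loc.shape[1], loc.shape[2])  # (B, C, Q, P)
```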

Dynamic Data Augmentation:

  • A dynamic data augmentation strategy is employed to maintain a high level of detection performance throughout the training process. Stronger augmentations are used during the early stages of training to improve generalizability, while the level of augmentation is reduced in later stages to better align the model with the target domain.
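
A minimal way to express such a schedule is sketched below: a strong pipeline for most of training and a mild one for the final stretch. The 90% switch point and the particular transforms are placeholder assumptions (and a real detection pipeline would need box-aware transforms); only the scheduling idea is the point here.

```python
import torchvision.transforms as T

def build_transforms(epoch, total_epochs, strong_frac=0.9):
    """Epoch-dependent augmentation: aggressive for most of training,
    mild near the end so the model adapts to the target distribution.
    (Threshold and transforms are illustrative, not the paper's recipe.)"""
    if epoch < int(strong_frac * total_epochs):
        return T.Compose([
            T.RandomResizedCrop(640, scale=(0.3, 1.0)),  # aggressive cropping
            T.ColorJitter(brightness=0.4, contrast=0.4, saturation=0.4),
            T.RandomHorizontalFlip(),
        ])
    # Late training: keep only mild augmentation.
    return T.Compose([
        T.Resize((640, 640)),
        T.RandomHorizontalFlip(),
    ])
```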

Scale-Adaptive Hyperparameters:

  • To optimize training for detectors of varying sizes, RT-DETRv2 introduces scale-adaptive hyperparameter customization. Learning rates are adjusted according to the size and pretrained feature quality of the backbone: smaller backbones, whose pretrained features are weaker, are trained with higher backbone learning rates (a minimal sketch follows).
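
The idea can be expressed with optimizer parameter groups, as below. The `"backbone"` name prefix, the base learning rate, and the multiplier are placeholder assumptions; in practice the multiplier would be tuned per backbone scale.

```python
import torch

def make_optimizer(model, base_lr=1e-4, backbone_lr_mult=0.5):
    """AdamW with a separate learning rate for backbone parameters.
    A smaller backbone (e.g., ResNet18) could use a larger multiplier
    than a bigger one (e.g., ResNet101); values here are placeholders."""
    backbone, rest = [], []
    for name, p in model.named_parameters():
        if not p.requires_grad:
            continue
        (backbone if name.startswith("backbone") else rest).append(p)
    return torch.optim.AdamW(
        [
            {"params": backbone, "lr": base_lr * backbone_lr_mult},
            {"params": rest, "lr": base_lr},
        ],
        weight_decay=1e-4,
    )
```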

Experimental Evaluation

Implementation Details

RT-DETRv2 retains the basic framework of RT-DETR, primarily modifying the deformable attention module. The implementation utilizes ResNet backbones pretrained on ImageNet, with training conducted using the AdamW optimizer and exponential moving average (EMA) for stabilization.
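
For readers unfamiliar with the EMA component, a generic weight-averaging wrapper of the kind commonly used in detector training is sketched below; the decay value is a common default, not necessarily the paper's setting.

```python
import copy
import torch

class ModelEMA:
    """Minimal exponential moving average of model weights, kept as a
    shadow copy and typically used for evaluation checkpoints.
    (Generic sketch; decay is a placeholder, not the paper's value.)"""
    def __init__(self, model, decay=0.9999):
        self.ema = copy.deepcopy(model).eval()
        self.decay = decay
        for p in self.ema.parameters():
            p.requires_grad_(False)

    @torch.no_grad()
    def update(self, model):
        for e, m in zip(self.ema.state_dict().values(),
                        model.state_dict().values()):
            if e.dtype.is_floating_point:
                e.mul_(self.decay).add_(m, alpha=1.0 - self.decay)
            else:
                e.copy_(m)  # integer buffers (e.g., batch counters)
```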

Results

The model's performance was evaluated on the COCO dataset, with consistent improvements observed across metrics. For example, RT-DETRv2-S achieves an AP (Average Precision) of 47.9 and an AP50 (AP at IoU 0.50) of 64.9, improving over the corresponding RT-DETR model, which recorded an AP of 46.5 and an AP50 of 63.8. This pattern of incremental gains holds across model scales, as shown in the paper's detailed comparison table.

Ablation Studies

The ablation studies underscore the efficacy of the proposed improvements:

  • Sampling Points: Reducing the number of sampling points caused only minimal performance degradation, suggesting that lighter sampling configurations remain viable in practical deployment scenarios.
  • Discrete Sampling: Switching from grid_sample to discrete sampling produced only a minor drop in AP50, a small trade-off accepted in exchange for removing deployment constraints while maintaining competitive performance.

Theoretical and Practical Implications

The enhancements proposed in RT-DETRv2 have both immediate and long-term implications. The improved flexibility and practicality make RT-DETRv2 an attractive choice for industrial and real-time applications where deployment constraints and speed are critical. The innovative training strategies contribute to the broader understanding of how dynamic augmentation and hyperparameter customization can be leveraged to optimize model performance across different scales.

Furthermore, the selective multi-scale feature extraction and discrete sampling approaches provide new avenues for exploration in future research on detection transformers. These techniques could be generalized and applied to other transformer-based models in various domains beyond object detection.

Conclusion

RT-DETRv2 marks an incremental yet significant advancement in the DETR family by integrating a series of strategic improvements designed to enhance flexibility, practicality, and performance metrics. The optimized training strategies and deployment-friendly modifications ensure that RT-DETRv2 can meet the requirements of diverse real-time applications while maintaining and, in some cases, surpassing the performance benchmarks set by its predecessor. Future research could further refine these concepts, exploring their applicability and efficacy in broader AI contexts.
