
detrex: Benchmarking Detection Transformers (2306.07265v2)

Published 12 Jun 2023 in cs.CV

Abstract: The DEtection TRansformer (DETR) algorithm has received considerable attention in the research community and is gradually emerging as a mainstream approach for object detection and other perception tasks. However, the current field lacks a unified and comprehensive benchmark specifically tailored for DETR-based models. To address this issue, we develop a unified, highly modular, and lightweight codebase called detrex, which supports a majority of the mainstream DETR-based instance recognition algorithms, covering various fundamental tasks, including object detection, segmentation, and pose estimation. We conduct extensive experiments under detrex and perform a comprehensive benchmark for DETR-based models. Moreover, we enhance the performance of detection transformers through the refinement of training hyper-parameters, providing strong baselines for supported algorithms. We hope that detrex could offer research communities a standardized and unified platform to evaluate and compare different DETR-based models while fostering a deeper understanding and driving advancements in DETR-based instance recognition. Our code is available at https://github.com/IDEA-Research/detrex. The project is currently being actively developed. We encourage the community to use detrex codebase for further development and contributions.

Authors (16)
  1. Tianhe Ren
  2. Shilong Liu
  3. Feng Li
  4. Hao Zhang
  5. Ailing Zeng
  6. Jie Yang
  7. Xingyu Liao
  8. Ding Jia
  9. Hongyang Li
  10. He Cao
  11. Jianan Wang
  12. Zhaoyang Zeng
  13. Xianbiao Qi
  14. Yuhui Yuan
  15. Jianwei Yang
  16. Lei Zhang

Summary

  • The paper introduces detrex as a unified, modular framework that standardizes evaluation for DETR-based algorithms.
  • It benchmarks a wide range of DETR models and refines their training hyper-parameters, yielding performance gains of up to 1.1 AP.
  • The platform’s flexibility facilitates reproducible comparisons, advancing research in object detection, segmentation, and pose estimation.

An Overview of detrex: Benchmarking Detection Transformers

The paper "detrex: Benchmarking Detection Transformers" addresses a pronounced gap in computer vision research around DEtection TRansformer (DETR) models. Despite DETR's growing prominence in object detection and other perception tasks, the field has lacked a comprehensive, unified benchmark tailored specifically to these models. To close this gap, the authors introduce detrex, a modular, lightweight codebase for DETR-based algorithms.

Key Contributions

Unified and Modular Codebase

detrex is engineered to support an extensive range of DETR-based instance recognition algorithms. It covers foundational tasks such as object detection, segmentation, and pose estimation. The authors emphasize the platform's flexibility, which allows for easy adjustment of configurations and model structures. This modularity facilitates the integration and evaluation of various models under consistent conditions, thus ensuring reproducible and fair comparisons.
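The config-driven modularity described above can be illustrated with a small sketch. Note that this is a hypothetical illustration, not the actual detrex API (detrex builds on detectron2's lazy-config system); the names `ModelConfig` and `build_detector` are invented for this example.

```python
from dataclasses import dataclass, replace

# Hypothetical sketch of config-driven model assembly: swapping a component
# is a one-line config change rather than a code change.

@dataclass(frozen=True)
class ModelConfig:
    backbone: str = "resnet50"
    num_queries: int = 300
    lr: float = 1e-4

def build_detector(cfg: ModelConfig) -> dict:
    """Assemble a (toy) model description from a config."""
    return {"backbone": cfg.backbone,
            "num_queries": cfg.num_queries,
            "lr": cfg.lr}

base = ModelConfig()
# One field changes; everything else stays fixed, enabling fair comparisons:
swin_variant = replace(base, backbone="swin_tiny")

print(build_detector(swin_variant)["backbone"])  # swin_tiny
```

Because every model variant is derived from the same base configuration, differences in benchmark results can be attributed to the component that was swapped rather than to incidental setup drift.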

Comprehensive Benchmarking

The authors conducted thorough benchmarking of DETR-based models within detrex, presenting empirical evaluations across critical aspects such as detection accuracy, training and inference efficiency, and the influence of different backbones and architectural components. This comprehensive evaluation provides robust baselines for future research.
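Inference-efficiency measurements of the kind reported in such a benchmark typically follow a simple warm-up-then-time pattern. The harness below is a generic sketch (the model is a stand-in callable, not a real detector), showing the shape of the measurement rather than detrex's actual tooling.

```python
import time

def measure_throughput(model, inputs, warmup=2):
    """Time forward passes and return images per second.

    Warm-up iterations are excluded so one-time costs (caching,
    lazy initialization) do not distort the measurement.
    """
    for x in inputs[:warmup]:
        model(x)
    start = time.perf_counter()
    for x in inputs:
        model(x)
    elapsed = time.perf_counter() - start
    return len(inputs) / elapsed

# Stand-in "model": a trivial function in place of a detector forward pass.
dummy_model = lambda x: x * 2
fps = measure_throughput(dummy_model, list(range(100)))
print(f"{fps:.0f} images/sec")
```

For GPU models, a real harness would also synchronize the device before reading the clock, since kernel launches are asynchronous.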

Performance Optimization

By refining training hyper-parameters, the paper reports meaningful performance improvements on supported algorithms, ranging from 0.2 AP to 1.1 AP. Notably, the benchmark also finds that applying Non-Maximum Suppression (NMS) to DETR variants can yield additional gains, even though DETR-style detectors are nominally end-to-end and NMS-free by design.
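For reference, the post-processing step in question is standard greedy NMS: keep the highest-scoring box, suppress any remaining box that overlaps it beyond an IoU threshold, and repeat. The compact implementation below is a textbook version for illustration, not detrex's own code.

```python
def iou(a, b):
    """Intersection-over-union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter) if inter > 0 else 0.0

def nms(boxes, scores, iou_thresh=0.5):
    """Greedy NMS: returns indices of kept boxes, highest score first."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    while order:
        i = order.pop(0)
        keep.append(i)
        # Drop every remaining box that overlaps the kept box too much.
        order = [j for j in order if iou(boxes[i], boxes[j]) < iou_thresh]
    return keep

boxes = [(0, 0, 10, 10), (1, 1, 10, 10), (20, 20, 30, 30)]
scores = [0.9, 0.8, 0.7]
print(nms(boxes, scores))  # [0, 2] -- the overlapping lower-scored box is suppressed
```

That such a filter still helps DETR variants suggests their one-to-one matching does not fully eliminate duplicate predictions in practice.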

Detailed Evaluation

The research systematically evaluates diverse DETR-based models, demonstrating detrex’s efficacy in boosting and reproducing results from various state-of-the-art architectures. The benchmarking features both recent CNN-based backbones and vision transformer models, indicating the platform's applicability across different model types.
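Benchmarking both CNN and vision-transformer backbones under one codebase hinges on a uniform backbone interface, so the detector head never needs to know which family produced its features. The sketch below illustrates that pattern with placeholder classes; none of this is detrex code.

```python
from typing import Protocol

class Backbone(Protocol):
    """Any backbone just has to produce features from an image."""
    def forward(self, image): ...

class ToyCNNBackbone:
    def forward(self, image):
        return {"features": [v * 0.5 for v in image], "kind": "cnn"}

class ToyViTBackbone:
    def forward(self, image):
        return {"features": [v + 1.0 for v in image], "kind": "vit"}

def run_detector(backbone: Backbone, image):
    # The head consumes features identically, whichever backbone produced them.
    feats = backbone.forward(image)
    return feats["kind"]

print(run_detector(ToyCNNBackbone(), [1.0, 2.0]))  # cnn
print(run_detector(ToyViTBackbone(), [1.0, 2.0]))  # vit
```

This separation is what lets a benchmark hold the detection head and training recipe fixed while varying only the backbone.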

Implications for Research and Practice

For researchers, detrex offers a standardized framework that aids in the systematic evaluation and comparison of DETR-based algorithms. It serves as a tool to dissect and understand how different components and configurations influence the performance of detection transformers.

Practically, the platform facilitates the application of DETR-based models in real-world scenarios by providing optimized, reproducible baselines. Industry practitioners can leverage detrex to streamline the deployment of object detection systems across varied tasks like segmentation and pose estimation.

Future Directions

The active development of detrex, combined with its open-source accessibility, suggests potential for ongoing enhancements and contributions from the community. Future development may involve integrating more algorithms and expanding the scope of supported tasks, thus fostering progress in the field of detection transformers.

Conclusion

detrex sets out to fill the critical need for a unified benchmarking platform in the field of DETR-based models. Through its modular design and comprehensive evaluations, it offers valuable insights for the research community, thereby supporting the advancement of detection transformers in both theoretical exploration and practical application. The codebase's modularity and extensibility suggest promising avenues for future developments in AI, specifically in the domain of visual perception tasks.
