detrex: Benchmarking Detection Transformers (2306.07265v2)
Abstract: The DEtection TRansformer (DETR) algorithm has received considerable attention in the research community and is gradually emerging as a mainstream approach for object detection and other perception tasks. However, the field currently lacks a unified and comprehensive benchmark specifically tailored for DETR-based models. To address this issue, we develop detrex, a unified, highly modular, and lightweight codebase that supports a majority of the mainstream DETR-based instance recognition algorithms, covering fundamental tasks including object detection, segmentation, and pose estimation. We conduct extensive experiments with detrex and provide a comprehensive benchmark of DETR-based models. Moreover, we enhance the performance of detection transformers by refining training hyper-parameters, providing strong baselines for the supported algorithms. We hope that detrex can offer the research community a standardized and unified platform to evaluate and compare different DETR-based models while fostering a deeper understanding of, and driving advancements in, DETR-based instance recognition. Our code is available at https://github.com/IDEA-Research/detrex. The project is under active development, and we encourage the community to use the detrex codebase for further development and contributions.
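To make the "unified and modular" claim concrete, the sketch below shows how a supported baseline might be loaded and its training hyper-parameters overridden, assuming detrex follows Detectron2-style LazyConfig configuration (Detectron2's `LazyConfig.load` and `instantiate` are real APIs; the config path and the specific field names used here are illustrative assumptions, not verified against the repository).

```python
# Minimal sketch: load a detrex-style LazyConfig, refine hyper-parameters, build the model.
# Assumptions: detrex configs follow Detectron2's LazyConfig conventions; the config path
# and field names (optimizer.lr, train.max_iter) are illustrative and may differ in the
# actual repository layout.
from detectron2.config import LazyConfig, instantiate

# Hypothetical path to a supported baseline config (e.g., a DINO ResNet-50 12-epoch recipe).
cfg = LazyConfig.load("projects/dino/configs/dino_r50_4scale_12ep.py")

# Example hyper-parameter refinements of the kind benchmarked in the paper.
cfg.optimizer.lr = 1e-4      # base learning rate
cfg.train.max_iter = 90_000  # roughly 12 epochs on COCO at batch size 16

# Instantiate the DETR-variant model described by the config.
model = instantiate(cfg.model)
print(type(model).__name__)
```

In practice, training would typically be launched through the repository's training scripts rather than by instantiating the model directly; the sketch only illustrates the config-driven, modular design the abstract describes.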