detrex: Benchmarking Detection Transformers (2306.07265v2)
Abstract: The DEtection TRansformer (DETR) algorithm has received considerable attention in the research community and is gradually emerging as a mainstream approach for object detection and other perception tasks. However, the field currently lacks a unified and comprehensive benchmark specifically tailored to DETR-based models. To address this issue, we develop a unified, highly modular, and lightweight codebase called detrex, which supports most mainstream DETR-based instance recognition algorithms across fundamental tasks including object detection, segmentation, and pose estimation. We conduct extensive experiments with detrex and perform a comprehensive benchmark of DETR-based models. Moreover, we enhance the performance of detection transformers by refining training hyper-parameters, providing strong baselines for the supported algorithms. We hope that detrex offers the research community a standardized and unified platform to evaluate and compare DETR-based models while fostering a deeper understanding of, and driving advancements in, DETR-based instance recognition. Our code is available at https://github.com/IDEA-Research/detrex. The project is under active development, and we encourage the community to build on the detrex codebase and contribute to it.
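As a concrete illustration of the config-driven workflow the abstract describes, below is a minimal sketch of how a DETR-style model might be assembled and its training hyper-parameters adjusted with the LazyConfig system that detrex builds on from detectron2. The config path and the override field names (`optimizer.lr`, `train.max_iter`) are assumptions for illustration and may not match the released configs exactly.

```python
# Minimal sketch (assumed layout): load a detrex-style LazyConfig, tweak
# training hyper-parameters, and instantiate the model. The config path
# below is hypothetical; see the detrex repository for the actual
# project configs.
from detectron2.config import LazyConfig, instantiate

# Load a lazily-defined project config (path is an assumption).
cfg = LazyConfig.load("projects/dino/configs/dino_r50_4scale_12ep.py")

# The paper highlights hyper-parameter refinement for stronger baselines;
# with LazyConfig, such overrides are plain attribute assignments
# (field names are assumptions).
cfg.optimizer.lr = 1e-4
cfg.train.max_iter = 90_000

# Standard detectron2 LazyConfig pattern: build the model first, then
# point the optimizer config at it before instantiating the optimizer.
model = instantiate(cfg.model)
cfg.optimizer.params.model = model
optimizer = instantiate(cfg.optimizer)
```

In practice, training would typically be launched through the training script shipped with the repository rather than by instantiating these objects by hand; the sketch only shows how the lazily-defined config maps onto concrete objects.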
Authors: Tianhe Ren, Shilong Liu, Feng Li, Hao Zhang, Ailing Zeng, Jie Yang, Xingyu Liao, Ding Jia, Hongyang Li, He Cao, Jianan Wang, Zhaoyang Zeng, Xianbiao Qi, Yuhui Yuan, Jianwei Yang, Lei Zhang