RTMO: Towards High-Performance One-Stage Real-Time Multi-Person Pose Estimation (2312.07526v2)
Abstract: Real-time multi-person pose estimation presents significant challenges in balancing speed and precision. While two-stage top-down methods slow down as the number of people in the image increases, existing one-stage methods often fail to simultaneously deliver high accuracy and real-time performance. This paper introduces RTMO, a one-stage pose estimation framework that seamlessly integrates coordinate classification by representing keypoints using dual 1-D heatmaps within the YOLO architecture, achieving accuracy comparable to top-down methods while maintaining high speed. We propose a dynamic coordinate classifier and a tailored loss function for heatmap learning, specifically designed to address the incompatibilities between coordinate classification and dense prediction models. RTMO outperforms state-of-the-art one-stage pose estimators, achieving 1.1% higher AP on COCO while operating about 9 times faster with the same backbone. Our largest model, RTMO-l, attains 74.8% AP on COCO val2017 and 141 FPS on a single V100 GPU, demonstrating its efficiency and accuracy. The code and models are available at https://github.com/open-mmlab/mmpose/tree/main/projects/rtmo.
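To make "representing keypoints using dual 1-D heatmaps" concrete, here is a minimal sketch of how a keypoint can be read off two independent 1-D distributions, one over horizontal bins and one over vertical bins. It shows a generic coordinate-classification decode in NumPy, not RTMO's dynamic coordinate classifier or its loss. The function name `decode_dual_1d_heatmaps`, the fixed bin layout, and the product-of-peaks confidence score are all assumptions made for illustration.

```python
import numpy as np


def softmax(logits, axis=-1):
    """Numerically stable softmax along the given axis."""
    shifted = logits - logits.max(axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=axis, keepdims=True)


def decode_dual_1d_heatmaps(x_logits, y_logits, x_bins, y_bins):
    """Decode K keypoints from a pair of 1-D heatmaps (hypothetical helper).

    x_logits: (K, Bx) scores over horizontal bins, one row per keypoint.
    y_logits: (K, By) scores over vertical bins, one row per keypoint.
    x_bins:   (Bx,) pixel coordinate assigned to each horizontal bin.
    y_bins:   (By,) pixel coordinate assigned to each vertical bin.
    Returns (K, 2) coordinates and (K,) confidence scores.
    """
    px = softmax(x_logits)  # per-keypoint distribution over x bins
    py = softmax(y_logits)  # per-keypoint distribution over y bins
    xs = x_bins[px.argmax(axis=-1)]  # most likely horizontal bin
    ys = y_bins[py.argmax(axis=-1)]  # most likely vertical bin
    # Assumed confidence heuristic: product of the two peak probabilities.
    scores = px.max(axis=-1) * py.max(axis=-1)
    return np.stack([xs, ys], axis=-1), scores


# Toy example: 2 keypoints, bins spaced 80 px apart on a 640x480 canvas.
x_bins = np.linspace(0.0, 640.0, 9)
y_bins = np.linspace(0.0, 480.0, 7)
x_logits = np.zeros((2, 9))
y_logits = np.zeros((2, 7))
x_logits[0, 3] = 5.0  # keypoint 0 peaks at x bin 3
y_logits[0, 2] = 5.0  # keypoint 0 peaks at y bin 2
x_logits[1, 8] = 5.0  # keypoint 1 peaks at the last x bin
y_logits[1, 0] = 5.0  # keypoint 1 peaks at the first y bin

coords, scores = decode_dual_1d_heatmaps(x_logits, y_logits, x_bins, y_bins)
print(coords)
print(scores)
```

Classifying each axis separately means the output size grows with the sum of the bin counts rather than their product, which is one reason this representation is cheaper than a full 2-D heatmap per keypoint.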
- Peng Lu
- Tao Jiang
- Yining Li
- Xiangtai Li
- Kai Chen
- Wenming Yang