
RTMO: Towards High-Performance One-Stage Real-Time Multi-Person Pose Estimation (2312.07526v2)

Published 12 Dec 2023 in cs.CV

Abstract: Real-time multi-person pose estimation presents significant challenges in balancing speed and precision. While two-stage top-down methods slow down as the number of people in the image increases, existing one-stage methods often fail to simultaneously deliver high accuracy and real-time performance. This paper introduces RTMO, a one-stage pose estimation framework that seamlessly integrates coordinate classification by representing keypoints using dual 1-D heatmaps within the YOLO architecture, achieving accuracy comparable to top-down methods while maintaining high speed. We propose a dynamic coordinate classifier and a tailored loss function for heatmap learning, specifically designed to address the incompatibilities between coordinate classification and dense prediction models. RTMO outperforms state-of-the-art one-stage pose estimators, achieving 1.1% higher AP on COCO while operating about 9 times faster with the same backbone. Our largest model, RTMO-l, attains 74.8% AP on COCO val2017 and 141 FPS on a single V100 GPU, demonstrating its efficiency and accuracy. The code and models are available at https://github.com/open-mmlab/mmpose/tree/main/projects/rtmo.

Authors (6)
  1. Peng Lu
  2. Tao Jiang
  3. Yining Li
  4. Xiangtai Li
  5. Kai Chen
  6. Wenming Yang
Citations (12)

Summary

  • The paper introduces RTMO, a one-stage framework that enhances multi-person pose estimation through dynamic coordinate classification and an MLE-based loss.
  • It outperforms prior one-stage methods by 1.1% AP on COCO, with the largest model, RTMO-l, reaching 74.8% AP at 141 FPS on a V100 GPU.
  • The approach efficiently balances speed and accuracy within the YOLO architecture, setting a new benchmark for real-time pose estimation.

High-Performance One-Stage Real-Time Multi-Person Pose Estimation: The RTMO Approach

The field of multi-person pose estimation (MPPE) is crucial within computer vision, offering applications that range from augmented reality to precision sports analytics. The increasing demand for real-time processing in these applications poses significant challenges, particularly in reconciling the conflicting demands of speed and accuracy. This paper introduces RTMO, a novel one-stage framework designed to meet these challenges using innovative methods within the YOLO architecture.

Key Innovations and Methodology

RTMO targets the limitations of existing one-stage methods, which often fall short of delivering both high speed and high accuracy. The framework adopts coordinate classification, representing each keypoint coordinate as a distribution over two 1-D heatmaps (one per axis) to achieve precise localization. Unlike conventional regression approaches, this classification formulation handles spatial ambiguity more effectively. Central to the approach is the Dynamic Coordinate Classifier (DCC), which introduces dynamic bin allocation and encoding strategies so that the bins are spent efficiently within a localized bounding-box region rather than across the whole image.
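To make the dual 1-D heatmap idea concrete, the sketch below encodes a keypoint coordinate as a classification target over bins placed across a (hypothetical) person box, then decodes it back via the expectation. This is a minimal illustration of coordinate classification with dynamic bin placement, not the paper's exact DCC; the bin count, sigma, and box extent are made-up values.

```python
import numpy as np

def make_bins(lo, hi, num_bins):
    """Place num_bins coordinate bins evenly across [lo, hi]."""
    return np.linspace(lo, hi, num_bins)

def encode_1d_heatmap(coord, bins, sigma=1.0):
    """Gaussian-smoothed classification target over the bins."""
    target = np.exp(-0.5 * ((bins - coord) / sigma) ** 2)
    return target / target.sum()

def decode_1d_heatmap(probs, bins):
    """Expected coordinate under the predicted distribution."""
    return float(np.dot(probs, bins))

# Dynamic bin allocation: the bins span the (expanded) predicted box,
# not the whole image, so resolution concentrates on the person.
x0, x1 = 40.0, 120.0              # hypothetical box extent along x
bins_x = make_bins(x0, x1, 64)
probs = encode_1d_heatmap(coord=77.3, bins=bins_x, sigma=2.0)
recovered = decode_1d_heatmap(probs, bins_x)
```

Because the bins track the box rather than the full image, the same number of bins yields finer effective resolution for small people, which is one reason coordinate classification can rival top-down accuracy inside a one-stage detector.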

Moreover, the paper proposes a Maximum Likelihood Estimation (MLE) based loss for heatmap learning that lets the model adapt to per-sample difficulty. By predicting an uncertainty for each sample, the loss automatically balances optimization between hard and easy examples.
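The intuition can be sketched with a generic uncertainty-weighted negative log-likelihood in the spirit of the paper's MLE loss (this is an illustrative Gaussian form, not RTMO's exact formulation; the numeric values are made up):

```python
import numpy as np

def mle_loss(pred, target, log_sigma):
    """Gaussian negative log-likelihood with a per-sample predicted
    uncertainty. A larger sigma down-weights the squared error of a
    hard sample, while the log(sigma) term penalizes unbounded
    uncertainty growth."""
    sigma = np.exp(log_sigma)
    return 0.5 * ((pred - target) / sigma) ** 2 + log_sigma

# For a fixed error |pred - target|, the loss is minimized at
# sigma = |pred - target|: predicted uncertainty tracks difficulty.
easy = mle_loss(pred=10.0, target=10.1, log_sigma=np.log(0.1))
hard = mle_loss(pred=10.0, target=13.0, log_sigma=np.log(3.0))
```

In this framing, easy samples with small predicted sigma contribute sharp gradients, while hard samples are softened rather than dominating training, which matches the paper's stated goal of balancing optimization across sample difficulty.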

Results and Contributions

The RTMO framework outperforms its one-stage counterparts, achieving a 1.1% higher Average Precision (AP) on the COCO dataset while running roughly nine times faster with the same backbone. The largest model, RTMO-l, attains 74.8% AP on COCO val2017 at 141 FPS on a single V100 GPU. These results position RTMO as a leading one-stage pose estimator: comparable in accuracy to real-time top-down methods, yet far more efficient in crowded, multi-person scenes where top-down pipelines slow down with each additional person.

Implications and Future Directions

The practical implications of RTMO are substantial: it offers a robust tool for real-time applications where both speed and precision matter. The framework also sets a reference point for dense-prediction design in detection-style architectures. Integrating coordinate classification and dynamic bin strategies into the YOLO architecture suggests promising directions for future work, including applying MLE-based losses to other dense-prediction tasks in computer vision.

In conclusion, RTMO's innovative approach to one-stage pose estimation exemplifies a significant step forward, not only improving current methodologies but also paving the way for continued innovation in the field. As the landscape of computer vision evolves, frameworks like RTMO will undoubtedly influence next-generation systems and applications.
