
RTMO: Towards High-Performance One-Stage Real-Time Multi-Person Pose Estimation (2312.07526v2)

Published 12 Dec 2023 in cs.CV

Abstract: Real-time multi-person pose estimation presents significant challenges in balancing speed and precision. While two-stage top-down methods slow down as the number of people in the image increases, existing one-stage methods often fail to simultaneously deliver high accuracy and real-time performance. This paper introduces RTMO, a one-stage pose estimation framework that seamlessly integrates coordinate classification by representing keypoints using dual 1-D heatmaps within the YOLO architecture, achieving accuracy comparable to top-down methods while maintaining high speed. We propose a dynamic coordinate classifier and a tailored loss function for heatmap learning, specifically designed to address the incompatibilities between coordinate classification and dense prediction models. RTMO outperforms state-of-the-art one-stage pose estimators, achieving 1.1% higher AP on COCO while operating about 9 times faster with the same backbone. Our largest model, RTMO-l, attains 74.8% AP on COCO val2017 and 141 FPS on a single V100 GPU, demonstrating its efficiency and accuracy. The code and models are available at https://github.com/open-mmlab/mmpose/tree/main/projects/rtmo.

Authors (6)
  1. Peng Lu
  2. Tao Jiang
  3. Yining Li
  4. Xiangtai Li
  5. Kai Chen
  6. Wenming Yang
Citations (12)

Summary

  • The paper introduces RTMO, a one-stage framework that enhances multi-person pose estimation through dynamic coordinate classification and an MLE-based loss.
  • It outperforms prior one-stage methods by 1.1% AP on COCO, with the largest model, RTMO-l, reaching 74.8% AP at 141 FPS on a V100 GPU.
  • The approach efficiently balances speed and accuracy within the YOLO architecture, setting a new benchmark for real-time pose estimation.

High-Performance One-Stage Real-Time Multi-Person Pose Estimation: The RTMO Approach

The field of multi-person pose estimation (MPPE) is crucial within computer vision, offering applications that range from augmented reality to precision sports analytics. The increasing demand for real-time processing in these applications poses significant challenges, particularly in reconciling the conflicting demands of speed and accuracy. This paper introduces RTMO, a novel one-stage framework designed to meet these challenges using innovative methods within the YOLO architecture.

Key Innovations and Methodology

RTMO targets the limitations of existing one-stage methods, which often fall short of delivering both high speed and high accuracy. The framework adopts coordinate classification, representing each keypoint coordinate as a distribution over two 1-D heatmaps (one per axis) to achieve precise localization. Unlike conventional regression approaches, this classification formulation handles spatial ambiguity more effectively. Central to the approach is the Dynamic Coordinate Classifier (DCC), which introduces dynamic bin allocation and encoding strategies so that the bins are spent efficiently within a localized bounding-box region rather than across the whole image.
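To make the dual 1-D heatmap idea concrete, the sketch below encodes a keypoint coordinate as a classification target over bins placed across a (hypothetical) person box, then decodes it back via the expectation. This is a minimal illustration of coordinate classification with dynamic bin placement, not the paper's exact DCC; the bin count, sigma, and box extent are made-up values.

```python
import numpy as np

def make_bins(lo, hi, num_bins):
    """Place num_bins coordinate bins evenly across [lo, hi]."""
    return np.linspace(lo, hi, num_bins)

def encode_1d_heatmap(coord, bins, sigma=1.0):
    """Gaussian-smoothed classification target over the bins."""
    target = np.exp(-0.5 * ((bins - coord) / sigma) ** 2)
    return target / target.sum()

def decode_1d_heatmap(probs, bins):
    """Expected coordinate under the predicted distribution."""
    return float(np.dot(probs, bins))

# Dynamic bin allocation: the bins span the (expanded) predicted box,
# not the whole image, so resolution concentrates on the person.
x0, x1 = 40.0, 120.0              # hypothetical box extent along x
bins_x = make_bins(x0, x1, 64)
probs = encode_1d_heatmap(coord=77.3, bins=bins_x, sigma=2.0)
recovered = decode_1d_heatmap(probs, bins_x)
```

Because the bins track the box rather than the full image, the same number of bins yields finer effective resolution for small people, which is one reason coordinate classification can rival top-down accuracy inside a one-stage detector.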

Moreover, the paper proposes a Maximum Likelihood Estimation (MLE) based loss for heatmap learning that lets the model adapt to per-sample difficulty. By predicting an uncertainty for each sample, the loss automatically balances optimization between hard and easy examples.
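The intuition can be sketched with a generic uncertainty-weighted negative log-likelihood in the spirit of the paper's MLE loss (this is an illustrative Gaussian form, not RTMO's exact formulation; the numeric values are made up):

```python
import numpy as np

def mle_loss(pred, target, log_sigma):
    """Gaussian negative log-likelihood with a per-sample predicted
    uncertainty. A larger sigma down-weights the squared error of a
    hard sample, while the log(sigma) term penalizes unbounded
    uncertainty growth."""
    sigma = np.exp(log_sigma)
    return 0.5 * ((pred - target) / sigma) ** 2 + log_sigma

# For a fixed error |pred - target|, the loss is minimized at
# sigma = |pred - target|: predicted uncertainty tracks difficulty.
easy = mle_loss(pred=10.0, target=10.1, log_sigma=np.log(0.1))
hard = mle_loss(pred=10.0, target=13.0, log_sigma=np.log(3.0))
```

In this framing, easy samples with small predicted sigma contribute sharp gradients, while hard samples are softened rather than dominating training, which matches the paper's stated goal of balancing optimization across sample difficulty.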

Results and Contributions

The RTMO framework outperforms its one-stage counterparts, achieving a 1.1% higher Average Precision (AP) on the COCO dataset while running roughly nine times faster with the same backbone. The largest model, RTMO-l, attains 74.8% AP on COCO val2017 at 141 FPS on a single V100 GPU. These results position RTMO as a leading one-stage pose estimator: comparable in accuracy to real-time top-down methods, yet far more efficient in crowded, multi-person scenes where top-down pipelines slow down with each additional person.

Implications and Future Directions

The practical implications of RTMO are substantial: it offers a robust tool for real-time applications where both speed and precision matter. The framework also sets a reference point for dense-prediction design in detection-style architectures. Integrating coordinate classification and dynamic bin strategies into the YOLO architecture suggests promising directions for future work, including applying MLE-based losses to other dense-prediction tasks in computer vision.

In conclusion, RTMO's innovative approach to one-stage pose estimation exemplifies a significant step forward, not only improving current methodologies but also paving the way for continued innovation in the field. As the landscape of computer vision evolves, frameworks like RTMO will undoubtedly influence next-generation systems and applications.
