Emergent Mind

RTMW: Real-Time Multi-Person 2D and 3D Whole-body Pose Estimation

(2407.08634)
Published Jul 11, 2024 in cs.CV

Abstract

Whole-body pose estimation is a challenging task that requires simultaneous prediction of keypoints for the body, hands, face, and feet. Whole-body pose estimation aims to predict fine-grained pose information for the human body, including the face, torso, hands, and feet, which plays an important role in the study of human-centric perception and generation and in various applications. In this work, we present RTMW (Real-Time Multi-person Whole-body pose estimation models), a series of high-performance models for 2D/3D whole-body pose estimation. We incorporate RTMPose model architecture with FPN and HEM (Hierarchical Encoding Module) to better capture pose information from different body parts with various scales. The model is trained with a rich collection of open-source human keypoint datasets with manually aligned annotations and further enhanced via a two-stage distillation strategy. RTMW demonstrates strong performance on multiple whole-body pose estimation benchmarks while maintaining high inference efficiency and deployment friendliness. We release three sizes: m/l/x, with RTMW-l achieving a 70.2 mAP on the COCO-Wholebody benchmark, making it the first open-source model to exceed 70 mAP on this benchmark. Meanwhile, we explored the performance of RTMW in the task of 3D whole-body pose estimation, conducting image-based monocular 3D whole-body pose estimation in a coordinate classification manner. We hope this work can benefit both academic research and industrial applications. The code and models have been made publicly available at: https://github.com/open-mmlab/mmpose/tree/main/projects/rtmpose

[Figure: 3D pose estimation task definition.]

Overview

  • The paper introduces RTMW, a series of models aimed at improving real-time multi-person 2D and 3D whole-body pose estimation, building upon the RTMPose framework.

  • PAFPN and HEM modules are integrated to improve feature resolution and encoding, which improves keypoint prediction for localized body areas.

  • The RTMW models perform strongly on benchmarks such as COCO-Wholebody and H3WB, combining accuracy with computational efficiency.

RTMW: Real-Time Multi-Person 2D and 3D Whole-body Pose Estimation

The paper "RTMW: Real-Time Multi-Person 2D and 3D Whole-body Pose Estimation" introduces RTMW, a series of models for whole-body pose estimation, a task that matters for human-computer interaction, virtual avatar animation, and content generation. The work builds on the RTMPose model and adds several methods that address its limitations on whole-body keypoints.

Technical Contributions

RTMW extends the RTMPose architecture with two modules, PAFPN and HEM, which improve feature resolution and feature encoding respectively. These changes matter most for keypoints in small, localized body areas such as the face, hands, and feet. HEM, inspired by the hierarchical encoding of VQVAE-2, is used to capture the fine-grained detail that whole-body pose estimation requires.
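To make the role of PAFPN more concrete, the following is a minimal sketch of path-aggregation feature fusion: a top-down pass followed by a bottom-up pass over a feature pyramid. It is an illustration under simplifying assumptions, not the RTMW implementation: the learned convolutions are omitted, fusion is done by addition, and the helper names (`upsample2x`, `downsample2x`, `pafpn_fuse`) are hypothetical.

```python
import numpy as np

def upsample2x(x):
    """Nearest-neighbour 2x upsampling of a (C, H, W) feature map."""
    return x.repeat(2, axis=1).repeat(2, axis=2)

def downsample2x(x):
    """Stride-2 subsampling; a stand-in for a learned stride-2 convolution."""
    return x[:, ::2, ::2]

def pafpn_fuse(features):
    """Fuse a pyramid of (C, H, W) maps ordered from fine to coarse.

    Assumes every level has the same channel count and half the
    spatial resolution of the previous level.
    """
    n = len(features)
    # Top-down pass: push coarse, semantic features into finer levels.
    top_down = [None] * n
    top_down[-1] = features[-1]
    for i in range(n - 2, -1, -1):
        top_down[i] = features[i] + upsample2x(top_down[i + 1])
    # Bottom-up pass: push fine, localized features back into coarser levels.
    outputs = [top_down[0]]
    for i in range(1, n):
        outputs.append(top_down[i] + downsample2x(outputs[-1]))
    return outputs

# Toy pyramid with three levels of all-ones features.
pyramid = [np.ones((4, 16, 16)), np.ones((4, 8, 8)), np.ones((4, 4, 4))]
fused = pafpn_fuse(pyramid)
```

After both passes, every output level has received information from every input level, which is the property that helps small regions such as hands and faces.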

Key technical advances of RTMW include:

  • PAFPN (Path Aggregation Feature Pyramid Network): Enhances the feature resolution, which is critical for the accurate prediction of localized body parts.
  • HEM (Hierarchical Encoding Module): Improves the encoding of features, particularly benefiting tasks with low-resolution body parts.
  • Joint training on multiple datasets: Unifies annotations from 14 distinct datasets to create a comprehensive training set that includes various body parts, significantly improving model robustness.
  • Two-stage distillation: Further improves the performance of the trained model through a two-stage distillation strategy.
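The joint-training item above depends on mapping each dataset's keypoints into one shared layout. The sketch below shows one plausible way to do that, assuming a 133-keypoint unified layout (the size of the COCO-Wholebody definition) and a hypothetical 21-keypoint hand dataset. The index offset, the mapping table, and the function names are illustrative assumptions, not the paper's actual alignment tables.

```python
import numpy as np

NUM_UNIFIED = 133  # assumed size of the shared whole-body layout

# Hypothetical mapping: source keypoint index -> unified keypoint index.
# Here a 21-keypoint hand dataset is placed into slots 91..111 (assumed offset).
HAND21_TO_UNIFIED = {i: 91 + i for i in range(21)}

def align_keypoints(src_kpts, mapping, num_unified=NUM_UNIFIED):
    """Place source keypoints into the unified layout.

    Returns the (num_unified, 2) coordinate array and a boolean mask
    marking which unified keypoints this dataset actually annotates.
    """
    src_kpts = np.asarray(src_kpts, dtype=float)
    unified = np.zeros((num_unified, 2))
    mask = np.zeros(num_unified, dtype=bool)
    for src_idx, uni_idx in mapping.items():
        unified[uni_idx] = src_kpts[src_idx]
        mask[uni_idx] = True
    return unified, mask

def masked_l2_loss(pred, target, mask):
    """Mean squared error over annotated keypoints only."""
    diff = (np.asarray(pred) - target)[mask]
    return float((diff ** 2).mean()) if mask.any() else 0.0

hand_kpts = np.arange(42).reshape(21, 2)
unified_kpts, valid = align_keypoints(hand_kpts, HAND21_TO_UNIFIED)
```

Masking the loss is one common way to keep unannotated keypoints from penalizing the model; whether RTMW handles missing annotations exactly this way is an assumption.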

Beyond 2D pose estimation, the study also explores 3D whole-body pose estimation (RTMW3D), addressing the complexity of predicting the z-axis using a novel coordinate classification scheme that categorizes depth disparities relative to a predefined root point. This method not only simplifies the z-axis learning problem but also leverages combined training on both 2D and 3D datasets to mitigate dataset limitations.
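The idea of treating depth as a classification problem can be sketched as follows: the depth of each keypoint relative to the root is discretized into bins, and the predicted bin is decoded back to a depth value. The bin count, the depth range, and the function names are assumptions chosen for illustration; the paper's actual settings are not given here.

```python
import numpy as np

def encode_depth(z, z_root, z_range=1.0, num_bins=64):
    """Map root-relative depth to a class index in [0, num_bins - 1].

    Depths are clipped to [-z_range, z_range] around the root (assumed range).
    """
    rel = np.asarray(z, dtype=float) - z_root
    t = (np.clip(rel, -z_range, z_range) + z_range) / (2.0 * z_range)
    return np.minimum((t * num_bins).astype(int), num_bins - 1)

def decode_depth(logits, z_root, z_range=1.0):
    """Turn per-keypoint bin scores of shape (K, num_bins) into depths."""
    num_bins = logits.shape[-1]
    idx = logits.argmax(axis=-1)
    # Use the bin centre as the decoded root-relative depth.
    rel = (idx + 0.5) / num_bins * 2.0 * z_range - z_range
    return rel + z_root

z_true = np.array([2.0, 2.5, 1.5])   # absolute depths of three keypoints
root_depth = 2.0                      # depth of the root point
bins = encode_depth(z_true, root_depth)
one_hot = np.eye(64)[bins]            # stand-in for network output scores
z_decoded = decode_depth(one_hot, root_depth)
```

The decoding error is bounded by half a bin width, so the bin count trades resolution against the number of classes the model must predict.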

Numerical Results and Benchmarking

The RTMW models perform strongly on multiple benchmarks. On the COCO-Wholebody benchmark, RTMW-l achieves 70.2 mAP, making it the first open-source model to exceed 70 mAP on this benchmark. The result reflects performance across the body, face, hands, and feet, indicating robustness and accuracy in fine-grained pose estimation.

For 3D pose estimation, RTMW3D achieves an MPJPE (mean per-joint position error) of 0.056 on the H3WB test set, showing that the approach extends to three-dimensional pose estimation while maintaining computational efficiency.
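For readers unfamiliar with the metric, MPJPE is the mean Euclidean distance between predicted and ground-truth joint positions. A minimal sketch of the standard definition follows; the function name and the toy inputs are illustrative, and the value is expressed in whatever unit the input coordinates use.

```python
import numpy as np

def mpjpe(pred, gt):
    """Mean per-joint position error.

    pred, gt: arrays of shape (..., num_joints, 3) in the same unit.
    Returns the average Euclidean distance over all joints.
    """
    pred = np.asarray(pred, dtype=float)
    gt = np.asarray(gt, dtype=float)
    return float(np.linalg.norm(pred - gt, axis=-1).mean())

# Two joints: one is off by a 3-4-5 triangle, the other is exact.
pred_joints = np.zeros((2, 3))
gt_joints = np.array([[3.0, 4.0, 0.0], [0.0, 0.0, 0.0]])
error = mpjpe(pred_joints, gt_joints)
```

Lower values are better, and benchmark protocols differ in whether poses are aligned to a root joint before the distance is computed.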

Implications and Future Directions

The implications of RTMW's development are multifaceted. Practically, the enhanced accuracy and efficiency of RTMW make it highly suitable for real-time applications in diverse fields such as virtual reality, augmented reality, and motion capture technology in the film and gaming industries. The theoretical advancements, particularly in feature resolution and encoding, provide a foundation for future research to build more sophisticated models capable of even greater performance in both 2D and 3D pose estimation.

Future research directions may include further refinement of the hierarchical encoding techniques and expanding the training datasets to include more diverse scenarios and poses. Another potential development area is the optimization of RTMW and RTMW3D for edge devices, making these models more accessible for real-time applications across various platforms.

In conclusion, the RTMW series presents significant advancements in whole-body pose estimation tasks, offering robust models that meet the dual demands of accuracy and real-time performance. The open-source availability of these models is expected to foster further research and industrial application development, contributing to the broader field of human-centric artificial intelligence systems.
