3D Interacting Hand Pose Estimation by Hand De-occlusion and Removal (2207.11061v1)

Published 22 Jul 2022 in cs.CV

Abstract: Estimating 3D interacting hand pose from a single RGB image is essential for understanding human actions. Unlike most previous works that directly predict the 3D poses of two interacting hands simultaneously, we propose to decompose the challenging interacting hand pose estimation task and estimate the pose of each hand separately. In this way, it is straightforward to take advantage of the latest research progress on the single-hand pose estimation system. However, hand pose estimation in interacting scenarios is very challenging, due to (1) severe hand-hand occlusion and (2) ambiguity caused by the homogeneous appearance of hands. To tackle these two challenges, we propose a novel Hand De-occlusion and Removal (HDR) framework to perform hand de-occlusion and distractor removal. We also propose the first large-scale synthetic amodal hand dataset, termed Amodal InterHand Dataset (AIH), to facilitate model training and promote the development of the related research. Experiments show that the proposed method significantly outperforms previous state-of-the-art interacting hand pose estimation approaches. Codes and data are available at https://github.com/MengHao666/HDR.

Summary

The paper introduces a novel HDR framework that decomposes hand pose estimation into modules for amodal segmentation, de-occlusion, and removal.
It leverages a single hand pose estimator to overcome challenges posed by occlusion and left-right ambiguity in interacting hands.
Quantitative results on InterHand2.6M show a significant reduction in MPJPE compared with state-of-the-art two-hand methods; the new AIH dataset is used to train the proposed modules.

3D Interacting Hand Pose Estimation via Hand De-occlusion and Removal

The paper introduces an approach to estimating 3D interacting hand poses from a single RGB image, using a framework termed Hand De-occlusion and Removal (HDR). The problem matters for applications such as human-computer interaction and augmented reality. Its difficulty comes chiefly from (1) severe occlusion between the interacting hands and (2) ambiguity caused by the similar appearance of the left and right hands. Previous methods have mostly predicted the 3D poses of both hands jointly, which makes occlusion and ambiguity hard to handle.

HDR Framework

The authors propose decomposing the hand pose estimation task, estimating each hand separately, thus leveraging recent advancements in single-hand pose estimation. The HDR framework comprises three crucial components: Hand Amodal Segmentation Module (HASM), Hand De-occlusion and Removal Module (HDRM), and Single Hand Pose Estimator (SHPE).

HASM generates both amodal and visible segmentation masks for each hand. These masks are critical for understanding occluded parts and distracting elements, essential for the subsequent de-occlusion and removal processes.
HDRM focuses on mitigating occlusion and confusion caused by similar appearances through de-occlusion and removal processes. The module reconstructs occluded hand parts and removes the distracting hand to simplify the input to the hand pose estimator.
SHPE is an existing single-hand pose estimator, which benefits from the simplified input, overcoming issues related to occlusion and ambiguity.
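The data flow through the three components can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: `hdr_pipeline`, the stub stages, the mask dictionary layout, and the choice of 21 joints per hand are all hypothetical, and the stub "inpainting" simply zeroes pixels where a learned module would recover or erase content.

```python
import numpy as np

NUM_JOINTS = 21  # assumed number of joints per hand


def hdr_pipeline(image, hasm, hdrm, shpe):
    """Estimate each hand separately: segment, de-occlude and remove, then run a single-hand estimator."""
    masks = hasm(image)  # assumed layout: {side: {"amodal": bool mask, "visible": bool mask}}
    poses = {}
    for side, other in (("left", "right"), ("right", "left")):
        amodal = masks[side]["amodal"]
        visible = masks[side]["visible"]
        # Part of the target hand that is hidden: inside the amodal mask but not visible.
        occluded = amodal & ~visible
        # Part of the other hand that lies outside the target hand: the distractor to remove.
        distractor = masks[other]["visible"] & ~amodal
        clean = hdrm(image, occluded, distractor)
        poses[side] = shpe(clean)
    return poses


# Hypothetical stand-ins so the sketch runs end to end.
def stub_hasm(image):
    h, w = image.shape[:2]
    left_amodal = np.zeros((h, w), dtype=bool)
    left_amodal[:, : w // 2 + 1] = True
    right_amodal = np.zeros((h, w), dtype=bool)
    right_amodal[:, w // 2 - 1:] = True
    # Pretend the right hand lies on top of the left hand.
    right_visible = right_amodal.copy()
    left_visible = left_amodal & ~right_amodal
    return {
        "left": {"amodal": left_amodal, "visible": left_visible},
        "right": {"amodal": right_amodal, "visible": right_visible},
    }


def stub_hdrm(image, occluded, distractor):
    out = image.copy()
    out[occluded | distractor] = 0  # placeholder for learned de-occlusion and removal
    return out


def stub_shpe(image):
    return np.zeros((NUM_JOINTS, 3))  # placeholder 3D joint positions


image = np.ones((8, 8, 3), dtype=np.float32)
poses = hdr_pipeline(image, stub_hasm, stub_hdrm, stub_shpe)
```

The point of the structure is that the single-hand estimator is called once per hand on an image that, ideally, shows only that hand in full, so any off-the-shelf single-hand model could be plugged in as `shpe`.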

Quantitative evaluations on the InterHand2.6M dataset show that the HDR framework significantly outperforms previous state-of-the-art two-hand methods. It lowers the mean per-joint position error (MPJPE), especially in challenging scenarios involving heavy interaction.
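For reference, MPJPE is the average Euclidean distance between predicted and ground-truth joint positions. The sketch below assumes root-relative evaluation, in which both poses are translated so that a chosen root joint sits at the origin before the distances are taken; the function name, the `root_index` parameter, and its default are illustrative choices rather than details from the paper.

```python
import numpy as np


def mpjpe(pred, gt, root_index=0):
    """Mean per-joint position error for arrays of shape (..., num_joints, 3)."""
    pred = np.asarray(pred, dtype=float)
    gt = np.asarray(gt, dtype=float)
    # Express both poses relative to the root joint (assumed evaluation protocol).
    pred_rel = pred - pred[..., root_index:root_index + 1, :]
    gt_rel = gt - gt[..., root_index:root_index + 1, :]
    # Per-joint Euclidean distance, averaged over all joints (and samples, if batched).
    return float(np.linalg.norm(pred_rel - gt_rel, axis=-1).mean())


gt_pose = np.zeros((21, 3))
shifted_pose = gt_pose + np.array([10.0, 0.0, 0.0])  # pure translation of the whole hand
one_joint_off = gt_pose.copy()
one_joint_off[1] = [3.0, 4.0, 0.0]  # a single joint displaced by 5 units

error_shifted = mpjpe(shifted_pose, gt_pose)
error_one_joint = mpjpe(one_joint_off, gt_pose)
```

Because of the root alignment, a pure translation of the whole hand contributes no error, while a single joint displaced by 5 units contributes 5 divided by the number of joints. The result is in whatever unit the joint coordinates use.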

Creation and Utilization of Amodal InterHand Dataset (AIH)

The paper also introduces the Amodal InterHand Dataset (AIH), which is instrumental in training the proposed modules. AIH is synthetically generated and comprises two sub-datasets: AIH_Syn and AIH_Render.

AIH_Syn employs copy-and-paste techniques to generate realistic-looking hand interactions, preserving the biomechanical structure, albeit sometimes leading to unnatural hand positions.
AIH_Render offers greater fidelity in physical interactions by rendering hand meshes, though it's prone to appearance gaps due to synthetic textures.
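The copy-and-paste idea behind AIH_Syn can be illustrated with a small compositing routine: pasting one hand over another yields an occluded image together with the amodal (full) and visible masks of the hand underneath. This is only a sketch under simple assumptions (aligned image sizes, binary masks, hard-edged pasting); `paste_occluder` and its arguments are hypothetical names, not the authors' data-generation code.

```python
import numpy as np


def paste_occluder(target_img, target_mask, occluder_img, occluder_mask):
    """Paste an occluding hand over a target hand.

    Returns the composite image plus the target hand's amodal and visible masks.
    """
    occ = occluder_mask.astype(bool)
    composite = target_img.copy()
    composite[occ] = occluder_img[occ]  # occluder pixels overwrite the target image
    amodal = target_mask.astype(bool)   # full extent of the target hand, hidden parts included
    visible = amodal & ~occ             # what remains visible after pasting
    return composite, amodal, visible


# Toy 4x4 example: target hand in the top-left 3x3 block, occluder in the bottom-right 2x2 block.
target_img = np.full((4, 4, 3), 1, dtype=np.uint8)
occluder_img = np.full((4, 4, 3), 2, dtype=np.uint8)
target_mask = np.zeros((4, 4), dtype=bool)
target_mask[0:3, 0:3] = True
occluder_mask = np.zeros((4, 4), dtype=bool)
occluder_mask[2:4, 2:4] = True

composite, amodal, visible = paste_occluder(target_img, target_mask, occluder_img, occluder_mask)
```

In this toy case the two masks overlap in exactly one pixel, so the visible mask has one pixel fewer than the amodal mask. The amodal mask comes for free because the target hand was fully visible before pasting, which is what makes synthetic generation a practical source of amodal ground truth.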

The combination of these datasets ensures comprehensive coverage of hand poses and interactions, facilitating robust model training.

Implications and Future Directions

Together, the HDR framework and the AIH dataset advance hand pose estimation under interaction and occlusion. The approach depends on de-occlusion and removal working well, and it shows a practical use of amodal perception in a 3D vision task. Future work could improve the quality of the recovered images and swap more advanced modules in as HDR's building blocks. Potential application areas include human-computer interaction, robotics, and immersive AR/VR experiences. The remaining challenge is generalization: making these models robust across diverse environments and in unconstrained real-world scenarios.

GitHub

GitHub - MengHao666/HDR: Official code and data for HDR (ECCV 2022) (102 stars)