DeepCap: Monocular Human Performance Capture Using Weak Supervision (2003.08325v1)

Published 18 Mar 2020 in cs.CV

Abstract: Human performance capture is a highly important computer vision problem with many applications in movie production and virtual/augmented reality. Many previous performance capture approaches either required expensive multi-view setups or did not recover dense space-time coherent geometry with frame-to-frame correspondences. We propose a novel deep learning approach for monocular dense human performance capture. Our method is trained in a weakly supervised manner based on multi-view supervision completely removing the need for training data with 3D ground truth annotations. The network architecture is based on two separate networks that disentangle the task into a pose estimation and a non-rigid surface deformation step. Extensive qualitative and quantitative evaluations show that our approach outperforms the state of the art in terms of quality and robustness.

Citations (204)

Summary

  • The paper introduces a novel dual-network architecture that separates pose estimation and non-rigid deformation to capture detailed human motion.
  • It employs a differentiable mesh template with a CNN-based feed-forward process, enabling efficient reconstruction of 3D models from 2D inputs.
  • Extensive evaluations demonstrate higher 3DPCK and lower MPJPE than prior methods, underscoring robustness for practical AR/VR applications.

Insights and Implications of "DeepCap: Monocular Human Performance Capture Using Weak Supervision"

The paper "DeepCap: Monocular Human Performance Capture Using Weak Supervision" investigates the challenge of capturing detailed, dense human performance using monocular inputs. This task is pivotal for applications in virtual and augmented reality, telepresence, and personalised virtual avatar generation. The work proposes a novel deep learning technique that enables this capture without the need for extensive 3D ground truth annotations, relying instead on weak supervision via multi-view data.

Key Contributions

  1. Weakly Supervised Learning Architecture: The authors introduce a dual-network architecture that disentangles the task into two separate networks: one for pose estimation and one for non-rigid surface deformation. This separation lets the model capture both articulated movements and the surface deformations produced by clothing and body-shape dynamics (a minimal sketch of the split follows this list).
  2. Innovative Model Parameterization: The method employs a fully differentiable mesh template parameterized by pose and an embedded deformation graph. This provides a principled mechanism for recovering 3D detail from 2D imagery while keeping the reconstruction coherent across time frames (the standard deformation formulation is written out after the list).
  3. CNN-Based Approach: Leveraging convolutional neural networks (CNNs), the solution infers both articulated motion and non-rigid deformation in a single feed-forward pass, avoiding the expensive per-frame optimization that earlier methods required after prediction.
  4. Performance Evaluation: Through extensive evaluations, the authors demonstrate that their approach captures dense and temporally coherent 3D human models from single-view inputs, outperforming the state of the art in accuracy and robustness. Quantitative results show significant improvements in the 3D percentage of correct keypoints (3DPCK) and mean per-joint position error (MPJPE), indicating effective articulation capture (both metrics are defined in the final sketch below).
  5. Template Utilization: The method requires a personalized 3D mesh template for each subject, acquired once in advance. Multi-view motion sequences of the subject are needed only during training, where they supply the weak supervision; at test time the method runs on monocular input. This training signal significantly enhances generalization and capture fidelity across varied poses and environments.
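To make the disentangled design in contributions 1 and 3 concrete, the following is a minimal PyTorch sketch of the two-branch idea. The class names, the tiny backbone, and the output dimensionalities (e.g., the joint and graph-node counts) are illustrative assumptions, not the authors' implementation:

```python
# Minimal sketch of the disentangled two-network design (assumed shapes).
import torch
import torch.nn as nn

def tiny_encoder(feat_dim=512):
    # Stand-in for the CNN image encoder; the real backbone is much deeper.
    return nn.Sequential(
        nn.Conv2d(3, 64, kernel_size=7, stride=2, padding=3), nn.ReLU(),
        nn.AdaptiveAvgPool2d(1), nn.Flatten(),
        nn.Linear(64, feat_dim), nn.ReLU(),
    )

class PoseNet(nn.Module):
    """Regresses skeletal pose parameters from a single image."""
    def __init__(self, num_joints=23, feat_dim=512):
        super().__init__()
        self.backbone = tiny_encoder(feat_dim)
        # 3 joint angles per joint plus a 6-DoF global root transform.
        self.head = nn.Linear(feat_dim, num_joints * 3 + 6)

    def forward(self, image):
        return self.head(self.backbone(image))

class DefNet(nn.Module):
    """Regresses per-node rotations/translations of the embedded graph."""
    def __init__(self, num_nodes=500, feat_dim=512):
        super().__init__()
        self.backbone = tiny_encoder(feat_dim)
        # 3 rotation + 3 translation parameters per deformation-graph node.
        self.head = nn.Linear(feat_dim, num_nodes * 6)

    def forward(self, image):
        return self.head(self.backbone(image))

# Both predictions come from a single feed-forward pass and jointly drive a
# differentiable template-deformation layer (not shown here).
image = torch.randn(1, 3, 256, 256)
pose_params = PoseNet()(image)    # articulated motion
graph_params = DefNet()(image)    # non-rigid surface deformation
```

Because the two outputs are decoupled, each branch can be supervised with losses tailored to its sub-task, which is consistent with the disentanglement the paper describes.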
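For contribution 2, the embedded deformation graph can be summarized with the standard formulation it builds on; the notation here is generic and may differ from the paper's exact parameterization. Each template vertex $v_i$ is displaced by its neighboring graph nodes $k \in \mathcal{N}(i)$ with fixed skinning weights $w_{i,k}$, node rest positions $g_k$, and network-predicted rotations $R_k$ and translations $t_k$:

$$ v_i' \;=\; \sum_{k \in \mathcal{N}(i)} w_{i,k}\,\bigl[\,R_k\,(v_i - g_k) + g_k + t_k\,\bigr] $$

Every term is differentiable with respect to $R_k$ and $t_k$, which is what allows multi-view image-space losses to be backpropagated into the networks without any 3D ground-truth annotations.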
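The two metrics cited in contribution 4 are standard in 3D human pose evaluation. Below is a minimal NumPy sketch of their common definitions, assuming joint positions in millimetres and the widely used 150 mm 3DPCK threshold; the paper's exact evaluation protocol may differ:

```python
# Standard-definition sketches of MPJPE and 3DPCK (assumed conventions).
import numpy as np

def mpjpe(pred, gt):
    """Mean per-joint position error: mean Euclidean distance over joints."""
    return np.linalg.norm(pred - gt, axis=-1).mean()

def pck3d(pred, gt, threshold_mm=150.0):
    """3DPCK: fraction of joints within `threshold_mm` of the ground truth."""
    return (np.linalg.norm(pred - gt, axis=-1) < threshold_mm).mean()

# Toy example: 23 joints with positions in millimetres.
gt = np.random.rand(23, 3) * 1000.0
pred = gt + np.random.randn(23, 3) * 30.0
print(f"MPJPE: {mpjpe(pred, gt):.1f} mm, 3DPCK: {pck3d(pred, gt):.2%}")
```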

Theoretical and Practical Implications

The proposed methodology offers considerable advantages in contexts where standard multi-view setups are impractical, such as in-the-wild scenarios. By eliminating the dependency on fully annotated 3D data, this approach lowers the barrier to producing high-quality 3D reconstructions, facilitating broader applicability on consumer hardware such as smartphones or AR glasses.

Theoretically, this paper advances the discourse on monocular performance capture by aligning deep learning capabilities with the practical constraints of both controlled and uncontrolled environments. The weakly supervised formulation emphasizes a shift towards efficiency, opening new discussions on the balance between model complexity and computational cost in real-time applications.

Future Work

The authors allude to several avenues for future research. One potential direction is to extend the model's capability to capture detailed facial expressions and hand gestures. Another is enhancing the physical realism of clothing and body interactions through more sophisticated multi-layered modeling of soft tissue dynamics.

In summary, "DeepCap: Monocular Human Performance Capture Using Weak Supervision" presents a substantive contribution to computer vision, particularly in human performance capture. The integration of weak supervision within a well-architected CNN framework potentially heralds improved realism and accuracy in creating digital human avatars, with aspirations extending into more nuanced and immersive virtual experiences.
