Self-supervised Learning of Motion Capture (1712.01337v1)

Published 4 Dec 2017 in cs.CV

Abstract: Current state-of-the-art solutions for motion capture from a single camera are optimization driven: they optimize the parameters of a 3D human model so that its re-projection matches measurements in the video (e.g. person segmentation, optical flow, keypoint detections etc.). Optimization models are susceptible to local minima. This has been the bottleneck that forced using clean green-screen like backgrounds at capture time, manual initialization, or switching to multiple cameras as input resource. In this work, we propose a learning based motion capture model for single camera input. Instead of optimizing mesh and skeleton parameters directly, our model optimizes neural network weights that predict 3D shape and skeleton configurations given a monocular RGB video. Our model is trained using a combination of strong supervision from synthetic data, and self-supervision from differentiable rendering of (a) skeletal keypoints, (b) dense 3D mesh motion, and (c) human-background segmentation, in an end-to-end framework. Empirically we show our model combines the best of both worlds of supervised learning and test-time optimization: supervised learning initializes the model parameters in the right regime, ensuring good pose and surface initialization at test time, without manual effort. Self-supervision by back-propagating through differentiable rendering allows (unsupervised) adaptation of the model to the test data, and offers much tighter fit than a pretrained fixed model. We show that the proposed model improves with experience and converges to low-error solutions where previous optimization methods fail.

Citations (318)

Summary

  • The paper proposes a neural framework for 3D motion capture that combines supervised pretraining with self-supervised fine-tuning via differentiable rendering.
  • The method minimizes reconstruction errors by leveraging keypoint, motion, and segmentation reprojection losses to outperform traditional optimization approaches.
  • The approach enables adaptable motion capture from monocular video, opening new avenues for applications in training, rehabilitation, and interactive entertainment.

Self-supervised Learning of Motion Capture: A Comprehensive Analysis

The paper "Self-supervised Learning of Motion Capture," authored by Hsiao-Yu Fish Tung et al., proposes a novel approach to 3D motion capture from monocular video input, pivoting from the traditional optimization-driven methods to a learning-based framework. This shift addresses a fundamental limitation in existing solutions, primarily their susceptibility to local minima that necessitates controlled environments or multi-camera setups to function effectively.

Methodology

The researchers introduce a neural model that predicts 3D human shape and skeletal configuration directly from RGB video. The architectural innovation lies in combining supervised learning with self-supervised fine-tuning: the model is first pretrained on large-scale synthetic data to produce robust initial pose and shape estimates, and is then adapted through end-to-end differentiable rendering, using reprojection errors over keypoints, dense 3D mesh motion, and segmentation to refine those estimates on the test data itself.
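To make the two-stage regime concrete, here is a minimal PyTorch-style sketch. The tiny backbone, the dummy tensors standing in for synthetic training data, and the parameter sizes are illustrative assumptions, not the paper's actual architecture:

```python
# A minimal sketch of the two-stage regime (illustrative, not the authors'
# code). The tiny backbone, dummy tensors, and parameter sizes are assumptions.
import torch
import torch.nn as nn

class PoseShapeNet(nn.Module):
    """Predict skeleton (24 joint rotations, axis-angle) and shape
    (10 coefficients) parameters from an RGB frame."""
    def __init__(self):
        super().__init__()
        self.backbone = nn.Sequential(
            nn.Conv2d(3, 16, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten())
        self.head = nn.Linear(16, 24 * 3 + 10)

    def forward(self, frames):
        return self.head(self.backbone(frames))

model = PoseShapeNet()
opt = torch.optim.Adam(model.parameters(), lr=1e-4)

# Stage 1: supervised pretraining -- synthetic frames come with ground-truth
# pose/shape parameters, so a direct regression loss applies.
frames = torch.randn(8, 3, 128, 128)      # dummy batch of RGB frames
gt_params = torch.randn(8, 24 * 3 + 10)   # dummy ground-truth parameters
loss = nn.functional.mse_loss(model(frames), gt_params)
opt.zero_grad(); loss.backward(); opt.step()

# Stage 2: at test time the same optimizer keeps running, but the supervised
# loss is replaced by the differentiable reprojection losses described below,
# computed purely from the model's own predictions and the unlabeled video.
```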

Key to this process is the use of both supervised and unsupervised components:

  1. Supervised Learning: Initially grounded in synthetic data, the model leverages paired supervision to predict key skeletal and shape parameters.
  2. Self-supervised Learning: Building on the pretrained parameters, self-supervised adaptation occurs through differentiable reprojection losses (sketched in code after this list). These include:
    • Keypoint re-projection error: Calculating error between projected 3D joint configurations and observed 2D keypoints.
    • Motion re-projection error: Differentiably matching 3D motion vectors against 2D optical flow estimates.
    • Segmentation re-projection error: Evaluating how accurately projected segments of the 3D mesh align with 2D segmentations detected in the video.
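
The following simplified sketch shows differentiable forms of these three terms. The weak-perspective camera, the per-vertex flow samples, and the Chamfer-style mask term are assumptions made for brevity; the paper's exact formulations (e.g. its segmentation loss and camera model) differ in detail:

```python
# Simplified, differentiable stand-ins for the three reprojection terms.
# The weak-perspective camera, per-vertex flow sampling, and Chamfer-style
# mask term are simplifying assumptions, not the paper's exact losses.
import torch

def project(points3d, scale, trans):
    """Weak-perspective projection of (N, 3) points to (N, 2) pixels."""
    return scale * points3d[:, :2] + trans

def keypoint_loss(joints3d, keypoints2d, scale, trans):
    """Distance between projected 3D joints and detected 2D keypoints."""
    return ((project(joints3d, scale, trans) - keypoints2d) ** 2).sum(-1).mean()

def motion_loss(verts_t, verts_t1, flow2d, scale, trans):
    """Match the projected displacement of each mesh vertex between two
    frames against the 2D optical flow sampled at that vertex's pixel."""
    disp2d = project(verts_t1, scale, trans) - project(verts_t, scale, trans)
    return ((disp2d - flow2d) ** 2).sum(-1).mean()

def segmentation_loss(verts, mask_pixels, scale, trans):
    """Chamfer-style agreement: projected vertices should lie on the
    detected person mask, and every mask pixel should be near a vertex."""
    proj = project(verts, scale, trans)    # (V, 2)
    d = torch.cdist(proj, mask_pixels)     # (V, P) pairwise pixel distances
    return d.min(dim=1).values.mean() + d.min(dim=0).values.mean()
```

Summing these terms and back-propagating through the projection updates the network weights rather than per-frame mesh parameters, which is what lets the supervised initialization steer the self-supervised refinement away from the local minima that trap direct optimization.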

Empirical Findings

The empirical results demonstrate significant advances over traditional optimization methods. Notably, the proposed model shows superior performance in both controlled synthetic settings (the SURREAL dataset) and real-world scenarios (the Human3.6M dataset), highlighting its ability to adapt across diverse data distributions.

  • Surface and Skeletal Precision: The model achieves lower reconstruction errors than both direct optimization techniques and pretrained models without self-supervised refinement, indicating a greater capacity to capture human motion accurately.
  • Adaptability: The model benefits significantly from self-supervised fine-tuning, continuing to improve as it processes more test data, in contrast to static pretrained models.

Implications and Future Directions

From a practical standpoint, this approach opens avenues for deploying monocular video-based motion capture in various applications such as automated training systems, rehabilitation, and interactive entertainment, without the overhead of controlled settings or extensive manual calibration.

Theoretically, this integration of differentiable rendering and self-supervised learning suggests a promising framework for applications beyond human tracking, potentially extending to any articulated or deformable object. Future work could explore iterative feedback mechanisms, incorporate residual free-form deformations to further refine shape representations, and adapt the methodology to a wider range of subjects and environments.

In conclusion, this paper makes a substantial contribution to the motion capture field by transitioning from static, optimization-based approaches to adaptive, self-learning systems. As this line of inquiry progresses, it could lead to significant advancements in AI-driven perception and understanding of 3D environments.