- The paper introduces a novel framework that predicts past, present, and future human motions by learning temporal dynamics from videos.
- It employs a semi-supervised approach combining fully labeled, 2D-labeled, and pseudo-ground truth data to enhance 3D pose estimation.
- Results demonstrate state-of-the-art temporal smoothness with significant reductions in acceleration error on benchmark datasets.
Learning 3D Human Dynamics from Video
The paper "Learning 3D Human Dynamics from Video" by Kanazawa et al. introduces a method for reconstructing and predicting 3D human motion from video. The proposed framework, Human Mesh and Motion Recovery (HMMR), learns a temporal representation of human movement from video sequences, enabling prediction of 3D human pose and dynamics. The approach leverages both labeled and unlabeled data, addressing the scarcity of 3D ground-truth annotations for in-the-wild video.
Core Contributions
The authors propose a temporal encoding strategy that learns 3D human dynamics by predicting not only the current pose but also past and future motion. A 1D temporal encoder derives features from a window of video frames, from which smooth 3D mesh predictions are regressed. Because supervision can come from 2D keypoints alone, the model can ingest large volumes of unlabeled video, using pseudo-ground-truth 2D poses obtained from off-the-shelf 2D pose detectors.
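The encoder-plus-prediction-heads idea can be sketched as follows. This is an illustrative toy, not the authors' code: all shapes, layer sizes, and variable names are assumptions, and the real model uses learned CNN features and trained weights rather than random ones.

```python
# Sketch (assumed shapes/names): a 1D temporal convolution pools per-frame
# features into a "movie strip" feature, and separate linear heads regress
# poses one step in the past, the present, and one step in the future.
import numpy as np

rng = np.random.default_rng(0)

T, F, P = 20, 64, 72      # frames, per-frame feature dim, pose-parameter dim
K = 5                     # temporal kernel width (context on both sides)

phi = rng.normal(size=(T, F))                 # per-frame image features (stand-in for a CNN)
W_conv = rng.normal(size=(K, F, F)) * 0.01    # 1D temporal conv weights
heads = {d: rng.normal(size=(F, P)) * 0.01 for d in (-1, 0, +1)}  # past/present/future

def temporal_encode(phi, W_conv):
    """Pad in time and apply a 1D convolution over the frame axis."""
    T, F = phi.shape
    K = W_conv.shape[0]
    pad = K // 2
    padded = np.pad(phi, ((pad, pad), (0, 0)), mode="edge")
    out = np.zeros((T, F))
    for t in range(T):
        window = padded[t:t + K]                    # (K, F) temporal window
        out[t] = np.einsum("kf,kfg->g", window, W_conv)
    return np.maximum(out, 0.0)                     # ReLU -> temporal features

movie_strip = temporal_encode(phi, W_conv)          # (T, F)
# From each temporal feature, predict pose one step back, now, and one ahead.
preds = {d: movie_strip @ W for d, W in heads.items()}
```

Training the same feature to explain neighboring timesteps, not just the current one, is what forces it to encode dynamics rather than a single static pose.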
Methodology
Key to the paper's approach is the semi-supervised learning framework that allows for training on different levels of data granularity:
- Fully Labeled Data: The model benefits from sequences with complete 3D and 2D pose annotations.
- 2D Pose Labeled Data: Videos annotated with only 2D information contribute by minimizing reprojection errors.
- Unlabeled Data via Pseudo-ground Truth: The paper exploits the abundance of internet videos by incorporating pseudo-ground truth 2D annotations, using state-of-the-art pose detection.
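The three supervision tiers above amount to a per-sample loss that uses whichever annotations are available. The sketch below is a simplified assumption of that structure (the weights, function names, and toy orthographic projection are illustrative, not the paper's exact formulation):

```python
# Hedged sketch of mixed supervision: each training sample contributes
# only the loss terms its annotations allow. Weights and the projection
# model are assumptions for illustration.
import numpy as np

def l2(a, b):
    return float(np.mean((a - b) ** 2))

def project(joints3d):
    """Toy orthographic projection: drop the depth coordinate."""
    return joints3d[..., :2]

def sample_loss(pred3d, gt3d=None, gt2d=None, pseudo2d=None,
                w3d=1.0, w2d=1.0, wpseudo=0.5):
    loss = 0.0
    if gt3d is not None:           # fully labeled: direct 3D supervision
        loss += w3d * l2(pred3d, gt3d)
    if gt2d is not None:           # 2D-labeled: reprojection error
        loss += w2d * l2(project(pred3d), gt2d)
    if pseudo2d is not None:       # pseudo-GT from an off-the-shelf detector
        loss += wpseudo * l2(project(pred3d), pseudo2d)
    return loss

rng = np.random.default_rng(1)
pred = rng.normal(size=(14, 3))    # 14 predicted 3D joints
perfect = sample_loss(pred, gt3d=pred)   # identical 3D target -> zero loss
```

Down-weighting the pseudo-ground-truth term reflects that detector outputs are noisier than human annotations; the exact weighting is a design choice.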
The HMMR framework consists of a temporal encoder, 3D pose and shape regressors, and a hallucinator that predicts the temporal representation from a single static image, allowing dynamics to be estimated without video context. The model regularizes body shape to remain consistent across frames and employs an adversarial prior to keep predicted poses realistic.
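The hallucinator component can be sketched as a small network mapping a single frame's feature to a guess at the temporal feature, trained to match the temporal encoder's output. Names, sizes, and the two-layer MLP form here are assumptions for illustration:

```python
# Illustrative sketch: a "hallucinator" MLP maps one frame's feature to a
# predicted temporal ("movie strip") feature, so dynamics can be inferred
# from a static image at test time. Weights are random stand-ins.
import numpy as np

rng = np.random.default_rng(2)
F = 64
W1 = rng.normal(size=(F, 128)) * 0.05
W2 = rng.normal(size=(128, F)) * 0.05

def hallucinate(phi_single):
    h = np.maximum(phi_single @ W1, 0.0)   # hidden layer with ReLU
    return h @ W2                          # hallucinated temporal feature

phi = rng.normal(size=(F,))                # one frame's CNN feature
movie_strip_true = rng.normal(size=(F,))   # target from the temporal encoder
pred = hallucinate(phi)
consistency_loss = float(np.mean((pred - movie_strip_true) ** 2))
```

At training time a consistency loss of this kind ties the hallucinated feature to the video-derived one; at test time the hallucinator alone suffices for single-image inputs.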
Results and Observations
The paper demonstrates the efficacy of the proposed method on the 3D Poses in the Wild (3DPW) dataset, achieving state-of-the-art results without fine-tuning on that dataset. The results show significant reductions in acceleration error—a key metric for the temporal smoothness of predictions—compared to single-frame models. Moreover, performance improves consistently as more pseudo-ground-truth data is added, indicating the utility of unlabeled video for enhancing 3D pose estimation.
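The acceleration-error metric mentioned above can be computed from second finite differences of joint trajectories. A common formulation is sketched below; the exact normalization and units (e.g. mm/s²) used in the paper are not reproduced here, so treat the details as assumptions:

```python
# Sketch of an acceleration-error metric: mean L2 norm of the difference
# between the second finite differences of predicted and ground-truth
# joint trajectories. Shapes and normalization are illustrative.
import numpy as np

def accel(joints):
    """Second finite difference over time; joints has shape (T, J, 3)."""
    return joints[2:] - 2 * joints[1:-1] + joints[:-2]

def accel_error(pred, gt):
    """Mean per-joint norm of the acceleration difference."""
    return float(np.linalg.norm(accel(pred) - accel(gt), axis=-1).mean())

rng = np.random.default_rng(3)
gt = rng.normal(size=(10, 14, 3))                      # 10 frames, 14 joints
jittery = gt + rng.normal(scale=0.05, size=gt.shape)   # temporally noisy prediction
err = accel_error(jittery, gt)
```

Because the metric differentiates twice in time, frame-to-frame jitter that barely affects per-frame joint error inflates it sharply, which is why it is a good proxy for temporal smoothness.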
Implications and Future Work
This work has substantial implications for the advancement of video-based human motion capture techniques. In practical terms, the HMMR framework could enhance applications in domains such as animation, human-computer interaction, and sports analytics. Theoretically, the paper contributes to the understanding of leveraging unlabeled data via semi-supervised learning paradigms, particularly for high-dimensional tasks like 3D human motion prediction.
Future work could explore further refinement of dynamics prediction accuracy, enhancement of data labeling techniques, and adaptation to environments with multiple interacting subjects. Extensions may also consider integrating constraints to better address occlusions and partial visibility in crowded scenes. The approach highlights the growing potential of unlabeled data as a catalyst for advancing machine learning models in complex environments.
Overall, the paper provides a significant step forward in utilizing video data for 3D human dynamics learning, paving the way for more generalized and scalable human pose estimation systems.