Learning the Depths of Moving People by Watching Frozen People (1904.11111v1)

Published 25 Apr 2019 in cs.CV

Abstract: We present a method for predicting dense depth in scenarios where both a monocular camera and people in the scene are freely moving. Existing methods for recovering depth for dynamic, non-rigid objects from monocular video impose strong assumptions on the objects' motion and may only recover sparse depth. In this paper, we take a data-driven approach and learn human depth priors from a new source of data: thousands of Internet videos of people imitating mannequins, i.e., freezing in diverse, natural poses, while a hand-held camera tours the scene. Because people are stationary, training data can be generated using multi-view stereo reconstruction. At inference time, our method uses motion parallax cues from the static areas of the scenes to guide the depth prediction. We demonstrate our method on real-world sequences of complex human actions captured by a moving hand-held camera, show improvement over state-of-the-art monocular depth prediction methods, and show various 3D effects produced using our predicted depth.

Citations (234)

Summary

  • The paper introduces a novel method to predict depth for moving humans in monocular video by training on a new dataset derived from static scenes with stationary people.
  • The proposed deep learning model is trained on the MannequinChallenge (MC) dataset, leveraging multi-view stereo depth generated from 'frozen' video segments.
  • Quantitative and qualitative results demonstrate that the model surpasses state-of-the-art techniques, significantly improving depth accuracy for dynamic human subjects and enabling applications like augmented reality.

Learning the Depths of Moving People by Watching Frozen People

This paper presents a method for monocular depth prediction in scenarios where both the camera and the people in the scene are moving. The authors propose a data-driven approach that learns human depth priors from a new dataset, the MannequinChallenge (MC) dataset: Internet videos in which people hold mannequin-like stillness while a hand-held camera tours the scene. Because the people in these videos are effectively static, standard multi-view geometry applies to the entire frame, sidestepping the usual difficulty of obtaining depth supervision for moving, non-rigid subjects.

Methodology Overview

The core contribution of this research is a deep learning model trained on the MC dataset. Because the people in these thousands of Internet videos hold still, each scene is effectively rigid, so multi-view stereo (MVS) methods can estimate depth across the whole frame, people included. The authors use these MVS reconstructions as training data, enabling the model to learn to predict depth for people who are actually moving at inference time.

The input to the model includes (a minimal assembly sketch follows the list):

  • An RGB reference image.
  • A binary mask delineating human regions.
  • An initial depth map for the non-human areas, derived from motion parallax between two video frames.
  • A confidence map indicating where the parallax-based depth is reliable, since optical-flow noise and low-parallax regions degrade it.
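
Below is a minimal sketch of how these four inputs might be assembled into a single network input. The tensor names, the use of log depth, and the zeroing of human regions are illustrative assumptions, not the authors' exact preprocessing:

```python
# Hedged sketch: stacking the four inputs described above into one
# channel-wise tensor. Names and preprocessing are assumptions.
import torch

def assemble_input(rgb, human_mask, parallax_depth, confidence, eps=1e-6):
    """rgb: (3, H, W) in [0, 1]; human_mask: (1, H, W), 1 on people;
    parallax_depth: (1, H, W) two-frame motion-parallax depth;
    confidence: (1, H, W) reliability of that depth."""
    env_mask = 1.0 - human_mask               # keep only static regions
    # Log depth is a common parameterization; zeroing human pixels is an
    # assumption about how masked-out regions are encoded.
    log_depth = torch.log(parallax_depth.clamp(min=eps)) * env_mask
    conf = confidence * env_mask              # no parallax cue on people
    # Channel-wise concatenation: 3 + 1 + 1 + 1 = 6 input channels.
    return torch.cat([rgb, human_mask, log_depth, conf], dim=0)
```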

Training supervision comes from the MVS depth maps, filtered with stringent cleaning procedures to discard unreliable reconstructions.
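
Since MVS reconstructions are only defined up to scale, losses for this kind of supervision are typically computed in log-depth space with a scale-invariant term. Here is a minimal sketch of such a scale-invariant loss (Eigen et al.-style); any additional terms in the paper's full objective are not reproduced here:

```python
# Hedged sketch of a scale-invariant log-depth loss; the weighting and
# any extra terms in the paper's objective are not shown.
import torch

def scale_invariant_loss(pred_log_depth, mvs_depth, valid_mask, eps=1e-6):
    """valid_mask marks pixels that survived the MVS cleaning step."""
    gt_log = torch.log(mvs_depth.clamp(min=eps))
    diff = (pred_log_depth - gt_log) * valid_mask
    n = valid_mask.sum().clamp(min=1.0)
    # Subtracting the squared mean of the residual makes the loss
    # invariant to a global scale (an additive constant in log space).
    return (diff ** 2).sum() / n - (diff.sum() / n) ** 2
```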

Results and Discussion

Quantitative assessments on both the MC test set and the TUM RGB-D dataset show that the proposed model surpasses state-of-the-art depth prediction techniques. The full model, with all inputs, outperforms single-image and two-frame baseline configurations. In particular, it substantially improves depth accuracy both on human subjects and on their surroundings, and it keeps the depth of people consistent with that of the environment.
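
Since the comparisons distinguish error on people from error on the rest of the scene, a natural way to report them is a masked error computed after aligning the prediction's global scale. A minimal sketch follows; the metric choice and median scale alignment are assumptions, not necessarily the paper's exact protocol:

```python
# Hedged sketch: per-region depth error after a global scale alignment.
import numpy as np

def masked_rmse(pred, gt, mask):
    """pred, gt: (H, W) depth maps; mask: boolean (H, W) selecting,
    e.g., human pixels or environment pixels."""
    scale = np.median(gt[mask]) / np.median(pred[mask])  # align global scale
    d = (pred * scale - gt)[mask]
    return np.sqrt((d ** 2).mean())
```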

Qualitative evaluations on challenging sequences involving complex human actions demonstrate the model's capacity to predict plausible depth and to produce visually compelling 3D effects such as synthetic depth-of-field, depth-aware inpainting, and correct occlusion handling when inserting virtual objects into scenes.
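
The occlusion effect follows directly from a per-pixel depth test against the predicted map. A minimal compositing sketch, assuming a pre-rendered virtual object with its own depth and alpha (array names and conventions are illustrative):

```python
# Hedged sketch: occlusion-aware insertion of a virtual object using
# the predicted scene depth. Shapes and conventions are assumptions.
import numpy as np

def composite(frame, scene_depth, obj_rgba, obj_depth):
    """frame: (H, W, 3) floats in [0, 1]; scene_depth: (H, W) predicted
    depth; obj_rgba: (H, W, 4) rendered object; obj_depth: (H, W),
    np.inf where the object is absent."""
    visible = obj_depth < scene_depth                  # object in front
    alpha = obj_rgba[..., 3:] * visible[..., None]     # hide occluded pixels
    return (1.0 - alpha) * frame + alpha * obj_rgba[..., :3]
```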

Implications and Future Work

This work pushes monocular depth prediction beyond the usual rigidity assumptions, improving generalization to scenes with dynamic human activity. The MC dataset also gives the research community a new resource for training and evaluating depth perception models on such scenes.

Practical applications include augmented reality, where understanding the spatial layout of environments containing moving people is crucial. The method's current reliance on known camera poses is a limitation that advances in robust visual-inertial odometry could address. Incorporating multi-frame information could also improve temporal coherence across a video, stabilizing the depth predictions.

Future developments might also explore generalization to other domains with non-human dynamic scenes, potentially through synthetic video generation or more diverse real-world datasets. As the dataset and models are made available to the research community, further experimentation and improvements in depth prediction under dynamic conditions are anticipated.
