MultiPly: Reconstruction of Multiple People from Monocular Video in the Wild (2406.01595v1)

Published 3 Jun 2024 in cs.CV

Abstract: We present MultiPly, a novel framework to reconstruct multiple people in 3D from monocular in-the-wild videos. Reconstructing multiple individuals moving and interacting naturally from monocular in-the-wild videos poses a challenging task. Addressing it necessitates precise pixel-level disentanglement of individuals without any prior knowledge about the subjects. Moreover, it requires recovering intricate and complete 3D human shapes from short video sequences, intensifying the level of difficulty. To tackle these challenges, we first define a layered neural representation for the entire scene, composited by individual human and background models. We learn the layered neural representation from videos via our layer-wise differentiable volume rendering. This learning process is further enhanced by our hybrid instance segmentation approach which combines the self-supervised 3D segmentation and the promptable 2D segmentation module, yielding reliable instance segmentation supervision even under close human interaction. A confidence-guided optimization formulation is introduced to optimize the human poses and shape/appearance alternately. We incorporate effective objectives to refine human poses via photometric information and impose physically plausible constraints on human dynamics, leading to temporally consistent 3D reconstructions with high fidelity. The evaluation of our method shows the superiority over prior art on publicly available datasets and in-the-wild videos.

Summary

The paper introduces MultiPly, a framework that reconstructs detailed 3D human models from single-view videos using layered neural representations.
The paper details a hybrid instance segmentation and confidence-guided optimization strategy to robustly isolate and refine individual reconstructions in complex scenes.
The paper demonstrates superior performance through extensive evaluations, paving the way for practical applications in AR, VR, and telepresence.

Overview of "MultiPly: Reconstruction of Multiple People from Monocular Video in the Wild"

This paper introduces MultiPly, a framework for reconstructing multiple people in 3D from monocular videos captured in unconstrained, natural environments. Its primary contribution is an approach that overcomes the limitations of existing systems, which either require multi-view capture setups or handle only a single performer.

By employing a layered neural representation, MultiPly separates the individuals in a scene without any prior knowledge about the subjects. The paper shows how hybrid instance segmentation and photometric pose refinement together yield coherent, high-fidelity 3D reconstructions despite occlusions and close interactions. The authors report that MultiPly outperforms existing methods in quantitative evaluations on publicly available datasets and on in-the-wild videos.

Core Methodologies

The MultiPly framework is built upon four key innovations:

Unified Temporal Representation: The framework constructs a consistent representation of human shape and texture that remains applicable throughout the video sequence. This allows for a seamless integration of partial observations into a coherent space, ensuring a comprehensive representation of human bodies.
Layered Neural Representation: The central component of the framework involves a layered neural decomposition of the scene into individual neural fields for each person and the background. This decomposition process builds upon the potential of neural implicit functions, accommodating complex scenes with minimal a priori geometric constraints.
Hybrid Instance Segmentation: The proposed segmentation approach leverages both self-supervised 3D scene decomposition and a prompt-driven 2D segmentation module. This hybrid method ensures a robust delineation between closely interacting individuals, overcoming the segmentation challenges posed by severe occlusions.
Confidence-Guided Optimization: MultiPly alternates between two steps: refining the human poses, and optimizing shape and appearance. The alternation is guided by photometric feedback and by confidence measures derived from the instance masks.
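
To make the layered representation more concrete, the following is a minimal sketch of how samples from several per-person fields could be composited along a single camera ray with standard volume rendering. It is not the authors' implementation: the function name `composite_ray`, the tuple layout of a sample, the layer labels, and the treatment of the background as a single color behind all people are assumptions made for illustration. The per-layer accumulated weights show one plausible way a rendered instance mask could be read off such a layered model.

```python
import math
from collections import defaultdict


def composite_ray(samples, background_rgb=(0.0, 0.0, 0.0)):
    """Composite samples from several layers along one ray (illustrative only).

    samples: list of (depth, delta, sigma, rgb, layer) tuples, where
      depth  is the distance of the sample from the camera,
      delta  is the length of the ray segment the sample covers,
      sigma  is the volume density predicted by that layer's field,
      rgb    is the predicted color as a 3-tuple,
      layer  is a hypothetical label such as "person_0".
    Returns the pixel color and the accumulated opacity of every layer.
    """
    # Merge the samples of all layers and process them front to back.
    ordered = sorted(samples, key=lambda s: s[0])

    transmittance = 1.0           # fraction of light not yet absorbed
    color = [0.0, 0.0, 0.0]
    layer_opacity = defaultdict(float)

    for _depth, delta, sigma, rgb, layer in ordered:
        alpha = 1.0 - math.exp(-sigma * delta)   # opacity of this segment
        weight = transmittance * alpha           # contribution to the pixel
        for k in range(3):
            color[k] += weight * rgb[k]
        layer_opacity[layer] += weight
        transmittance *= 1.0 - alpha

    # Assumption: whatever light remains is attributed to the background.
    for k in range(3):
        color[k] += transmittance * background_rgb[k]
    layer_opacity["background"] += transmittance

    return color, dict(layer_opacity)


# Usage: person_0 stands in front of person_1 on this ray.
ray = [
    (2.0, 0.1, 50.0, (0.0, 0.0, 1.0), "person_1"),
    (1.0, 0.1, 50.0, (1.0, 0.0, 0.0), "person_0"),
]
pixel, opacity = composite_ray(ray, background_rgb=(0.0, 1.0, 0.0))
print(pixel, opacity)
```

Because the weights are computed front to back, the nearer person absorbs most of the ray and the occluded person receives almost no weight. The per-layer opacities, including the background term, sum to one, so they can be compared directly with a 2D instance mask for the same pixel.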

Contributions and Implications

The research presented in this paper advances the state-of-the-art in several significant ways:

Automatic Reconstruction: The capability to autonomously reconstruct detailed 3D human models from single-view video without supervised 3D scans marks a substantial step forward in making 3D modeling accessible and feasible for broader applications.
Practical Applications: The method reconstructs dynamic scenes rather than static ones, and the resulting 3D humans could support applications in augmented reality (AR), virtual reality (VR), and telepresence.
Human Modeling: The refined approach to reconstructing multiple individuals with intricate interactions suggests a broader scope for exploiting neural implicit functions across diverse application areas involving dynamic human actions.

Future Directions

This paper paves the way for future investigations into refining neural representations, expanding the framework's applicability to denser crowds, and integrating more nuanced human models like hands and facial details. As neural implicit fields and machine learning methods evolve, additional efforts could further enhance the fidelity and efficiency of human reconstruction in complex environments. Furthermore, the framework has the potential to scale toward novel rendering technologies, incorporating elements such as dynamic lighting and intricate environmental interactions.

In summary, MultiPly represents a significant milestone in 3D human reconstruction, providing a robust method for reconstructing multiple individuals from a monocular video while maintaining high fidelity even in visually challenging situations. This research could catalyze a more sophisticated understanding of human dynamics and interactions in natural settings.
