Abstract

Reconstructing 3D humans from a single image has been extensively investigated. However, existing approaches often fall short in capturing fine geometry and appearance details, hallucinating occluded parts with plausible details, and generalizing across unseen and in-the-wild datasets. We present Human-LRM, a diffusion-guided feed-forward model that predicts the implicit field of a human from a single image. Leveraging the power of a state-of-the-art reconstruction model (i.e., LRM) and generative model (i.e., Stable Diffusion), our method is able to capture a human without any template prior, e.g., SMPL, and effectively enhance occluded parts with rich and realistic details. Our approach first uses a single-view LRM model with an enhanced geometry decoder to obtain a triplane NeRF representation. Novel-view renderings from the triplane NeRF provide a strong geometry and color prior, from which we generate photorealistic details for the occluded parts using a diffusion model. The generated multiple views then enable reconstruction with high-quality geometry and appearance, leading to superior overall performance compared to all existing human reconstruction methods.

Overview

  • The paper introduces Human-LRM, a novel template-free 3D human digitalization method using single-view images.

  • Human-LRM is trained with multi-view capture and 3D scans for improved adaptability and generalization.

  • The model uses Neural Radiance Fields (NeRF) and a dual-stage SDF-MLP and RGB-MLP process to predict geometry and color.

  • A generative conditional diffusion model is included to handle partially observed images and occlusions.

  • Human-LRM surpasses previous methods in creating realistic digital humans, with potential applications in AR, VR, and digital content creation.

Introduction

Computer vision research has advanced significantly in the realm of reconstructing 3D human models from 2D images. This development holds immense potential for applications across augmented reality (AR), virtual reality (VR), digital asset creation, and relighting. Traditional methods, however, have notable limitations, particularly in producing detailed human representations that include clothing.

Method Overview

The paper presents Human-LRM, a pioneering, template-free approach to large-scale, feed-forward 3D human digitalization from a single image. The model, trained extensively on multi-view captures and 3D scans, delivers improved adaptability and generalizability compared to prior methods. It leverages Neural Radiance Fields (NeRF) for rendering and predicts geometry and appearance directly from the input image. Critically, the approach introduces a novel training strategy that distills multi-view reconstruction knowledge into the single-view setting, enabling it to reconstruct complete human bodies even from partially observed images.
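
Read as data flow, the pipeline is a chain of four stages. The sketch below mirrors that chain with hypothetical stub functions and assumed tensor shapes; the real stages are large learned networks, and none of these names come from the paper.

```python
import torch

# Hypothetical stubs mirroring the described pipeline; the real stages are
# large learned networks (an LRM transformer, a NeRF renderer, a diffusion
# model). All shapes are illustrative assumptions.

def single_view_lrm(image):                 # image -> triplane NeRF
    return torch.randn(3, 32, 64, 64)       # 3 planes x 32 channels x 64x64

def render_novel_views(triplane, n=4):      # coarse geometry/color prior
    return torch.rand(n, 3, 256, 256)

def diffusion_enhance(views, image):        # add detail to occluded parts
    return views

def multi_view_reconstruct(views):          # fuse views into final geometry
    return {"views": views}

image = torch.rand(3, 256, 256)
triplane = single_view_lrm(image)
result = multi_view_reconstruct(diffusion_enhance(render_novel_views(triplane), image))
```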

Technical Contributions

Human-LRM's key contributions lie in its capacity to generate surfaces with enhanced fidelity. This is achieved by a dual-stage process involving an SDF-MLP (a Signed Distance Function Multi-Layer Perceptron) and an RGB-MLP. The SDF-MLP predicts SDF values and latent vectors from triplane-queried features, while the RGB-MLP predicts color values. Supervision with normal and depth maps further refines the quality of the geometric predictions.
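
A minimal PyTorch sketch of such a dual-stage decoder is shown below. It assumes concatenated bilinear triplane sampling and view-direction-conditioned color; layer widths, the feature-fusion choice, and the interface are illustrative assumptions rather than the paper's exact configuration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TriplaneSDFDecoder(nn.Module):
    """Sketch of a dual-stage SDF/RGB decoder: the SDF-MLP maps
    triplane-queried features to an SDF value plus a latent vector,
    and the RGB-MLP maps that latent (plus view direction) to color.
    All sizes are illustrative, not the paper's configuration."""

    def __init__(self, feat_dim=32, hidden=64, latent_dim=16):
        super().__init__()
        self.sdf_mlp = nn.Sequential(
            nn.Linear(3 * feat_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, 1 + latent_dim),             # SDF + latent
        )
        self.rgb_mlp = nn.Sequential(
            nn.Linear(latent_dim + 3, hidden), nn.ReLU(),  # latent + view dir
            nn.Linear(hidden, 3), nn.Sigmoid(),            # RGB in [0, 1]
        )

    def query_triplane(self, planes, xyz):
        # planes: (3, C, H, W); xyz: (N, 3) in [-1, 1]
        feats = []
        for i, (a, b) in enumerate([(0, 1), (0, 2), (1, 2)]):  # xy, xz, yz
            grid = xyz[:, [a, b]].view(1, -1, 1, 2)
            f = F.grid_sample(planes[i : i + 1], grid, align_corners=True)
            feats.append(f[0, :, :, 0].t())                # (N, C)
        return torch.cat(feats, dim=-1)                    # (N, 3C)

    def forward(self, planes, xyz, view_dir):
        h = self.sdf_mlp(self.query_triplane(planes, xyz))
        sdf, latent = h[:, :1], h[:, 1:]
        rgb = self.rgb_mlp(torch.cat([latent, view_dir], dim=-1))
        return sdf, rgb

# Example: query 1024 points from a random triplane.
dec = TriplaneSDFDecoder()
sdf, rgb = dec(torch.randn(3, 32, 64, 64),
               torch.rand(1024, 3) * 2 - 1,
               F.normalize(torch.randn(1024, 3), dim=-1))
```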

Additionally, the paper proposes a generative component based on a conditional diffusion model. Training first fits a multi-view model that captures a near-perfect triplane representation of each subject, then distills this knowledge into the single-view model. This generative enhancement over purely deterministic models allows Human-LRM to produce plausible human geometry conditioned solely on partial views.
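
In spirit, the generative step is standard conditional denoising: the coarse novel-view rendering conditions a diffusion model that restores detail in occluded regions. The toy training step below illustrates the idea; the single convolution stands in for a real diffusion U-Net, and the concatenation-based conditioning and fixed noise level are simplifying assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# A single conv layer standing in for a diffusion U-Net (assumption).
denoiser = nn.Conv2d(3 + 3, 3, kernel_size=3, padding=1)  # noisy view + coarse render

def diffusion_training_step(gt_view, coarse_render, sigma=0.5):
    """One epsilon-prediction step at a fixed noise level (simplified;
    a real model samples timesteps from a full noise schedule)."""
    noise = torch.randn_like(gt_view)
    noisy = gt_view + sigma * noise
    pred = denoiser(torch.cat([noisy, coarse_render], dim=1))
    return F.mse_loss(pred, noise)

loss = diffusion_training_step(torch.rand(1, 3, 64, 64), torch.rand(1, 3, 64, 64))
```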

Evaluation and Discussion

Human-LRM markedly surpasses previous methods in experimental evaluations across several benchmarks. It consistently outperforms both parametric and implicit reconstruction methods, and excels even in in-the-wild scenarios with occlusions. The model's adoption of triplane features and NeRF, replacing the conventional density field with an SDF, proves critical for rendering quality. The trainable diffusion model contributes most under heavy occlusion, providing credible, complete reconstructions from single-view inputs.
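
The SDF-for-density substitution typically works by mapping signed distance to volume density before standard volume rendering. One common formulation is VolSDF's Laplace-CDF mapping, sketched below; whether Human-LRM uses this exact variant (versus, e.g., NeuS) is not specified here, and the alpha/beta values are illustrative.

```python
import torch

def sdf_to_density(sdf, alpha=10.0, beta=0.1):
    """VolSDF-style density: alpha * LaplaceCDF(-sdf / beta). Density
    approaches alpha inside the surface (sdf < 0) and decays outside,
    so volume rendering concentrates weight at the zero level set."""
    return alpha * torch.where(
        sdf <= 0,
        1.0 - 0.5 * torch.exp(sdf / beta),   # inside: high density
        0.5 * torch.exp(-sdf / beta),        # outside: decaying density
    )
```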

The scalability and generalizability of Human-LRM were critically evaluated by supervising training with both ground-truth and estimated normal and depth maps. The findings suggest that while estimated maps give satisfactory results, supervision with ground-truth maps yields better surface details. Training on larger datasets further enhances the model's performance.
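
Concretely, such supervision amounts to a weighted sum of photometric, normal, and depth terms. The sketch below shows one plausible combination; the specific distance functions and weights are assumptions, not the paper's reported loss.

```python
import torch
import torch.nn.functional as F

def reconstruction_loss(pred_rgb, pred_normal, pred_depth,
                        gt_rgb, gt_normal, gt_depth,
                        w_normal=0.5, w_depth=0.5):
    """Joint RGB + normal + depth supervision (illustrative weights).
    gt_normal / gt_depth may come from scans (ground truth) or from an
    off-the-shelf estimator, matching the ablation described above."""
    loss_rgb = F.mse_loss(pred_rgb, gt_rgb)
    # 1 - cosine similarity penalizes angular error between normals
    loss_normal = (1.0 - F.cosine_similarity(pred_normal, gt_normal, dim=-1)).mean()
    loss_depth = F.l1_loss(pred_depth, gt_depth)
    return loss_rgb + w_normal * loss_normal + w_depth * loss_depth
```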

Conclusion

The introduction of Human-LRM represents a significant stride forward in the ability to create realistic, detailed digital humans from single images. By meticulously addressing the shortcomings of existing methodologies and building a scalable, adaptable system, Human-LRM sets a new standard for what can be achieved in single-view 3D human digitalization. As this technology continues to develop, it holds promise for a myriad of real-world applications where the digital human form is central.
