- The paper introduces Human-LRM, a diffusion-guided, template-free method for reconstructing high-fidelity 3D human models from single images.
- It combines an SDF-MLP and an RGB-MLP with triplane features and NeRF-style volume rendering to improve surface fidelity and handle occlusions.
- Experimental evaluations show that Human-LRM outperforms prior methods in scalability, accuracy, and in-the-wild reconstruction quality.
Introduction
Computer vision research has advanced significantly in reconstructing 3D human models from 2D images. This progress holds immense potential for applications in augmented reality (AR), virtual reality (VR), digital asset creation, and relighting. Traditional methods, however, have notable limitations, particularly in producing detailed human representations that include clothing.
Method Overview
The paper presents Human-LRM, a pioneering, template-free approach to large-scale, feed-forward 3D human digitalization from a single image. Trained on extensive multi-view captures and 3D scans, the model delivers improved adaptability and generalizability compared to prior methods. It predicts geometry and appearance directly from the input image and renders them with Neural Radiance Fields (NeRF). Critically, the approach introduces a novel training strategy that distills multi-view reconstruction knowledge into a single-view model, enabling it to reconstruct complete bodies even from partially observed images. A minimal sketch of the feed-forward front end appears below.
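To make the pipeline concrete, here is a minimal PyTorch sketch of the feed-forward front end: a single image is encoded into three feature planes that later MLP heads query for geometry and color. The backbone, `ImageToTriplane`, `feat_dim`, and `plane_res` are illustrative assumptions, not the paper's architecture (which is far larger).

```python
import torch
import torch.nn as nn

class ImageToTriplane(nn.Module):
    """Sketch: encode one RGB image into a triplane (XY, XZ, YZ feature
    planes). Layer choices are placeholders, not the paper's design."""

    def __init__(self, feat_dim=32, plane_res=64):
        super().__init__()
        self.feat_dim, self.plane_res = feat_dim, plane_res
        self.backbone = nn.Sequential(
            nn.Conv2d(3, 64, kernel_size=4, stride=4), nn.ReLU(),
            nn.AdaptiveAvgPool2d(8), nn.Flatten(),
            nn.Linear(64 * 8 * 8, 3 * feat_dim * plane_res * plane_res),
        )

    def forward(self, image):  # image: (B, 3, H, W)
        flat = self.backbone(image)
        # Triplane tensor: (B, 3 planes, C channels, R, R).
        return flat.view(-1, 3, self.feat_dim, self.plane_res, self.plane_res)
```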
Technical Contributions
Human-LRM's key contribution lies in its capacity to generate surfaces with enhanced fidelity. This is achieved with a two-step design involving an SDF-MLP (a multi-layer perceptron predicting a signed distance function) and an RGB-MLP. The SDF-MLP predicts a signed distance and a latent vector from triplane-queried features, while the RGB-MLP predicts color values. Supervision with normal and depth maps further refines geometric quality. The sketch below illustrates the triplane query and the two MLP heads.
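The following sketch shows how such a design can be wired up: triplane features are bilinearly sampled at 3D query points, the SDF-MLP returns a signed distance plus a latent vector, and the RGB-MLP consumes that latent alongside the features. All module names and layer sizes are assumptions for illustration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def query_triplane(planes, points):
    """Sample triplane features at 3D points and sum the plane
    contributions. planes: (B, 3, C, R, R); points: (B, N, 3) in
    [-1, 1]. Returns (B, N, C)."""
    b = planes.shape[0]
    # Project each 3D point onto the XY, XZ, and YZ planes.
    coords = [points[..., [0, 1]], points[..., [0, 2]], points[..., [1, 2]]]
    feats = 0
    for i, uv in enumerate(coords):
        grid = uv.view(b, -1, 1, 2)                       # (B, N, 1, 2)
        sampled = F.grid_sample(planes[:, i], grid,       # (B, C, N, 1)
                                mode="bilinear", align_corners=True)
        feats = feats + sampled.squeeze(-1).transpose(1, 2)  # (B, N, C)
    return feats

class SDFMLP(nn.Module):
    """Predicts a signed distance and a latent geometry vector."""
    def __init__(self, feat_dim=32, latent_dim=64):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(feat_dim, 128), nn.ReLU(),
                                 nn.Linear(128, 1 + latent_dim))
    def forward(self, feats):
        out = self.net(feats)
        return out[..., :1], out[..., 1:]                 # sdf, latent

class RGBMLP(nn.Module):
    """Predicts per-point color from features plus the SDF latent."""
    def __init__(self, feat_dim=32, latent_dim=64):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(feat_dim + latent_dim, 128),
                                 nn.ReLU(), nn.Linear(128, 3), nn.Sigmoid())
    def forward(self, feats, latent):
        return self.net(torch.cat([feats, latent], dim=-1))
```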
Additionally, a generative component grounded in a conditional diffusion model is proposed. Training first fits a multi-view model that captures a near-perfect triplane representation of each subject, then distills this knowledge into a single-view model via the conditional diffusion prior. This enhancement over purely deterministic models allows Human-LRM to generate plausible human geometries conditioned solely on partial views; a sketch of one such denoising training step follows.
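As a hedged illustration, one DDPM-style training step on triplanes might look like the following. `student_denoiser`, the linear beta schedule, and epsilon prediction are generic diffusion choices assumed for the sketch, not confirmed details of the paper.

```python
import torch
import torch.nn.functional as F

def diffusion_distill_step(student_denoiser, teacher_triplane, cond_feats,
                           num_steps=1000):
    """One conditional-diffusion training step on triplanes. The 'clean'
    target is the teacher's multi-view triplane (B, 3, C, R, R);
    conditioning comes from single-view image features."""
    b = teacher_triplane.shape[0]
    device = teacher_triplane.device
    t = torch.randint(0, num_steps, (b,), device=device)
    # Linear beta schedule -> cumulative alpha_bar at the sampled steps.
    betas = torch.linspace(1e-4, 0.02, num_steps, device=device)
    alpha_bar = torch.cumprod(1.0 - betas, dim=0)[t].view(b, 1, 1, 1, 1)
    noise = torch.randn_like(teacher_triplane)
    noisy = alpha_bar.sqrt() * teacher_triplane + (1 - alpha_bar).sqrt() * noise
    pred = student_denoiser(noisy, t, cond_feats)   # predicts the added noise
    return F.mse_loss(pred, noise)
```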
Evaluation and Discussion
Human-LRM markedly surpasses previous methods in experimental evaluations across several benchmarks. It consistently outperforms both parametric and implicit reconstruction methods, and it excels even in in-the-wild scenarios with occlusions. The model's use of triplane features and NeRF, with the traditional density function replaced by an SDF, proves critical for rendering quality. The diffusion component contributes most under heavy occlusion, producing plausible, complete reconstructions from single-view inputs. A common SDF-to-density conversion used in such renderers is sketched below.
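One widely used way to substitute an SDF for raw density in volume rendering is the VolSDF-style Laplace-CDF transform sketched below; whether Human-LRM uses exactly this form is an assumption. Density peaks near the zero level set of the SDF, which concentrates rendering weight on the surface.

```python
import torch

def sdf_to_density(sdf, beta=0.1):
    """Convert signed distances to volume density via a Laplace CDF
    (VolSDF-style). Smaller beta sharpens the surface; the value here
    is illustrative."""
    alpha = 1.0 / beta
    s = -sdf  # positive inside the surface by the usual sign convention
    return alpha * torch.where(
        s <= 0,
        0.5 * torch.exp(s / beta),
        1.0 - 0.5 * torch.exp(-s / beta),
    )
```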
The scalability and generalizability of Human-LRM were evaluated by supervising training with both ground-truth and estimated normal and depth maps. The findings suggest that while estimated maps give satisfactory results, ground-truth supervision yields better surface detail. Training on larger datasets further improves performance. A sketch of such an auxiliary geometry loss follows.
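A plausible form of that supervision is an auxiliary loss on rendered normal and depth maps, where the targets come either from scans (ground truth) or from an off-the-shelf estimator. The loss terms and weights below are assumptions for illustration, not the paper's exact objective.

```python
import torch
import torch.nn.functional as F

def geometry_supervision_loss(pred_normal, pred_depth, gt_normal, gt_depth,
                              w_normal=1.0, w_depth=1.0):
    """Auxiliary geometry losses on rendered maps. Normals are (B, 3, H, W)
    unit vectors; depths are (B, 1, H, W). Weights are illustrative."""
    # Cosine distance on unit normals, L1 on depth.
    normal_loss = (1.0 - F.cosine_similarity(pred_normal, gt_normal, dim=1)).mean()
    depth_loss = F.l1_loss(pred_depth, gt_depth)
    return w_normal * normal_loss + w_depth * depth_loss
```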
Conclusion
The introduction of Human-LRM represents a significant stride in creating realistic, detailed digital humans from single images. By addressing the shortcomings of existing methods and building a scalable, adaptable system, Human-LRM sets a new standard for single-view 3D human digitalization. As the technology matures, it holds promise for a wide range of real-world applications where the digital human form is central.