- The paper introduces Human-LRM, a diffusion-guided, template-free method for reconstructing high-fidelity 3D human models from single images.
- It combines an SDF-MLP and an RGB-MLP with triplane features and NeRF-style volume rendering to improve surface fidelity and handle occlusions.
- Experimental evaluations show that Human-LRM outperforms prior methods in scalability, accuracy, and in-the-wild reconstruction quality.
Introduction
Computer vision research has advanced significantly in reconstructing 3D human models from 2D images. This progress holds immense potential for applications in augmented reality (AR), virtual reality (VR), digital asset creation, and relighting. Traditional methods, however, have notable limitations, particularly in producing detailed human representations that include clothing.
Method Overview
The paper presents Human-LRM, a pioneering, template-free approach to large-scale, feed-forward 3D human digitalization from a single image. Trained on extensive multi-view captures and 3D scans, the model delivers improved adaptability and generalizability compared to prior methods. It predicts geometry and appearance directly from the input image and renders them with Neural Radiance Fields (NeRF). Critically, the approach introduces a novel training strategy that distills multi-view reconstruction knowledge into a single-view model, enabling it to reconstruct complete bodies even from partially observed images. A minimal sketch of the feed-forward front end appears below.
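To make the pipeline concrete, here is a minimal PyTorch sketch of the feed-forward front end: a single image is encoded into three feature planes that later MLP heads query for geometry and color. The backbone, `ImageToTriplane`, `feat_dim`, and `plane_res` are illustrative assumptions, not the paper's architecture (which is far larger).

```python
import torch
import torch.nn as nn

class ImageToTriplane(nn.Module):
    """Sketch: encode one RGB image into a triplane (XY, XZ, YZ feature
    planes). Layer choices are placeholders, not the paper's design."""

    def __init__(self, feat_dim=32, plane_res=64):
        super().__init__()
        self.feat_dim, self.plane_res = feat_dim, plane_res
        self.backbone = nn.Sequential(
            nn.Conv2d(3, 64, kernel_size=4, stride=4), nn.ReLU(),
            nn.AdaptiveAvgPool2d(8), nn.Flatten(),
            nn.Linear(64 * 8 * 8, 3 * feat_dim * plane_res * plane_res),
        )

    def forward(self, image):  # image: (B, 3, H, W)
        flat = self.backbone(image)
        # Triplane tensor: (B, 3 planes, C channels, R, R).
        return flat.view(-1, 3, self.feat_dim, self.plane_res, self.plane_res)
```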
Technical Contributions
Human-LRM's key contribution lies in its capacity to generate surfaces with enhanced fidelity. This is achieved with a two-step design involving an SDF-MLP (a multi-layer perceptron predicting a signed distance function) and an RGB-MLP. The SDF-MLP predicts a signed distance and a latent vector from triplane-queried features, while the RGB-MLP predicts color values. Supervision with normal and depth maps further refines geometric quality. The sketch below illustrates the triplane query and the two MLP heads.
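The following sketch shows how such a design can be wired up: triplane features are bilinearly sampled at 3D query points, the SDF-MLP returns a signed distance plus a latent vector, and the RGB-MLP consumes that latent alongside the features. All module names and layer sizes are assumptions for illustration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def query_triplane(planes, points):
    """Sample triplane features at 3D points and sum the plane
    contributions. planes: (B, 3, C, R, R); points: (B, N, 3) in
    [-1, 1]. Returns (B, N, C)."""
    b = planes.shape[0]
    # Project each 3D point onto the XY, XZ, and YZ planes.
    coords = [points[..., [0, 1]], points[..., [0, 2]], points[..., [1, 2]]]
    feats = 0
    for i, uv in enumerate(coords):
        grid = uv.view(b, -1, 1, 2)                       # (B, N, 1, 2)
        sampled = F.grid_sample(planes[:, i], grid,       # (B, C, N, 1)
                                mode="bilinear", align_corners=True)
        feats = feats + sampled.squeeze(-1).transpose(1, 2)  # (B, N, C)
    return feats

class SDFMLP(nn.Module):
    """Predicts a signed distance and a latent geometry vector."""
    def __init__(self, feat_dim=32, latent_dim=64):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(feat_dim, 128), nn.ReLU(),
                                 nn.Linear(128, 1 + latent_dim))
    def forward(self, feats):
        out = self.net(feats)
        return out[..., :1], out[..., 1:]                 # sdf, latent

class RGBMLP(nn.Module):
    """Predicts per-point color from features plus the SDF latent."""
    def __init__(self, feat_dim=32, latent_dim=64):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(feat_dim + latent_dim, 128),
                                 nn.ReLU(), nn.Linear(128, 3), nn.Sigmoid())
    def forward(self, feats, latent):
        return self.net(torch.cat([feats, latent], dim=-1))
```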
Additionally, a generative component grounded in a conditional diffusion model is proposed. Training first fits a multi-view model that captures a near-perfect triplane representation of each subject, then distills this knowledge into a single-view model via the conditional diffusion prior. This enhancement over purely deterministic models allows Human-LRM to generate plausible human geometries conditioned solely on partial views; a sketch of one such denoising training step follows.
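As a hedged illustration, one DDPM-style training step on triplanes might look like the following. `student_denoiser`, the linear beta schedule, and epsilon prediction are generic diffusion choices assumed for the sketch, not confirmed details of the paper.

```python
import torch
import torch.nn.functional as F

def diffusion_distill_step(student_denoiser, teacher_triplane, cond_feats,
                           num_steps=1000):
    """One conditional-diffusion training step on triplanes. The 'clean'
    target is the teacher's multi-view triplane (B, 3, C, R, R);
    conditioning comes from single-view image features."""
    b = teacher_triplane.shape[0]
    device = teacher_triplane.device
    t = torch.randint(0, num_steps, (b,), device=device)
    # Linear beta schedule -> cumulative alpha_bar at the sampled steps.
    betas = torch.linspace(1e-4, 0.02, num_steps, device=device)
    alpha_bar = torch.cumprod(1.0 - betas, dim=0)[t].view(b, 1, 1, 1, 1)
    noise = torch.randn_like(teacher_triplane)
    noisy = alpha_bar.sqrt() * teacher_triplane + (1 - alpha_bar).sqrt() * noise
    pred = student_denoiser(noisy, t, cond_feats)   # predicts the added noise
    return F.mse_loss(pred, noise)
```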
Evaluation and Discussion
Human-LRM markedly surpasses previous methods in experimental evaluations across several benchmarks. It consistently outperforms both parametric and implicit reconstruction methods, and it excels even in in-the-wild scenarios with occlusions. The model's use of triplane features and NeRF, with the traditional density function replaced by an SDF, proves critical for rendering quality. The diffusion component contributes most under heavy occlusion, producing plausible, complete reconstructions from single-view inputs. A common SDF-to-density conversion used in such renderers is sketched below.
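One widely used way to substitute an SDF for raw density in volume rendering is the VolSDF-style Laplace-CDF transform sketched below; whether Human-LRM uses exactly this form is an assumption. Density peaks near the zero level set of the SDF, which concentrates rendering weight on the surface.

```python
import torch

def sdf_to_density(sdf, beta=0.1):
    """Convert signed distances to volume density via a Laplace CDF
    (VolSDF-style). Smaller beta sharpens the surface; the value here
    is illustrative."""
    alpha = 1.0 / beta
    s = -sdf  # positive inside the surface by the usual sign convention
    return alpha * torch.where(
        s <= 0,
        0.5 * torch.exp(s / beta),
        1.0 - 0.5 * torch.exp(-s / beta),
    )
```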
The scalability and generalizability of Human-LRM were evaluated by supervising training with both ground-truth and estimated normal and depth maps. The findings suggest that while estimated maps give satisfactory results, ground-truth supervision yields better surface detail. Training on larger datasets further improves performance. A sketch of such an auxiliary geometry loss follows.
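A plausible form of that supervision is an auxiliary loss on rendered normal and depth maps, where the targets come either from scans (ground truth) or from an off-the-shelf estimator. The loss terms and weights below are assumptions for illustration, not the paper's exact objective.

```python
import torch
import torch.nn.functional as F

def geometry_supervision_loss(pred_normal, pred_depth, gt_normal, gt_depth,
                              w_normal=1.0, w_depth=1.0):
    """Auxiliary geometry losses on rendered maps. Normals are (B, 3, H, W)
    unit vectors; depths are (B, 1, H, W). Weights are illustrative."""
    # Cosine distance on unit normals, L1 on depth.
    normal_loss = (1.0 - F.cosine_similarity(pred_normal, gt_normal, dim=1)).mean()
    depth_loss = F.l1_loss(pred_depth, gt_depth)
    return w_normal * normal_loss + w_depth * depth_loss
```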
Conclusion
The introduction of Human-LRM represents a significant stride in creating realistic, detailed digital humans from single images. By addressing the shortcomings of existing methods and building a scalable, adaptable system, Human-LRM sets a new standard for single-view 3D human digitalization. As the technology matures, it holds promise for a wide range of real-world applications where the digital human form is central.