
Abstract

Recent advances in 3D avatar generation have attracted significant attention. These breakthroughs aim to produce more realistic, animatable avatars, narrowing the gap between virtual and real-world experiences. Most existing works employ a Score Distillation Sampling (SDS) loss, combined with a differentiable renderer and a text condition, to guide a diffusion model in generating 3D avatars. However, SDS often produces oversmoothed results with few facial details, and therefore lacks the diversity of ancestral sampling. Other works generate a 3D avatar from a single image, where unwanted lighting effects, perspective views, and inferior image quality make it difficult to reliably reconstruct 3D face meshes with aligned, complete textures. In this paper, we propose a novel 3D avatar generation approach, termed UltrAvatar, with enhanced geometric fidelity and superior-quality physically based rendering (PBR) textures free of unwanted lighting. To this end, the proposed approach comprises a diffuse color extraction model and an authenticity guided texture diffusion model. The former removes unwanted lighting effects to reveal true diffuse colors, so that the generated avatars can be rendered under various lighting conditions. The latter follows two gradient-based guidances to generate PBR textures that render diverse face-identity features and details better aligned with the 3D mesh geometry. We demonstrate the effectiveness and robustness of the proposed method, which outperforms state-of-the-art methods by a large margin in our experiments.

Results showcase high-quality, well-aligned PBR textures produced from 2D face images that were themselves generated from text prompts.

Overview

  • UltrAvatar introduces a framework for creating realistic 3D avatars from text prompts or images, overcoming lighting and detail-preservation issues (a conceptual pipeline sketch follows this list).

  • The paper evaluates existing avatar generation methods, which rely on extensive datasets or scanning setups, and addresses their limitations regarding diversity and texture detail.

  • A novel Diffuse Color Extraction (DCE) technique is introduced, exploiting the role of self-attention features in separating true diffuse colors from lighting effects to produce authentic avatars.

  • An authenticity guided texture diffusion model (AGT-DM) is proposed to enhance texture quality and diversity by applying photometric and edge guidance during the sampling process.

  • Extensive testing shows that UltrAvatar outperforms current state-of-the-art methods, offering improved realism, fidelity, and the ability to generate avatars from various prompts, lighting, and angles.
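
To make the flow of these components concrete, the sketch below strings them together in pseudocode. The function names (text_to_image, dce_model, mesh_fitter, agt_dm) are hypothetical placeholders standing in for the paper's components under the stated pipeline, not the authors' actual API.

```python
# Conceptual sketch of the UltrAvatar pipeline summarized above.
# All callables are hypothetical placeholders, not the paper's API.

def build_avatar(prompt_or_image, text_to_image, dce_model, mesh_fitter, agt_dm):
    # 1. Obtain a face image: either generate one from a text prompt with an
    #    off-the-shelf text-to-image diffusion model, or use a provided photo.
    face_image = (text_to_image(prompt_or_image)
                  if isinstance(prompt_or_image, str) else prompt_or_image)

    # 2. Diffuse Color Extraction (DCE): remove baked-in lighting so the
    #    texture stores only diffuse colors and the avatar stays relightable.
    diffuse_image = dce_model(face_image)

    # 3. Fit a 3D morphable face model to recover mesh geometry and the
    #    (partial) UV texture unwrapped from the diffuse image.
    mesh, partial_uv_texture = mesh_fitter(diffuse_image)

    # 4. Authenticity Guided Texture Diffusion Model (AGT-DM): complete the
    #    full set of PBR texture maps under photometric and edge guidance.
    pbr_textures = agt_dm(partial_uv_texture, mesh)

    return mesh, pbr_textures
```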

Introduction

3D avatar generation sits at the intersection of computer vision and computer graphics and has evolved rapidly with deep learning. Accurate and realistic generation of 3D avatars from single images or text prompts remains a complex endeavor. Difficulties include removing unwanted lighting effects and preserving facial details across different viewpoints. Existing image-to-avatar and text-to-avatar methods either rely on extensive datasets and complex pre-processing or struggle with occlusion and uncontrolled lighting conditions. Moreover, approaches that optimize with Score Distillation Sampling (SDS) often produce avatars that lack diversity in texture detail.

Previous Work

Among current strategies, image-to-avatar methods frequently hinge on physical setups for detailed scanning, which restricts scalability. They also span a range of 3D representations, from parametric models to neural implicit functions. Generative Adversarial Networks (GANs) have been employed to embed 3D features into generative models, while recent works take text prompts as input for 3D generation, relying on an SDS loss for visual consistency; yet SDS compromises diversity. Moreover, guided diffusion models, adaptable via post-training guidance, have leveraged intermediate features for tasks like image editing, suggesting that attention features can be used to extract diffuse colors from a single image and that guidance can be integrated to preserve identity and details.
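
For context on why SDS-based optimization tends to oversmooth, the following is a minimal PyTorch-style sketch of a single SDS update under the standard formulation; the renderer, denoiser, and timestep weighting are assumptions for illustration, not any specific paper's implementation. Because every rendered view is pulled toward a high-density mode of the text-conditioned distribution rather than sampled from it, fine texture detail and diversity tend to be averaged away.

```python
import torch

def sds_update(render_fn, diffusion_model, text_embedding,
               alphas_cumprod, optimizer):
    """One Score Distillation Sampling (SDS) step, sketched for illustration.

    `render_fn`, `diffusion_model`, and `text_embedding` are placeholders for
    a differentiable renderer over the 3D parameters, a frozen text-conditioned
    denoiser, and its prompt conditioning.
    """
    x = render_fn()                               # rendered view of the 3D asset
    t = torch.randint(20, 980, (1,))              # random diffusion timestep
    alpha_bar = alphas_cumprod[t]
    eps = torch.randn_like(x)
    # Forward diffusion: noise the rendering to timestep t.
    x_t = alpha_bar.sqrt() * x + (1 - alpha_bar).sqrt() * eps

    with torch.no_grad():                         # the diffusion model stays frozen
        eps_hat = diffusion_model(x_t, t, text_embedding)

    w = 1 - alpha_bar                             # a common timestep weighting choice
    grad = w * (eps_hat - eps)                    # SDS score residual

    optimizer.zero_grad()
    # Backpropagate the residual through the renderer only. This mode-seeking
    # update pushes every view toward a likely image under the text condition,
    # which is what tends to wash out fine, diverse texture detail.
    x.backward(gradient=grad)
    optimizer.step()
```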

Methodology

The UltrAvatar framework begins by generating a face image from a textual prompt or by taking an existing image as input. The novel Diffuse Color Extraction (DCE) model removes lighting effects to reveal true colors; self-attention features within diffusion models assist in eliminating these effects, a pivotal observation that enables relightable 3D avatars. The process yields clean diffuse textures, essential for rendering under varying lighting conditions. A 3D morphable model then provides the face mesh, and the authenticity guided texture diffusion model (AGT-DM) generates complete Physically Based Rendering (PBR) textures. The AGT-DM applies photometric guidance and edge guidance during its sampling process, ensuring higher diversity and fidelity in the generated avatars.
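
The guidance mechanism resembles classifier-guidance-style sampling, in which gradients of auxiliary losses steer the reverse-diffusion trajectory at each step. The sketch below illustrates one such guided denoising step; the specific losses, scales, and update rule are assumptions for illustration, not the paper's exact implementation.

```python
import torch

def guided_denoise_step(x_t, t, eps_model, cond, alphas_cumprod,
                        photometric_loss, edge_loss,
                        s_photo=1.0, s_edge=1.0):
    """One gradient-guided reverse-diffusion step, sketched for illustration.

    `photometric_loss` and `edge_loss` stand in for the two authenticity
    guidance terms: one compares the predicted texture against the observed
    diffuse colors in visible regions, the other compares edge maps to keep
    high-frequency identity details.
    """
    alpha_bar = alphas_cumprod[t]

    x_t = x_t.detach().requires_grad_(True)
    eps_hat = eps_model(x_t, t, cond)

    # Predicted clean texture at this step (standard DDPM/DDIM relation).
    x0_hat = (x_t - (1 - alpha_bar).sqrt() * eps_hat) / alpha_bar.sqrt()

    # Gradient-based guidance: differentiate the two losses w.r.t. x_t and
    # nudge the noise prediction so sampling drifts toward authentic textures.
    loss = s_photo * photometric_loss(x0_hat) + s_edge * edge_loss(x0_hat)
    grad = torch.autograd.grad(loss, x_t)[0]
    eps_guided = eps_hat + (1 - alpha_bar).sqrt() * grad

    # Recompute the clean prediction with the guided noise estimate; a sampler
    # (DDIM, ancestral, etc.) would then form x_{t-1} from it as usual.
    x0_guided = (x_t - (1 - alpha_bar).sqrt() * eps_guided) / alpha_bar.sqrt()
    return x0_guided.detach(), eps_guided.detach()
```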

Contributions and Results

The paper's contributions are three-pronged. Firstly, it elucidates the relationship between self-attention features and lighting effects, leading to a robust DCE model that overcomes the challenge of separating colors from lighting. Secondly, it introduces an authenticity guided diffusion model capable of generating superior quality PBR textures. Finally, through extensive experimentation, the paper establishes the UltrAvatar framework's superiority over state-of-the-art techniques in rendering high-quality diverse 3D avatars with sharp, true-to-life details in both observed and unobserved views.

In practice, UltrAvatar efficiently handles a range of prompts, delivering high-quality avatars that remain faithful to the textual prompts while offering improved realism and diversity. The framework supports generation under various lighting conditions and viewing angles, demonstrating exceptional results in both fidelity and texture detail. Ablation studies underscore the significance of each proposed component, with the combined effect of photometric and edge guidance reflected markedly in the generation of nuanced facial features. Furthermore, the framework can generate out-of-domain characters, affirming its adaptability.

In conclusion, UltrAvatar marks a significant step forward, suggesting that the successful generation of lifelike, animatable 3D avatars from simple inputs is within grasp. As these avatars become increasingly indistinguishable from real humans and seamlessly responsive to various conditions, they affirm the potential to revolutionize not just gaming and virtual reality but also broader domains where digital human presence is pivotal.
