Generalizable One-shot Neural Head Avatar

(2306.08768)
Published Jun 14, 2023 in cs.CV

Abstract

We present a method that reconstructs and animates a 3D head avatar from a single-view portrait image. Existing methods either involve time-consuming optimization for a specific person with multiple images, or they struggle to synthesize intricate appearance details beyond the facial region. To address these limitations, we propose a framework that not only generalizes to unseen identities based on a single-view image without requiring person-specific optimization, but also captures characteristic details within and beyond the face area (e.g. hairstyle, accessories, etc.). At the core of our method are three branches that produce three tri-planes representing the coarse 3D geometry, detailed appearance of a source image, as well as the expression of a target image. By applying volumetric rendering to the combination of the three tri-planes followed by a super-resolution module, our method yields a high fidelity image of the desired identity, expression and pose. Once trained, our model enables efficient 3D head avatar reconstruction and animation via a single forward pass through a network. Experiments show that the proposed approach generalizes well to unseen validation datasets, surpassing SOTA baseline methods by a large margin on head avatar reconstruction and animation.

Figure: Overview of the proposed method's four modules, including a branch for reconstructing neutral portrait geometry and texture.

Overview

  • The paper introduces a novel framework for generating and animating 3D head avatars from a single-view portrait image without needing detailed person-specific fine-tuning.

  • It employs a unique three-branch architecture (canonical, appearance, and expression branches) to capture intricate details and generalize to unseen identities while using volumetric rendering and super-resolution for high-fidelity outputs.

  • The method significantly outperforms existing techniques on metrics for 3D reconstruction, same-identity reenactment, and cross-identity reenactment, highlighting its potential for real-time applications in video conferencing, gaming, and augmented reality.

Generalizable One-shot Neural Head Avatar

In "Generalizable One-shot Neural Head Avatar," Li et al. tackle the problem of reconstructing and animating a 3D head avatar from a single-view portrait image. The paper presents a comprehensive framework that overcomes the limitations of existing methods, enabling efficient, high-fidelity avatar generation and animation without time-consuming person-specific optimization.

Key Contributions and Methodology

The paper's primary contributions are threefold:

  1. Generalization to Unseen Identities: The framework generalizes to new and unseen identities using a single input image, dispensing with the necessity for extensive person-specific fine-tuning.
  2. Intricate Detail Capturing: The method captures intricate details not only within the facial region but also beyond it, encompassing attributes such as the hairstyle and accessories.
  3. Three-branch Architecture: To fulfill these objectives, the authors introduce a unique three-branch architecture consisting of the canonical branch, appearance branch, and expression branch.

Canonical Branch

The canonical branch reconstructs the coarse 3D geometry of the face with a neutral expression and frontal pose. Using a pre-trained SegFormer model, it maps the 2D input image into a canonical 3D space, ensuring alignment regardless of the input view. This canonicalization gives the later stages a stable, view-independent foundation for detailed modeling and is central to generalizing across identities.
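
To make the shape contract concrete, here is a minimal PyTorch sketch of a canonical branch. The `CanonicalBranch` module below is an illustrative stand-in (the paper uses a pre-trained SegFormer encoder, and the layer sizes here are assumptions); only the interface, one portrait image in and three canonical feature planes out, reflects the method.

```python
import torch
import torch.nn as nn

class CanonicalBranch(nn.Module):
    """Illustrative stand-in for the canonical branch.

    The actual method uses a pre-trained SegFormer backbone; this small
    CNN only demonstrates the interface: a single portrait image mapped
    to a tri-plane (three axis-aligned feature planes) in a canonical,
    view-independent space.
    """

    def __init__(self, plane_channels: int = 32, plane_res: int = 64):
        super().__init__()
        self.plane_channels = plane_channels
        self.plane_res = plane_res
        self.encoder = nn.Sequential(
            nn.Conv2d(3, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(64, 128, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(128, 3 * plane_channels, 3, padding=1),
            nn.AdaptiveAvgPool2d(plane_res),
        )

    def forward(self, image: torch.Tensor) -> torch.Tensor:
        # image: (B, 3, H, W) -> tri-plane: (B, 3, C, R, R)
        feats = self.encoder(image)
        return feats.view(-1, 3, self.plane_channels,
                          self.plane_res, self.plane_res)

branch = CanonicalBranch()
planes = branch(torch.randn(1, 3, 256, 256))
print(planes.shape)  # torch.Size([1, 3, 32, 64, 64])
```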

Appearance Branch

The appearance branch captures intricate detail by using a depth map to transfer information from the 2D input image to the 3D reconstruction. By mapping the input image's pixel values onto the 3D space of the canonical reconstruction, it ensures that fine-grained textures and characteristic features, including those outside the face region, are faithfully represented in the final 3D model.
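
The geometric core of this step is unprojection: each pixel is lifted to a 3D point using its depth so that source-image colors can be attached to locations in 3D. A minimal sketch assuming a simple pinhole camera (the intrinsics fx, fy, cx, cy and the constant depth map are hypothetical, not values from the paper):

```python
import torch

def unproject_pixels(image: torch.Tensor, depth: torch.Tensor,
                     fx: float, fy: float, cx: float, cy: float):
    """Lift every pixel of a (3, H, W) image to a 3D point via its depth.

    Returns (H*W, 3) points and matching (H*W, 3) colors. Attaching
    per-pixel colors to 3D locations is how fine texture detail (hair
    strands, accessories) can be carried into the 3D reconstruction.
    """
    _, h, w = image.shape
    v, u = torch.meshgrid(torch.arange(h, dtype=torch.float32),
                          torch.arange(w, dtype=torch.float32),
                          indexing="ij")
    z = depth                      # (H, W) depth value per pixel
    x = (u - cx) / fx * z          # pinhole back-projection
    y = (v - cy) / fy * z
    points = torch.stack([x, y, z], dim=-1).reshape(-1, 3)
    colors = image.permute(1, 2, 0).reshape(-1, 3)
    return points, colors

# Toy usage with a constant depth map and a centered principal point.
img = torch.rand(3, 128, 128)
depth = torch.full((128, 128), 2.0)
pts, cols = unproject_pixels(img, depth, fx=200.0, fy=200.0, cx=64.0, cy=64.0)
print(pts.shape, cols.shape)  # torch.Size([16384, 3]) torch.Size([16384, 3])
```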

Expression Branch

The expression branch modifies the coarse 3D reconstruction to reflect the target image's expression. It leverages a 3D Morphable Model (3DMM) to render a frontal-view image with the target expression; an encoder then maps this rendering to an expression tri-plane, which is combined with the canonical and appearance tri-planes to achieve the desired expression in the output image.
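
Here is a sketch of the expression encoding and the tri-plane fusion, assuming all three branches emit tri-planes of identical shape. Element-wise summation is shown as one plausible fusion operator, not necessarily the paper's exact choice, and producing the frontal 3DMM rendering itself is out of scope here.

```python
import torch
import torch.nn as nn

class ExpressionEncoder(nn.Module):
    """Maps a frontal 3DMM rendering of the target expression to a tri-plane.

    Producing the rendering (fitting a 3DMM to the target frame and
    rasterizing it at a frontal view) happens upstream of this module.
    """

    def __init__(self, plane_channels: int = 32, plane_res: int = 64):
        super().__init__()
        self.c, self.r = plane_channels, plane_res
        self.net = nn.Sequential(
            nn.Conv2d(3, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(64, 3 * plane_channels, 3, padding=1),
            nn.AdaptiveAvgPool2d(plane_res),
        )

    def forward(self, rendering: torch.Tensor) -> torch.Tensor:
        # rendering: (B, 3, H, W) -> tri-plane: (B, 3, C, R, R)
        return self.net(rendering).view(-1, 3, self.c, self.r, self.r)

def fuse_triplanes(canonical: torch.Tensor, appearance: torch.Tensor,
                   expression: torch.Tensor) -> torch.Tensor:
    """Combine three (B, 3, C, R, R) tri-planes ahead of volume rendering.

    Summation keeps the canonical geometry while letting appearance and
    expression features modulate it; one plausible fusion among several.
    """
    return canonical + appearance + expression
```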

Volumetric Rendering and Super-resolution

The combined output from the three branches undergoes volumetric rendering to synthesize the final image, which is then refined using a super-resolution block. This ensures that the generated image combines high fidelity with computational efficiency.
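
This stage follows the standard tri-plane recipe: project each 3D sample point onto the XY, XZ, and YZ planes, bilinearly sample and sum the plane features, decode them to density and color, and alpha-composite along each ray. Below is a generic sketch of the lookup and compositing steps (not the authors' exact decoder or super-resolution module):

```python
import torch
import torch.nn.functional as F

def sample_triplane(planes: torch.Tensor, points: torch.Tensor) -> torch.Tensor:
    """Query tri-plane features at 3D points normalized to [-1, 1]^3.

    planes: (B, 3, C, R, R); points: (B, N, 3). Each point is projected
    onto the XY, XZ, and YZ planes; features are bilinearly sampled from
    each plane and summed (the standard tri-plane lookup).
    """
    projections = (points[..., [0, 1]],   # XY plane
                   points[..., [0, 2]],   # XZ plane
                   points[..., [1, 2]])   # YZ plane
    feats = 0.0
    for i, uv in enumerate(projections):
        grid = uv.unsqueeze(1)                               # (B, 1, N, 2)
        sampled = F.grid_sample(planes[:, i], grid,
                                mode="bilinear", align_corners=False)
        feats = feats + sampled.squeeze(2).transpose(1, 2)   # (B, N, C)
    return feats

def composite(sigmas: torch.Tensor, colors: torch.Tensor,
              delta: float) -> torch.Tensor:
    """NeRF-style alpha compositing along each ray.

    sigmas: (B, R, S) densities; colors: (B, R, S, 3); delta: spacing
    between the S samples per ray. Returns (B, R, 3) pixel colors.
    """
    alphas = 1.0 - torch.exp(-sigmas * delta)
    ones = torch.ones_like(alphas[..., :1])
    trans = torch.cumprod(torch.cat([ones, 1.0 - alphas + 1e-10], -1), -1)
    weights = alphas * trans[..., :-1]                       # (B, R, S)
    return (weights.unsqueeze(-1) * colors).sum(dim=-2)
```

This split, rendering volumetrically at a moderate resolution and upsampling afterwards, is what lets the pipeline pair fidelity with efficiency.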

Numerical Results and Comparative Analysis

The paper reports substantial improvements over state-of-the-art methods. Specifically, the proposed approach yields superior results across multiple metrics (standard definitions of the reconstruction metrics are sketched after this list):

  • 3D Portrait Reconstruction: The method achieves an L1 distance of 0.015, LPIPS of 0.040, and a PSNR of 28.61 on the CelebA dataset, outperforming existing methods such as ROME and HeadNeRF by significant margins.
  • Same-identity Reenactment: On the HDTF dataset, the model shows better performance with metrics like PSNR (22.15), SSIM (0.868), and CSIM (0.789), indicating more accurate and realistic avatar animations compared to baselines.
  • Cross-identity Reenactment: The framework maintains high fidelity and identity preservation with CSIM scores of 0.643 and 0.599 on the HDTF and CelebA datasets respectively.
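
For reference, the L1 and PSNR figures above follow the standard image-reconstruction definitions (LPIPS, SSIM, and CSIM require pre-trained networks or more machinery and are omitted from this sketch):

```python
import torch

def l1_distance(pred: torch.Tensor, target: torch.Tensor) -> float:
    """Mean absolute per-pixel error; images in [0, 1], shape (3, H, W)."""
    return (pred - target).abs().mean().item()

def psnr(pred: torch.Tensor, target: torch.Tensor) -> float:
    """Peak signal-to-noise ratio in dB: 10 * log10(MAX^2 / MSE), MAX = 1."""
    mse = ((pred - target) ** 2).mean()
    return (10.0 * torch.log10(1.0 / mse)).item()
```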

Implications and Future Directions

From a practical perspective, the proposed method holds substantial promise for applications in video conferencing, gaming, and virtual/augmented reality, where realistic and efficient 3D head avatar generation is crucial. The model's capability to generate high-fidelity animations efficiently can reduce computational costs and improve user experiences in real-time applications.

Theoretically, the paper contributes to advancing the understanding of how neural representations can be effectively leveraged to surpass the limitations of traditional 3DMM and GAN-based methods. The successful integration of volume rendering with a tri-plane architecture and super-resolution techniques paves the way for future research to explore even more sophisticated neural rendering paradigms.

Conclusion

In sum, Li et al.'s framework for generalizable one-shot neural head avatars marks a notable advance in computer vision and 3D avatar animation. By addressing the twin challenges of efficiency and fidelity, the method sets a new benchmark for realistic and scalable 3D head avatar generation. Open directions include handling dynamic backgrounds, improving the synthesis of inner-mouth regions, and coping with more varied input conditions.
