Emergent Mind

Efficient 3D Implicit Head Avatar with Mesh-anchored Hash Table Blendshapes

(2404.01543)
Published Apr 2, 2024 in cs.CV and cs.GR

Abstract

3D head avatars built with neural implicit volumetric representations have achieved unprecedented levels of photorealism. However, the computational cost of these methods remains a significant barrier to their widespread adoption, particularly in real-time applications such as virtual reality and teleconferencing. While attempts have been made to develop fast neural rendering approaches for static scenes, these methods cannot simply be employed to support realistic facial expressions, such as in the case of a dynamic facial performance. To address these challenges, we propose a novel fast 3D neural implicit head avatar model that achieves real-time rendering while maintaining fine-grained controllability and high rendering quality. Our key idea lies in the introduction of local hash table blendshapes, which are learned and attached to the vertices of an underlying face parametric model. These per-vertex hash tables are linearly merged with weights predicted via a CNN, resulting in expression-dependent embeddings. Our novel representation enables efficient density and color predictions using a lightweight MLP, which is further accelerated by a hierarchical nearest neighbor search method. Extensive experiments show that our approach runs in real-time while achieving rendering quality comparable to state-of-the-art methods and decent results on challenging expressions.

Pipeline overview: core avatar representation through Mesh-anchored Hash Table Blendshapes.

Overview

  • This paper introduces a new method for creating 3D neural implicit head avatars that combines real-time rendering capabilities with high visual fidelity and dynamic facial expression control.

  • The method utilizes local hash table blendshapes associated with a face parametric model's vertices, enabling nuanced facial expressions through linearly merging embeddings from a convolutional neural network.

  • A lightweight Multilayer Perceptron (MLP) and a hierarchical nearest neighbor search method are employed for efficient density and color predictions, facilitating real-time rendering.

  • Experimental validation shows the method outperforms existing approaches in rendering speed while maintaining or improving visual quality, offering significant advancements for real-time virtual human representation.

Introduction to Mesh-anchored Hash Table Blendshapes

The creation of photorealistic human avatars has seen significant advancements with the adoption of neural implicit volumetric representations. However, the computational demands of existing methods restrict their application in real-time scenarios, such as virtual reality or teleconferencing. This paper introduces a novel approach to constructing 3D neural implicit head avatars that achieves real-time rendering without compromising the visual fidelity and controllability required for dynamic facial expressions.

The cornerstone of this method is the development of local hash table blendshapes, which are strategically integrated with the vertices of a face parametric model. These blendshapes operate at a vertex level, allowing for more nuanced and localized facial expressions by linearly merging embeddings produced by a convolutional neural network. The adoption of a lightweight Multilayer Perceptron (MLP) alongside a hierarchical nearest neighbor search method forms the basis for efficient density and color predictions, enabling real-time rendering.
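The linear merge of per-vertex hash table blendshapes can be sketched in plain NumPy. All sizes below (vertex count, number of blendshapes, table size, feature dimension) are illustrative stand-ins rather than the paper's actual values, and the random weights stand in for the CNN's expression-dependent predictions:

```python
import numpy as np

rng = np.random.default_rng(0)

V = 4   # mesh vertices (tiny for illustration)
K = 3   # blendshape hash tables per vertex
T = 16  # entries per hash table
F = 4   # feature dimension per entry

# Learned per-vertex blendshape hash tables: one small table per (vertex, blendshape).
tables = rng.normal(size=(V, K, T, F)).astype(np.float32)

# Expression-dependent blending weights; in the paper these come from a CNN,
# here they are random stand-ins.
weights = rng.normal(size=(V, K)).astype(np.float32)

# Linearly merge the K tables at each vertex into one expression-dependent table.
merged = np.einsum("vk,vktf->vtf", weights, tables)  # shape (V, T, F)

print(merged.shape)  # (4, 16, 4)
```

Because the merge is a single weighted sum per vertex, it is cheap to recompute every frame as the expression changes, while the tables themselves stay fixed after training.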

Mesh-anchored Hash Table Blendshapes: The Core Representation

The model employs mesh-anchored hash table blendshapes, in which multiple small hash tables are attached to the vertices of a 3D morphable model (3DMM). This lets each vertex's local deformations influence its surrounding region, enhancing the granularity of expressions and overall model expressiveness. Real-time rendering is made possible by a very lightweight MLP, further accelerated by a hierarchical k-nearest-neighbor search for embedding retrieval.
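The query path described above can be sketched as: for a 3D sample point, find its nearest mesh vertices, look up embeddings in their merged tables, and feed the concatenated features to a tiny MLP that outputs density and color. This is a hedged toy version, not the paper's implementation: the spatial hash, the brute-force neighbor search, and the random MLP weights are all illustrative stand-ins:

```python
import numpy as np

rng = np.random.default_rng(1)
F, H = 4, 8  # embedding dim, hidden width (illustrative)

verts = rng.uniform(size=(4, 3)).astype(np.float32)       # mesh vertex positions
merged = rng.normal(size=(4, 16, F)).astype(np.float32)   # merged tables (V, T, F)

def embed(x, k=2):
    """Gather embeddings from the k nearest vertices' merged hash tables."""
    d = np.linalg.norm(verts - x, axis=1)
    nn = np.argsort(d)[:k]  # brute-force k-NN; fine at toy scale
    feats = []
    for v in nn:
        # Toy spatial hash: quantize the vertex-relative offset to a table slot.
        slot = hash(tuple(np.floor((x - verts[v]) * 8).astype(int))) % merged.shape[1]
        feats.append(merged[v, slot])
    return np.concatenate(feats)  # shape (k * F,)

# Lightweight 2-layer MLP mapping embeddings to density + RGB (random weights).
W1 = rng.normal(size=(2 * F, H)).astype(np.float32)
W2 = rng.normal(size=(H, 4)).astype(np.float32)

x = np.array([0.5, 0.5, 0.5], dtype=np.float32)
out = np.maximum(embed(x) @ W1, 0) @ W2  # ReLU hidden layer
density, rgb = out[0], out[1:]
```

Keeping the MLP this small is what makes per-sample evaluation cheap; the heavy lifting is shifted into the table lookups, which are just memory reads.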

Hierarchical k-Nearest-Neighbor Search

To further expedite rendering, this work introduces a novel hierarchical k-nearest-neighbor (k-NN) search strategy. By organizing query points into clusters, the method efficiently narrows down the search space for neighbor vertices, which critically contributes to achieving real-time rendering speeds without sacrificing visual quality.
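A minimal sketch of the two-stage idea, assuming a simple grid clustering of query points and a heuristic candidate radius (both are stand-ins for the paper's actual clustering and bounds, and the fallback makes this an approximation rather than an exact reproduction):

```python
import numpy as np

rng = np.random.default_rng(2)

verts = rng.uniform(size=(200, 3)).astype(np.float32)    # mesh vertices
queries = rng.uniform(size=(500, 3)).astype(np.float32)  # ray-sample points
k = 4

# Stage 1: group query points into coarse grid cells.
cell = 0.25
keys = np.floor(queries / cell).astype(int)
clusters = {}
for i, key in enumerate(map(tuple, keys)):
    clusters.setdefault(key, []).append(i)

def knn(point, cand):
    """Exact k-NN restricted to a candidate index set."""
    d = np.linalg.norm(verts[cand] - point, axis=1)
    return cand[np.argsort(d)[:k]]

radius = cell * np.sqrt(3)  # heuristic: covers the cell diagonal with margin

# Stage 2: per cluster, prune to nearby vertices once, then run exact
# k-NN for each query point against that shared candidate set.
result = np.empty((len(queries), k), dtype=int)
for key, idx in clusters.items():
    center = (np.array(key) + 0.5) * cell
    d = np.linalg.norm(verts - center, axis=1)
    cand = np.where(d < radius + cell)[0]
    if len(cand) < k:                      # fallback if pruning was too aggressive
        cand = np.arange(len(verts))
    for i in idx:
        result[i] = knn(queries[i], cand)
```

The payoff is that the expensive all-vertex distance computation happens once per cluster instead of once per query point, which is where the speedup comes from when many ray samples land in the same region.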

Experimental Validation and Results

The proposed approach consistently outperforms prior methods in rendering speed, achieving over 30 frames per second at 512x512 resolution, while maintaining visual quality comparable to, and in certain cases superior to, state-of-the-art high-quality 3D avatars. The experiments also show that the model renders challenging expressions more accurately than current efficient avatars, establishing it as a significant advancement in the field.

Theoretical and Practical Implications

This model's innovative blend of efficiency, quality, and controllability heralds a new direction for the development of 3D head avatars, particularly for real-time applications. The method elegantly circumvents the computational challenges traditionally associated with neural implicit models, without compromising on the ability to generate dynamic, photorealistic facial expressions. Further exploration into optimizing this approach could lead to broader applications, including more immersive virtual reality experiences and more realistic telepresence in video conferencing.

Conclusion

This paper presents a groundbreaking method for creating high-fidelity, controllable 3D head avatars capable of real-time rendering. The introduction of mesh-anchored hash table blendshapes, combined with a hierarchical k-NN search, represents a significant technological advancement, pushing the boundaries of what is possible in the domain of virtual human representation. Future work will undoubtedly build on this foundation, exploring new realms of efficiency and realism in digital human modeling.
