- The paper introduces a novel framework that inverts a single portrait to extract a motion-free identity code for reliable avatar construction.
- A CNN generates a tri-plane volume from the identity and motion codes, which is then volume-rendered to portray arbitrary expressions and viewpoints.
- Experimental results show 35 FPS rendering on an A100 GPU and superior cross-identity reenactment, highlighting its practical applicability.
OTAvatar: One-shot Talking Face Avatar with Controllable Tri-plane Rendering
The paper "OTAvatar: One-shot Talking Face Avatar with Controllable Tri-plane Rendering" addresses the persistent challenges in building deployable talking face avatars, particularly those related to controllability, generalizability, and computational efficiency. Existing methodologies typically fall short of meeting these criteria simultaneously, either being limited to static portraits or burdened with high computational costs. The proposed OTAvatar framework is designed to overcome these limitations by enabling the construction of face avatars using a generalized, controllable tri-plane rendering approach.
Methodology
OTAvatar employs a sophisticated yet efficient architecture that constructs an animatable face avatar from only a single reference portrait. The process is executed in three key phases (a code sketch follows the list):
- Identity Inversion: A reference portrait is inverted to extract a motion-free identity code. This identity code acts as the foundational characteristic of the avatar.
- Tri-Plane Volume Generation: Utilizing both the fixed identity code and an input motion code, the system employs a convolutional neural network (CNN) to generate a tri-plane volume. This volume serves as a dynamic representation of the face avatar, reflecting desired expressions and motions.
- Volume Rendering: The tri-plane volume is rendered into a two-dimensional image from any desired camera viewpoint, with volume rendering translating the 3D representation into the final 2D output.
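The sketch below outlines these three phases in PyTorch-like pseudocode. All module and function names (`TriPlaneAvatar`, `tri_plane_cnn`, `sample_features`, `render`) are illustrative assumptions rather than the paper's actual implementation, coordinates are assumed to be normalized to [-1, 1], and many details of the real model (network capacity, any super-resolution of the raw render) are omitted.

```python
# Minimal sketch of a tri-plane avatar: identity + motion codes -> three feature
# planes -> volume rendering along camera rays. Names and shapes are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class TriPlaneAvatar(nn.Module):
    def __init__(self, code_dim=512, plane_channels=32, plane_res=256):
        super().__init__()
        # CNN that maps the concatenated identity + motion codes to three feature
        # planes (XY, XZ, YZ), stacked along the channel dimension.
        self.tri_plane_cnn = nn.Sequential(
            nn.Linear(2 * code_dim, 3 * plane_channels * 16 * 16),
            nn.Unflatten(1, (3 * plane_channels, 16, 16)),
            nn.Upsample(size=(plane_res, plane_res), mode="bilinear", align_corners=False),
            nn.Conv2d(3 * plane_channels, 3 * plane_channels, 3, padding=1),
        )
        # Tiny MLP decoder: sampled tri-plane features -> density + colour.
        self.decoder = nn.Sequential(nn.Linear(plane_channels, 64), nn.ReLU(), nn.Linear(64, 4))

    def generate_planes(self, identity_code, motion_code):
        """Phase 2: fuse the motion-free identity code with a motion code."""
        planes = self.tri_plane_cnn(torch.cat([identity_code, motion_code], dim=-1))
        return planes.chunk(3, dim=1)  # (plane_xy, plane_xz, plane_yz)

    def sample_features(self, planes, points):
        """Project 3D points onto the three planes and sum the bilinear samples."""
        coords = [points[..., [0, 1]], points[..., [0, 2]], points[..., [1, 2]]]
        feats = 0
        for plane, uv in zip(planes, coords):
            grid = uv.unsqueeze(1)  # (B, 1, N, 2), coordinates in [-1, 1]
            feats = feats + F.grid_sample(plane, grid, align_corners=False).squeeze(2).permute(0, 2, 1)
        return feats  # (B, N, plane_channels)

    def render(self, planes, ray_origins, ray_dirs, n_samples=48, near=0.8, far=1.2):
        """Phase 3: standard volume rendering along each camera ray."""
        t = torch.linspace(near, far, n_samples, device=ray_origins.device)
        points = ray_origins.unsqueeze(2) + ray_dirs.unsqueeze(2) * t.view(1, 1, -1, 1)
        B, R, S, _ = points.shape
        out = self.decoder(self.sample_features(planes, points.reshape(B, R * S, 3)))
        out = out.reshape(B, R, S, 4)
        sigma, rgb = F.relu(out[..., 0]), torch.sigmoid(out[..., 1:])
        alpha = 1.0 - torch.exp(-sigma * (far - near) / n_samples)
        weights = alpha * torch.cumprod(
            torch.cat([torch.ones_like(alpha[..., :1]), 1.0 - alpha[..., :-1]], dim=-1), dim=-1)
        return (weights.unsqueeze(-1) * rgb).sum(dim=2)  # (B, R, 3) rendered pixel colours
```

Because the identity code is fixed while the motion code and camera vary, the same tri-plane generator can be driven to new expressions and viewpoints without retraining.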
The core novelty lies in the decoupling-by-inverting strategy used in latent space, which efficiently disentangles identity and motion attributes. This enables high-quality reenactment of expressions and poses that were not present in the original training dataset.
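One plausible realization of this decoupling-by-inverting idea, reusing the `TriPlaneAvatar` sketch above: the generator is frozen, the motion code describing the reference portrait's expression and pose is held fixed, and only the identity code is optimized to reconstruct the portrait, so motion information cannot leak into it. The function name, the plain MSE objective, and the optimizer settings are illustrative assumptions, not the paper's exact procedure.

```python
# Hedged sketch of one-shot identity inversion with a frozen generator and a
# fixed motion code; only the identity code is optimized.
import torch
import torch.nn.functional as F

def invert_identity(avatar, ray_o, ray_d, target_rgb, motion_code,
                    code_dim=512, steps=300, lr=1e-2):
    """ray_o / ray_d / target_rgb come from a single reference portrait;
    motion_code is a fixed description of that portrait's expression and pose."""
    identity_code = torch.zeros(1, code_dim, device=target_rgb.device, requires_grad=True)
    optim = torch.optim.Adam([identity_code], lr=lr)

    for _ in range(steps):
        optim.zero_grad()
        # The motion code is frozen, so reconstruction error can only be reduced
        # by pushing identity information into identity_code: motion stays out of it.
        planes = avatar.generate_planes(identity_code, motion_code)
        rendered = avatar.render(planes, ray_o, ray_d)  # (1, R, 3)
        loss = F.mse_loss(rendered, target_rgb)
        loss.backward()
        optim.step()

    return identity_code.detach()  # motion-free identity code
```

At reenactment time, the returned identity code is paired with motion codes extracted from a driving sequence, which is what allows expressions and poses absent from the reference portrait to be rendered.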
Results
Quantitative evaluations show that OTAvatar achieves strong cross-identity reenactment with high 3D consistency across views. The experiments show that the system operates at real-time speeds (35 FPS on an A100 GPU), indicating its practical applicability in scenarios that demand immediate visual feedback. These results surpass those of preceding models in both fidelity and efficiency.
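For context, the snippet below shows a generic way such a per-frame throughput figure is typically measured in PyTorch (warm-up iterations, CUDA synchronization, averaging over many frames); it is not taken from the OTAvatar codebase.

```python
# Generic GPU throughput measurement; render_frame is any callable that renders
# one avatar frame on the GPU.
import time
import torch

@torch.no_grad()
def measure_fps(render_frame, n_warmup=10, n_frames=200):
    for _ in range(n_warmup):        # warm-up excludes CUDA init / allocator overhead
        render_frame()
    torch.cuda.synchronize()
    start = time.perf_counter()
    for _ in range(n_frames):
        render_frame()
    torch.cuda.synchronize()         # wait for all queued GPU work before stopping the clock
    return n_frames / (time.perf_counter() - start)
```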
Implications
The implications of OTAvatar's capabilities are significant for both academic and industrial applications. The successful manipulation of motion and expression with minimal input opens new avenues for research in human-computer interaction, virtual reality, and synthetic media content creation. It also demonstrates the potential for deploying realistic avatars in real-world applications, such as personalized digital assistants or virtual performers.
Future Directions
Looking forward, this work sets the stage for several potential developments:
- Increased Personalization: Future research may focus on refining identity representations to capture even subtler individual features and characteristics.
- Enhanced Realism: Efforts could be made to improve not just the visual fidelity but also the synchronized audio-visual expressions, incorporating more attributes from generative models.
- Scalability and Deployment: Broader scalability could be explored, focusing on minimizing computational overhead and extending applications to mobile and resource-constrained environments.
Overall, OTAvatar represents a significant contribution towards more efficient, control-oriented face avatar systems with wide-ranging capabilities and applications. The innovative combination of neural rendering techniques and latent space manipulation offers a robust framework that promises ongoing utility and expansion in the field of AI-driven animation.