Abstract

We present a novel approach for synthesizing 3D talking heads with controllable emotion, featuring enhanced lip synchronization and rendering quality. Despite significant progress in the field, prior methods still struggle with multi-view consistency and lack emotional expressiveness. To address these issues, we collect the EmoTalk3D dataset with calibrated multi-view videos, emotional annotations, and per-frame 3D geometry. Training on the EmoTalk3D dataset, we propose a 'Speech-to-Geometry-to-Appearance' mapping framework that first predicts a faithful 3D geometry sequence from the audio features; the appearance of the 3D talking head, represented by 4D Gaussians, is then synthesized from the predicted geometry. The appearance is further disentangled into canonical and dynamic Gaussians, learned from multi-view videos, and fused to render free-view talking head animations. Moreover, our model enables controllable emotion in the generated talking heads, which can be rendered from a wide range of viewpoints. Our method exhibits improved rendering quality and stability in lip motion generation while capturing dynamic facial details such as wrinkles and subtle expressions. Experiments demonstrate the effectiveness of our approach in generating high-fidelity and emotion-controllable 3D talking heads. The code and EmoTalk3D dataset are released at https://nju-3dv.github.io/projects/EmoTalk3D.

(Teaser figure) Multi-view talking face dataset with 8 emotions and per-frame 3D mesh models.

Overview

  • The paper 'EmoTalk3D' presents a novel approach to synthesize 3D talking heads with controllable emotional expressions, significantly improving over current methods in terms of multi-view consistency, emotional expressiveness, and rendering quality.

  • Key innovations include an Emotion-Annotated Multi-View Dataset, a Speech-to-Geometry-to-Appearance mapping framework, and dynamic facial detail synthesis using advanced networks like S2GNet and G2ANet.

  • The paper demonstrates substantial enhancements in rendering quality and realism, showing superior performance in standard metrics and user studies, with implications for digital humans and virtual interactions.

EmoTalk3D: High-Fidelity Free-View Synthesis of Emotional 3D Talking Head

The paper "EmoTalk3D: High-Fidelity Free-View Synthesis of Emotional 3D Talking Head" authored by Qianyun He et al. introduces a novel approach for synthesizing 3D talking heads with controllable emotional expressions. The method addresses significant challenges in current state-of-the-art techniques, particularly concerning multi-view consistency, emotional expressiveness, and rendering quality.

Key Contributions

The contributions of the paper are multi-faceted and stem from several innovative design choices:

Emotion-Annotated Multi-View Dataset:

  • The paper introduces a new dataset designed to address the deficiencies of previous datasets. It includes multi-view videos, emotional annotations, and per-frame 3D geometries, covering 35 subjects, each performing 20 sentences under eight emotions with two intensity levels per emotion.
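To make the dataset's structure concrete, a hypothetical per-clip record is sketched below in Python; the field names, shapes, and emotion labels are illustrative assumptions based on the description above (calibrated multi-view video, audio, an emotion label with intensity, per-frame 3D geometry), not the released file format.

```python
from dataclasses import dataclass
from typing import List
import numpy as np

# Illustrative emotion set: eight categories, each captured at two intensities.
# The dataset's exact label names may differ.
EMOTIONS = ["neutral", "happy", "sad", "angry",
            "surprised", "fearful", "disgusted", "contemptuous"]

@dataclass
class EmoTalk3DClip:
    """One captured clip: a subject speaking one sentence with one emotion.

    Field names and types are hypothetical; they mirror the dataset
    description, not the released file layout.
    """
    subject_id: str
    sentence_id: int              # one of the 20 sentences per subject
    emotion: str                  # one of the eight emotion categories
    intensity: int                # 1 = mild, 2 = strong
    audio: np.ndarray             # (num_audio_samples,) mono waveform
    frames: np.ndarray            # (num_views, num_frames, H, W, 3) RGB video
    camera_params: np.ndarray     # (num_views, ...) calibrated intrinsics/extrinsics
    meshes: List[np.ndarray]      # per-frame vertex arrays, each (V, 3)
```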

Novel Mapping Framework:

  • The authors propose a Speech-to-Geometry-to-Appearance mapping framework. This method efficiently maps input audio to a dynamic 4D point cloud through a Speech-to-Geometry Network (S2GNet). Subsequently, facial appearance is synthesized using dynamic Gaussians constructed from predicted geometries.
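The control flow of this two-stage mapping can be sketched as a single function; in the minimal illustration below, every callable and tensor shape is a placeholder, and only the ordering of stages (audio and emotion to geometry, geometry to fused Gaussians, Gaussians to rendered frames) follows the paper's description.

```python
def synthesize_talking_head(speech_feats, emotion_emb, camera_pose,
                            s2gnet, g2anet, fuse, renderer):
    """Hypothetical 'Speech-to-Geometry-to-Appearance' control flow.

    All callables are placeholders for the paper's components; only the
    stage ordering follows the description in the text.
    """
    # 1. Speech-to-Geometry: predict a dynamic 4D point cloud
    #    (one 3D point set per audio frame).
    point_clouds = s2gnet(speech_feats, emotion_emb)          # (T, N, 3)

    # 2. Geometry-to-Appearance: disentangle appearance into canonical
    #    (static) Gaussians and per-frame dynamic Gaussians.
    canonical, dynamic_per_frame = g2anet(point_clouds, emotion_emb)

    # 3. Fuse the two Gaussian sets and render each frame from the
    #    requested (free) viewpoint.
    return [renderer(fuse(canonical, dyn), camera_pose)
            for dyn in dynamic_per_frame]
```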

Dynamic Facial Detail Synthesis:

  • The model excels in capturing dynamic facial details such as wrinkles and subtle expressions. This is facilitated by a Geometry-to-Appearance Network (G2ANet) which synthesizes the talking head's dynamic appearance from the 3D geometry, addressing multi-view consistency and enhancing emotional expressiveness.

Technical Approach

The technical approach of the paper can be outlined via the following components:

Audio Encoder and Emotion Extractor:

  • Utilizes HuBERT for feature extraction from speech signals, and a transformer-based emotion extractor to decode emotional content from audio inputs.
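As a concrete illustration of the speech-feature step, the snippet below extracts frame-level HuBERT features with the HuggingFace transformers library; the particular checkpoint (facebook/hubert-base-ls960) and preprocessing are assumptions, since the summary does not specify them, and the transformer-based emotion extractor is omitted.

```python
# Minimal sketch: frame-level speech features from HuBERT.
# Assumes `pip install transformers torchaudio`; the checkpoint choice is
# an assumption, not necessarily the one used in the paper.
import torch
import torchaudio
from transformers import AutoFeatureExtractor, HubertModel

extractor = AutoFeatureExtractor.from_pretrained("facebook/hubert-base-ls960")
hubert = HubertModel.from_pretrained("facebook/hubert-base-ls960").eval()

waveform, sr = torchaudio.load("speech.wav")                    # (channels, samples)
waveform = torchaudio.functional.resample(waveform, sr, 16000).mean(dim=0)

inputs = extractor(waveform.numpy(), sampling_rate=16000, return_tensors="pt")
with torch.no_grad():
    # (1, T, 768): one 768-dim feature vector per ~20 ms audio frame.
    speech_feats = hubert(**inputs).last_hidden_state

print(speech_feats.shape)
```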

Speech-to-Geometry Network (S2GNet):

  • Maps the encoded audio features and emotion embedding to a dynamic 4D point cloud, i.e., a per-frame 3D geometry sequence that drives the subsequent appearance stage.
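
A minimal PyTorch sketch of what such a speech-to-geometry network might look like is given below; the LSTM backbone, layer sizes, and the formulation as per-frame offsets from a learnable template point cloud are illustrative assumptions, not the authors' architecture.

```python
import torch
import torch.nn as nn

class S2GNetSketch(nn.Module):
    """Hypothetical speech-to-geometry network (not the paper's architecture).

    Maps per-frame audio features plus an emotion embedding to a dynamic
    point cloud, expressed as per-frame offsets from a template geometry.
    """

    def __init__(self, audio_dim=768, emo_dim=32, hidden_dim=256, num_points=5000):
        super().__init__()
        self.temporal = nn.LSTM(audio_dim + emo_dim, hidden_dim, batch_first=True)
        self.to_offsets = nn.Linear(hidden_dim, num_points * 3)
        # Learnable template / canonical point cloud (N, 3).
        self.template = nn.Parameter(torch.zeros(num_points, 3))

    def forward(self, speech_feats, emotion_emb):
        # speech_feats: (B, T, audio_dim); emotion_emb: (B, emo_dim)
        T = speech_feats.shape[1]
        emo = emotion_emb.unsqueeze(1).expand(-1, T, -1)          # (B, T, emo_dim)
        h, _ = self.temporal(torch.cat([speech_feats, emo], dim=-1))
        offsets = self.to_offsets(h).view(*h.shape[:2], -1, 3)    # (B, T, N, 3)
        return self.template + offsets                            # (B, T, N, 3)
```
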
Geometry-to-Appearance Network (G2ANet):

  • Takes the 4D point cloud as input to synthesize detailed facial appearances. This involves disentangling appearance into canonical (static) and dynamic Gaussians and predicting fine-grained facial details affected by speech and emotion.
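A hedged sketch of how such a geometry-to-appearance module could be organized appears below; the attribute layout (position, rotation, scale, opacity, color) follows standard 3D Gaussian Splatting conventions, while the canonical-plus-residual split and the small conditioning MLP are illustrative assumptions rather than the paper's exact design.

```python
import torch
import torch.nn as nn

class G2ANetSketch(nn.Module):
    """Hypothetical geometry-to-appearance module (illustrative only).

    Given a per-frame point cloud, predicts per-point dynamic residuals
    on top of canonical (static) Gaussian attributes, parameterized as in
    standard 3D Gaussian Splatting: position, rotation quaternion, scale,
    opacity, and color.
    """

    ATTR_DIM = 3 + 4 + 3 + 1 + 3   # xyz + quaternion + scale + opacity + rgb

    def __init__(self, num_points=5000, emo_dim=32, hidden=128):
        super().__init__()
        # Canonical (static) Gaussians, shared across all frames.
        self.canonical = nn.Parameter(torch.zeros(num_points, self.ATTR_DIM))
        # Small MLP predicting per-point dynamic residuals from the
        # predicted geometry and the emotion embedding.
        self.dynamic_head = nn.Sequential(
            nn.Linear(3 + emo_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, self.ATTR_DIM),
        )

    def forward(self, point_cloud, emotion_emb):
        # point_cloud: (B, N, 3); emotion_emb: (B, emo_dim)
        emo = emotion_emb.unsqueeze(1).expand(-1, point_cloud.shape[1], -1)
        dynamic = self.dynamic_head(torch.cat([point_cloud, emo], dim=-1))
        return self.canonical, dynamic   # (N, ATTR_DIM) and (B, N, ATTR_DIM)
```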

Rendering and Completion Module:

  • Combines dynamic and static Gaussians using a predefined weighting scheme to ensure smooth transitions and complete head synthesis, integrating non-facial elements like hair and neck.
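The fusion step can be pictured as a per-attribute blend; the helper below uses a single scalar (or per-point) weight as a stand-in for the paper's predefined weighting scheme, which this summary does not detail.

```python
import torch

def fuse_gaussians(static_attrs: torch.Tensor,
                   dynamic_attrs: torch.Tensor,
                   weight) -> torch.Tensor:
    """Weighted blend of static and dynamic Gaussian attributes.

    `weight` may be a scalar or a per-point tensor in [0, 1]; it stands in
    for the predefined weighting scheme, which is not spelled out here.
    """
    return weight * dynamic_attrs + (1.0 - weight) * static_attrs

# Example: blend 5000 Gaussians (14 attributes each) with 70% dynamic weight.
static = torch.randn(5000, 14)
dynamic = torch.randn(5000, 14)
fused = fuse_gaussians(static, dynamic, weight=0.7)
```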

Experimental Results

The experiments demonstrate significant improvements in both rendering quality and the realism of dynamic facial expressions. The EmoTalk3D model was evaluated using several standard metrics: PSNR, SSIM, LPIPS, LMD, and CPBD. The proposed method consistently achieved superior scores compared to other state-of-the-art approaches.
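For reference, the standard image metrics named above can be reproduced with off-the-shelf libraries; the sketch below computes PSNR and SSIM with scikit-image and LPIPS with the lpips package, while LMD (landmark distance) and CPBD (sharpness) would additionally require a facial landmark detector and a blur-detection implementation and are omitted here.

```python
# Minimal sketch of the standard image metrics; assumes
# `pip install scikit-image lpips torch`.
import numpy as np
import torch
import lpips
from skimage.metrics import peak_signal_noise_ratio, structural_similarity

def image_metrics(pred, gt, lpips_model):
    """PSNR / SSIM / LPIPS between two uint8 RGB images of equal size."""
    psnr = peak_signal_noise_ratio(gt, pred, data_range=255)
    ssim = structural_similarity(gt, pred, channel_axis=-1, data_range=255)

    # LPIPS expects NCHW tensors scaled to [-1, 1].
    to_tensor = lambda im: torch.from_numpy(im).permute(2, 0, 1)[None].float() / 127.5 - 1.0
    lp = lpips_model(to_tensor(pred), to_tensor(gt)).item()
    return psnr, ssim, lp

lpips_model = lpips.LPIPS(net="alex")                              # perceptual backbone
pred = np.random.randint(0, 256, (256, 256, 3), dtype=np.uint8)    # placeholder image
gt = np.random.randint(0, 256, (256, 256, 3), dtype=np.uint8)      # placeholder image
print(image_metrics(pred, gt, lpips_model))
```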

Multi-View Synthesis and Emotional Control:

  • The method produces high-quality images with accurate rendering across a wide range of viewing angles and emotions.

User Study:

  • The user study further substantiates the model's efficacy, with participants rating EmoTalk3D highly in terms of speech-visual synchronization, video fidelity, and image quality compared to competing methods.

Implications and Future Work

The practical implications of this research extend to various domains such as digital humans, virtual conferencing, and interactive robots. Theoretically, the proposed dataset and the Speech-to-Geometry-to-Appearance framework pave the way for future innovations in 3D talking head models.

Nevertheless, the paper acknowledges certain limitations. The model remains person-specific and requires a well-calibrated multi-view camera system for data collection. Additionally, it cannot effectively model dynamic hair movements, indicating an area for future enhancement. Future research could aim to generalize the model to work with single-image inputs and extend dynamic modeling to more complex scenarios, including diverse hairstyles and environmental interactions.

In conclusion, "EmoTalk3D: High-Fidelity Free-View Synthesis of Emotional 3D Talking Head" provides a significant step forward in synthesizing highly detailed and emotionally expressive 3D talking heads, enhancing both the visual and interactive fidelity of virtual characters.
