Abstract

We present a novel approach for synthesizing 3D talking heads with controllable emotion, featuring enhanced lip synchronization and rendering quality. Despite significant progress in the field, prior methods still struggle with multi-view consistency and lack emotional expressiveness. To address these issues, we collect the EmoTalk3D dataset with calibrated multi-view videos, emotional annotations, and per-frame 3D geometry. Training on the EmoTalk3D dataset, we propose a 'Speech-to-Geometry-to-Appearance' mapping framework that first predicts a faithful 3D geometry sequence from the audio features; the appearance of the 3D talking head, represented by 4D Gaussians, is then synthesized from the predicted geometry. The appearance is further disentangled into canonical and dynamic Gaussians, learned from multi-view videos, and fused to render free-view talking head animations. Moreover, our model enables controllable emotion in the generated talking heads, which can be rendered from a wide range of viewpoints. Our method exhibits improved rendering quality and stability in lip motion generation while capturing dynamic facial details such as wrinkles and subtle expressions. Experiments demonstrate the effectiveness of our approach in generating high-fidelity and emotion-controllable 3D talking heads. The code and EmoTalk3D dataset are released at https://nju-3dv.github.io/projects/EmoTalk3D.

(Teaser figure) Multi-view talking face dataset with 8 emotions and per-frame 3D mesh models.

Overview

  • The paper 'EmoTalk3D' presents a novel approach to synthesize 3D talking heads with controllable emotional expressions, significantly improving over current methods in terms of multi-view consistency, emotional expressiveness, and rendering quality.

  • Key innovations include an Emotion-Annotated Multi-View Dataset, a Speech-to-Geometry-to-Appearance mapping framework, and dynamic facial detail synthesis using advanced networks like S2GNet and G2ANet.

  • The paper demonstrates substantial enhancements in rendering quality and realism, showing superior performance in standard metrics and user studies, with implications for digital humans and virtual interactions.

EmoTalk3D: High-Fidelity Free-View Synthesis of Emotional 3D Talking Head

The paper "EmoTalk3D: High-Fidelity Free-View Synthesis of Emotional 3D Talking Head" authored by Qianyun He et al. introduces a novel approach for synthesizing 3D talking heads with controllable emotional expressions. The method addresses significant challenges in current state-of-the-art techniques, particularly concerning multi-view consistency, emotional expressiveness, and rendering quality.

Key Contributions

The contributions of the paper are multi-faceted and stem from several innovative design choices:

Emotion-Annotated Multi-View Dataset:

  • The paper introduces a new dataset designed to address the deficiencies of previous datasets. It includes multi-view videos, emotional annotations, and per-frame 3D geometries, covering 35 subjects, each performing 20 sentences under eight emotions with two intensity levels per emotion.
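To make the dataset's structure concrete, a hypothetical per-clip record is sketched below in Python; the field names, shapes, and emotion labels are illustrative assumptions based on the description above (calibrated multi-view video, audio, an emotion label with intensity, per-frame 3D geometry), not the released file format.

```python
from dataclasses import dataclass
from typing import List
import numpy as np

# Illustrative emotion set: eight categories, each captured at two intensities.
# The dataset's exact label names may differ.
EMOTIONS = ["neutral", "happy", "sad", "angry",
            "surprised", "fearful", "disgusted", "contemptuous"]

@dataclass
class EmoTalk3DClip:
    """One captured clip: a subject speaking one sentence with one emotion.

    Field names and types are hypothetical; they mirror the dataset
    description, not the released file layout.
    """
    subject_id: str
    sentence_id: int              # one of the 20 sentences per subject
    emotion: str                  # one of the eight emotion categories
    intensity: int                # 1 = mild, 2 = strong
    audio: np.ndarray             # (num_audio_samples,) mono waveform
    frames: np.ndarray            # (num_views, num_frames, H, W, 3) RGB video
    camera_params: np.ndarray     # (num_views, ...) calibrated intrinsics/extrinsics
    meshes: List[np.ndarray]      # per-frame vertex arrays, each (V, 3)
```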

Novel Mapping Framework:

  • The authors propose a Speech-to-Geometry-to-Appearance mapping framework. This method efficiently maps input audio to a dynamic 4D point cloud through a Speech-to-Geometry Network (S2GNet). Subsequently, facial appearance is synthesized using dynamic Gaussians constructed from predicted geometries.
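The control flow of this two-stage mapping can be sketched as a single function; in the minimal illustration below, every callable and tensor shape is a placeholder, and only the ordering of stages (audio and emotion to geometry, geometry to fused Gaussians, Gaussians to rendered frames) follows the paper's description.

```python
def synthesize_talking_head(speech_feats, emotion_emb, camera_pose,
                            s2gnet, g2anet, fuse, renderer):
    """Hypothetical 'Speech-to-Geometry-to-Appearance' control flow.

    All callables are placeholders for the paper's components; only the
    stage ordering follows the description in the text.
    """
    # 1. Speech-to-Geometry: predict a dynamic 4D point cloud
    #    (one 3D point set per audio frame).
    point_clouds = s2gnet(speech_feats, emotion_emb)          # (T, N, 3)

    # 2. Geometry-to-Appearance: disentangle appearance into canonical
    #    (static) Gaussians and per-frame dynamic Gaussians.
    canonical, dynamic_per_frame = g2anet(point_clouds, emotion_emb)

    # 3. Fuse the two Gaussian sets and render each frame from the
    #    requested (free) viewpoint.
    return [renderer(fuse(canonical, dyn), camera_pose)
            for dyn in dynamic_per_frame]
```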

Dynamic Facial Detail Synthesis:

  • The model excels in capturing dynamic facial details such as wrinkles and subtle expressions. This is facilitated by a Geometry-to-Appearance Network (G2ANet) which synthesizes the talking head's dynamic appearance from the 3D geometry, addressing multi-view consistency and enhancing emotional expressiveness.

Technical Approach

The technical approach of the paper can be outlined via the following components:

Audio Encoder and Emotion Extractor:

  • Utilizes HuBERT for feature extraction from speech signals, and a transformer-based emotion extractor to decode emotional content from audio inputs.
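As a concrete illustration of the speech-feature step, the snippet below extracts frame-level HuBERT features with the HuggingFace transformers library; the particular checkpoint (facebook/hubert-base-ls960) and preprocessing are assumptions, since the summary does not specify them, and the transformer-based emotion extractor is omitted.

```python
# Minimal sketch: frame-level speech features from HuBERT.
# Assumes `pip install transformers torchaudio`; the checkpoint choice is
# an assumption, not necessarily the one used in the paper.
import torch
import torchaudio
from transformers import AutoFeatureExtractor, HubertModel

extractor = AutoFeatureExtractor.from_pretrained("facebook/hubert-base-ls960")
hubert = HubertModel.from_pretrained("facebook/hubert-base-ls960").eval()

waveform, sr = torchaudio.load("speech.wav")                    # (channels, samples)
waveform = torchaudio.functional.resample(waveform, sr, 16000).mean(dim=0)

inputs = extractor(waveform.numpy(), sampling_rate=16000, return_tensors="pt")
with torch.no_grad():
    # (1, T, 768): one 768-dim feature vector per ~20 ms audio frame.
    speech_feats = hubert(**inputs).last_hidden_state

print(speech_feats.shape)
```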

Speech-to-Geometry Network (S2GNet):

  • Maps the encoded audio features and emotion embedding to a dynamic 4D point cloud, i.e., a per-frame 3D geometry sequence that drives the subsequent appearance stage.
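
A minimal PyTorch sketch of what such a speech-to-geometry network might look like is given below; the LSTM backbone, layer sizes, and the formulation as per-frame offsets from a learnable template point cloud are illustrative assumptions, not the authors' architecture.

```python
import torch
import torch.nn as nn

class S2GNetSketch(nn.Module):
    """Hypothetical speech-to-geometry network (not the paper's architecture).

    Maps per-frame audio features plus an emotion embedding to a dynamic
    point cloud, expressed as per-frame offsets from a template geometry.
    """

    def __init__(self, audio_dim=768, emo_dim=32, hidden_dim=256, num_points=5000):
        super().__init__()
        self.temporal = nn.LSTM(audio_dim + emo_dim, hidden_dim, batch_first=True)
        self.to_offsets = nn.Linear(hidden_dim, num_points * 3)
        # Learnable template / canonical point cloud (N, 3).
        self.template = nn.Parameter(torch.zeros(num_points, 3))

    def forward(self, speech_feats, emotion_emb):
        # speech_feats: (B, T, audio_dim); emotion_emb: (B, emo_dim)
        T = speech_feats.shape[1]
        emo = emotion_emb.unsqueeze(1).expand(-1, T, -1)          # (B, T, emo_dim)
        h, _ = self.temporal(torch.cat([speech_feats, emo], dim=-1))
        offsets = self.to_offsets(h).view(*h.shape[:2], -1, 3)    # (B, T, N, 3)
        return self.template + offsets                            # (B, T, N, 3)
```
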
Geometry-to-Appearance Network (G2ANet):

  • Takes the 4D point cloud as input to synthesize detailed facial appearances. This involves disentangling appearance into canonical (static) and dynamic Gaussians and predicting fine-grained facial details affected by speech and emotion.
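A hedged sketch of how such a geometry-to-appearance module could be organized appears below; the attribute layout (position, rotation, scale, opacity, color) follows standard 3D Gaussian Splatting conventions, while the canonical-plus-residual split and the small conditioning MLP are illustrative assumptions rather than the paper's exact design.

```python
import torch
import torch.nn as nn

class G2ANetSketch(nn.Module):
    """Hypothetical geometry-to-appearance module (illustrative only).

    Given a per-frame point cloud, predicts per-point dynamic residuals
    on top of canonical (static) Gaussian attributes, parameterized as in
    standard 3D Gaussian Splatting: position, rotation quaternion, scale,
    opacity, and color.
    """

    ATTR_DIM = 3 + 4 + 3 + 1 + 3   # xyz + quaternion + scale + opacity + rgb

    def __init__(self, num_points=5000, emo_dim=32, hidden=128):
        super().__init__()
        # Canonical (static) Gaussians, shared across all frames.
        self.canonical = nn.Parameter(torch.zeros(num_points, self.ATTR_DIM))
        # Small MLP predicting per-point dynamic residuals from the
        # predicted geometry and the emotion embedding.
        self.dynamic_head = nn.Sequential(
            nn.Linear(3 + emo_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, self.ATTR_DIM),
        )

    def forward(self, point_cloud, emotion_emb):
        # point_cloud: (B, N, 3); emotion_emb: (B, emo_dim)
        emo = emotion_emb.unsqueeze(1).expand(-1, point_cloud.shape[1], -1)
        dynamic = self.dynamic_head(torch.cat([point_cloud, emo], dim=-1))
        return self.canonical, dynamic   # (N, ATTR_DIM) and (B, N, ATTR_DIM)
```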

Rendering and Completion Module:

  • Combines dynamic and static Gaussians using a predefined weighting scheme to ensure smooth transitions and complete head synthesis, integrating non-facial elements like hair and neck.
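The fusion step can be pictured as a per-attribute blend; the helper below uses a single scalar (or per-point) weight as a stand-in for the paper's predefined weighting scheme, which this summary does not detail.

```python
import torch

def fuse_gaussians(static_attrs: torch.Tensor,
                   dynamic_attrs: torch.Tensor,
                   weight) -> torch.Tensor:
    """Weighted blend of static and dynamic Gaussian attributes.

    `weight` may be a scalar or a per-point tensor in [0, 1]; it stands in
    for the predefined weighting scheme, which is not spelled out here.
    """
    return weight * dynamic_attrs + (1.0 - weight) * static_attrs

# Example: blend 5000 Gaussians (14 attributes each) with 70% dynamic weight.
static = torch.randn(5000, 14)
dynamic = torch.randn(5000, 14)
fused = fuse_gaussians(static, dynamic, weight=0.7)
```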

Experimental Results

The experiments demonstrate significant improvements in both rendering quality and the realism of dynamic facial expressions. The EmoTalk3D model was evaluated using several standard metrics: PSNR, SSIM, LPIPS, LMD, and CPBD. The proposed method consistently achieved superior scores compared to other state-of-the-art approaches.
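For reference, the standard image metrics named above can be reproduced with off-the-shelf libraries; the sketch below computes PSNR and SSIM with scikit-image and LPIPS with the lpips package, while LMD (landmark distance) and CPBD (sharpness) would additionally require a facial landmark detector and a blur-detection implementation and are omitted here.

```python
# Minimal sketch of the standard image metrics; assumes
# `pip install scikit-image lpips torch`.
import numpy as np
import torch
import lpips
from skimage.metrics import peak_signal_noise_ratio, structural_similarity

def image_metrics(pred, gt, lpips_model):
    """PSNR / SSIM / LPIPS between two uint8 RGB images of equal size."""
    psnr = peak_signal_noise_ratio(gt, pred, data_range=255)
    ssim = structural_similarity(gt, pred, channel_axis=-1, data_range=255)

    # LPIPS expects NCHW tensors scaled to [-1, 1].
    to_tensor = lambda im: torch.from_numpy(im).permute(2, 0, 1)[None].float() / 127.5 - 1.0
    lp = lpips_model(to_tensor(pred), to_tensor(gt)).item()
    return psnr, ssim, lp

lpips_model = lpips.LPIPS(net="alex")                              # perceptual backbone
pred = np.random.randint(0, 256, (256, 256, 3), dtype=np.uint8)    # placeholder image
gt = np.random.randint(0, 256, (256, 256, 3), dtype=np.uint8)      # placeholder image
print(image_metrics(pred, gt, lpips_model))
```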

Multi-View Synthesis and Emotional Control:

  • The method produces high-quality images with accurate rendering across a wide range of viewing angles and emotions.

User Study:

  • The user study further substantiates the model's efficacy, with participants rating EmoTalk3D highly in terms of speech-visual synchronization, video fidelity, and image quality compared to competing methods.

Implications and Future Work

The practical implications of this research extend to various domains such as digital humans, virtual conferencing, and interactive robots. Theoretically, the proposed dataset and the Speech-to-Geometry-to-Appearance framework pave the way for future innovations in 3D talking head models.

Nevertheless, the paper acknowledges certain limitations. The model remains person-specific and requires a well-calibrated multi-view camera system for data collection. Additionally, it cannot effectively model dynamic hair movements, indicating an area for future enhancement. Future research could aim to generalize the model to work with single-image inputs and extend dynamic modeling to more complex scenarios, including diverse hairstyles and environmental interactions.

In conclusion, "EmoTalk3D: High-Fidelity Free-View Synthesis of Emotional 3D Talking Head" provides a significant step forward in synthesizing highly detailed and emotionally expressive 3D talking heads, enhancing both the visual and interactive fidelity of virtual characters.
