- The paper presents Audio2ExpressionNet, a temporal network that maps audio signals to temporally stable 3D facial expressions.
- It demonstrates cross-subject generalization by synthesizing realistic facial animations for unseen speakers using minimal training data.
- The work introduces a novel neural rendering pipeline with neural textures for efficient, real-time photo-realistic output.
Neural Voice Puppetry: Audio-Driven Facial Reenactment
The paper "Neural Voice Puppetry: Audio-Driven Facial Reenactment" introduces Neural Voice Puppetry, a framework for synthesizing high-quality facial animations from audio streams. The work combines deep learning with a latent 3D facial model to create photo-realistic visual avatars that can be driven by any audio input, whether from a real person or a synthetic text-to-speech system.
Key Contributions
- Audio2ExpressionNet Architecture: The paper presents Audio2ExpressionNet, a temporal network architecture that maps audio signals to 3D facial expressions. It operates on generalized features pre-extracted by a pre-trained speech-to-text network, producing audio-driven reenactment with a high degree of temporal stability (a minimal sketch of this mapping follows the list below).
- Generalization Across Subjects: A notable feature of Neural Voice Puppetry is its ability to generalize across multiple subjects. Trained on a broad dataset, the model can take audio from unknown speakers and apply it to new facial avatars, a marked improvement over prior models that require extensive subject-specific training data.
- Neural Texture and Rendering: The method renders its output with a novel neural rendering network based on neural textures, significantly improving the photo-realistic quality of the results while ensuring real-time performance. This approach surpasses existing neural rendering methods in both quality and computational efficiency (see the rendering sketch after this list).
- Data Efficiency: Unlike previous methods requiring hours of video footage, this approach synthesizes realistic animations from mere minutes of the target video. This data efficiency is crucial for practical applications where extensive data collection is impractical.
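To make the audio-to-expression mapping concrete, below is a minimal PyTorch sketch of an Audio2ExpressionNet-style network. It is not the authors' implementation; the specific dimensions (29-dim speech-to-text features over a 16-step window per frame, a 32-dimensional generic expression code, an 8-frame attention filter, and a 76-coefficient person-specific output layer) are illustrative assumptions.

```python
# Hedged sketch of an Audio2ExpressionNet-style mapping (not the authors' code).
# Assumptions (illustrative only): 16 time steps of 29-dim speech-to-text features
# per video frame, 32 generic expression coefficients, and a learned attention
# filter over 8 neighbouring per-frame predictions for temporal stability.
import torch
import torch.nn as nn


class Audio2ExpressionSketch(nn.Module):
    def __init__(self, feat_dim=29, window=16, n_expr=32, n_frames=8):
        super().__init__()
        # Per-frame encoder: 1D convolutions over the audio-feature window.
        self.encoder = nn.Sequential(
            nn.Conv1d(feat_dim, 32, kernel_size=3, stride=2, padding=1), nn.LeakyReLU(0.02),
            nn.Conv1d(32, 32, kernel_size=3, stride=2, padding=1), nn.LeakyReLU(0.02),
            nn.Conv1d(32, 64, kernel_size=3, stride=2, padding=1), nn.LeakyReLU(0.02),
            nn.Flatten(),
            nn.Linear(64 * (window // 8), n_expr),
        )
        # Temporal filtering: predict blending weights over neighbouring frames
        # and fuse their per-frame expression estimates into a stable output.
        self.filter = nn.Sequential(
            nn.Linear(n_expr * n_frames, 64), nn.LeakyReLU(0.02),
            nn.Linear(64, n_frames), nn.Softmax(dim=-1),
        )

    def forward(self, audio_windows):
        # audio_windows: (batch, n_frames, feat_dim, window)
        b, t, c, w = audio_windows.shape
        per_frame = self.encoder(audio_windows.reshape(b * t, c, w)).reshape(b, t, -1)
        weights = self.filter(per_frame.reshape(b, -1))          # (batch, n_frames)
        return (weights.unsqueeze(-1) * per_frame).sum(dim=1)    # (batch, n_expr)


if __name__ == "__main__":
    net = Audio2ExpressionSketch()
    expr = net(torch.randn(2, 8, 29, 16))         # generic expression code
    # A person-specific linear layer would map the generic code to the target
    # actor's 3D expression basis (the 76 coefficients here are hypothetical).
    to_person = nn.Linear(32, 76, bias=False)
    print(to_person(expr).shape)                  # torch.Size([2, 76])
```

The separation into a shared generic expression space and a small person-specific output layer is what allows the audio branch to be trained across many speakers while each new avatar only contributes a lightweight mapping.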
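The rendering stage can likewise be illustrated with a hedged sketch of neural-texture-based deferred rendering: a learnable feature texture is sampled with rasterized UV coordinates of the animated 3D face and decoded into an RGB image. The texture resolution, channel count, and decoder depth below are assumptions for illustration, not the paper's exact configuration.

```python
# Hedged sketch of neural-texture deferred rendering (illustrative, not the
# paper's implementation). Assumptions: the 3D face model has already been
# rasterized into a per-pixel UV map, the neural texture holds 16 feature
# channels, and a small convolutional decoder translates sampled features to RGB.
import torch
import torch.nn as nn
import torch.nn.functional as F


class NeuralTextureRenderer(nn.Module):
    def __init__(self, tex_channels=16, tex_res=256):
        super().__init__()
        # Learnable neural texture: feature channels instead of RGB values.
        self.texture = nn.Parameter(torch.randn(1, tex_channels, tex_res, tex_res) * 0.01)
        # Small decoder that maps sampled texture features to an output image.
        self.decoder = nn.Sequential(
            nn.Conv2d(tex_channels, 64, 3, padding=1), nn.LeakyReLU(0.02),
            nn.Conv2d(64, 64, 3, padding=1), nn.LeakyReLU(0.02),
            nn.Conv2d(64, 3, 3, padding=1), nn.Tanh(),
        )

    def forward(self, uv):
        # uv: (batch, H, W, 2) rasterized texture coordinates in [-1, 1].
        b = uv.shape[0]
        sampled = F.grid_sample(
            self.texture.expand(b, -1, -1, -1), uv, align_corners=False
        )                              # (batch, tex_channels, H, W)
        return self.decoder(sampled)   # (batch, 3, H, W) rendered image


if __name__ == "__main__":
    renderer = NeuralTextureRenderer()
    uv_map = torch.rand(1, 128, 128, 2) * 2 - 1   # stand-in for rasterized UVs
    image = renderer(uv_map)
    print(image.shape)                            # torch.Size([1, 3, 128, 128])
```

Because the heavy lifting is a texture lookup plus a shallow network, this style of rendering is consistent with the real-time performance the paper emphasizes.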
Implications and Future Developments
The implications of this research are multifaceted. Practically, it enables digital avatars and virtual assistants with expressive, audio-driven faces, with applications in entertainment, communication, and beyond, including enhanced teleconferencing and dynamic avatars for virtual environments.
Theoretically, the approach introduces an effective way to leverage latent 3D facial models within neural networks, guiding future research in expression synthesis and human-robot interaction. By demonstrating that realistic output can be obtained from small amounts of target data, this research paves the way for more personalized and scalable avatar systems.
A notable consideration for future work is ensuring the ethical deployment of such technologies, especially in content authenticity and misuse prevention. Additionally, improving the versatility and realism in more varied environmental contexts remains an area for further exploration.
Conclusion
The "Neural Voice Puppetry" paper presents significant advancements in audio-driven facial reenactment, leveraging the intersection of 3D facial modeling and neural networks to create realistic, expressive digital avatars. The contributions provide robust results in terms of rendering quality, generalization to unseen subjects, and data efficiency. This work sets a foundation for numerous applications in digital media while highlighting the continued need for responsible development and use of such transformative technologies.