EmoVOCA: Speech-Driven Emotional 3D Talking Heads (2403.12886v3)

Published 19 Mar 2024 in cs.CV

Abstract: The domain of 3D talking head generation has witnessed significant progress in recent years. A notable challenge in this field lies in blending speech-related motions with expression dynamics, a difficulty caused primarily by the lack of comprehensive 3D datasets that combine diverse spoken sentences with a variety of facial expressions. While prior works have attempted to exploit 2D video data and parametric 3D models as a workaround, these still show limitations when jointly modeling the two motions. In this work, we address the problem from a different perspective and propose a data-driven technique for creating a synthetic dataset, called EmoVOCA, obtained by combining a collection of inexpressive 3D talking heads with a set of 3D expressive sequences. To demonstrate the advantages of this approach and the quality of the dataset, we then designed and trained an emotional 3D talking head generator that accepts a 3D face, an audio file, an emotion label, and an intensity value as inputs, and learns to animate audio-synchronized lip movements with the expressive traits of the face. Comprehensive quantitative and qualitative experiments using our data and generator demonstrate a superior ability to synthesize convincing animations compared with the best-performing methods in the literature. Our code and pre-trained model will be made available.
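
The abstract describes the dataset-construction idea only at a high level: speech-related motion from inexpressive talking sequences is merged with expression dynamics from expressive, non-speaking sequences. The sketch below illustrates the simplest possible reading of that idea, an additive blend of per-vertex displacements against a shared neutral template. The function name, array shapes, and the purely additive mixing are illustrative assumptions; the paper describes its actual combination technique as data-driven, so this is a sketch, not the authors' method.

```python
import numpy as np

def combine_sequences(talking_verts, expressive_verts, template, intensity=1.0):
    """Blend a speech-only and an expression-only mesh sequence into a
    synthetic expressive talking sequence (illustrative sketch only).

    talking_verts:    (T1, V, 3) vertices of an inexpressive talking head
    expressive_verts: (T2, V, 3) vertices of an expressive, non-speaking face
    template:         (V, 3) shared neutral template in dense correspondence
    intensity:        scalar weighting of the expression contribution
    """
    # Pair frames one-to-one by truncating to the shorter sequence.
    T = min(len(talking_verts), len(expressive_verts))

    # Per-frame displacements relative to the shared neutral template.
    speech_disp = talking_verts[:T] - template   # lip/speech motion
    expr_disp = expressive_verts[:T] - template  # expression motion

    # Naive additive mix; EmoVOCA's real combination is learned/data-driven.
    return template + speech_disp + intensity * expr_disp


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    template = rng.normal(size=(5023, 3))  # FLAME-style vertex count (assumed)
    talking = template + 0.01 * rng.normal(size=(120, 5023, 3))
    expressive = template + 0.02 * rng.normal(size=(90, 5023, 3))
    out = combine_sequences(talking, expressive, template, intensity=0.8)
    print(out.shape)  # (90, 5023, 3)
```

Under this reading, the generator's emotion label would select which expressive source sequence is blended in, and the intensity input would scale its contribution, matching the inputs listed in the abstract.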

