
Learn2Talk: 3D Talking Face Learns from 2D Talking Face (2404.12888v1)

Published 19 Apr 2024 in cs.CV, cs.GR, and cs.LG

Abstract: Speech-driven facial animation methods generally fall into two main classes, 3D and 2D talking face, both of which have attracted considerable research attention in recent years. However, to the best of our knowledge, research on 3D talking face has not gone as deep as that on 2D talking face with respect to lip-synchronization (lip-sync) and speech perception. To bridge the gap between the two sub-fields, we propose a learning framework named Learn2Talk, which constructs a better 3D talking face network by exploiting two points of expertise from the field of 2D talking face. Firstly, inspired by the audio-video sync network, a 3D lip-sync expert model is devised to pursue lip-sync between audio and 3D facial motion. Secondly, a teacher model selected from 2D talking face methods is used to guide the training of the audio-to-3D-motion regression network, yielding higher 3D vertex accuracy. Extensive experiments show the advantages of the proposed framework in terms of lip-sync, vertex accuracy and speech perception compared with the state of the art. Finally, we show two applications of the proposed framework: audio-visual speech recognition and speech-driven 3D Gaussian Splatting based avatar animation.

Summary

  • The paper presents Learn2Talk, a learning framework that improves speech-driven 3D facial animation by borrowing expertise from 2D talking face research.
  • It trains a 3D lip-sync expert, inspired by 2D audio-video sync networks, and uses a 2D talking face teacher model to guide the audio-to-3D-motion regression network (a rough sketch of such a sync objective follows below).
  • Experiments show gains in lip synchronization, vertex accuracy and speech perception over prior methods, and the framework is demonstrated on audio-visual speech recognition and speech-driven 3D Gaussian Splatting based avatar animation.
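
The lip-sync expert is described as being inspired by the audio-video sync networks used for 2D talking faces. As a hedged sketch only (this mirrors the widely used Wav2Lip-style sync objective, not necessarily the paper's exact 3D formulation), such an expert embeds an audio window $a$ and the corresponding facial-motion window $m$, and the regression network is penalized when the two disagree:

$$
P_{\text{sync}} = \frac{a \cdot m}{\max\!\big(\lVert a \rVert_2 \, \lVert m \rVert_2,\ \epsilon\big)},
\qquad
\mathcal{L}_{\text{sync}} = -\frac{1}{N} \sum_{i=1}^{N} \log P_{\text{sync}}^{(i)}.
$$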

Overview of the IEEEtran LaTeX Templates Usage Guide

Introduction and Purpose

The guide gives a comprehensive overview of the IEEEtran LaTeX class file, which is designed for preparing IEEE publications that conform to the society's typesetting specifications. It describes the document types the class can produce, such as journal articles, conference papers, and technical notes, each selected through a distinct class option, as illustrated in the sketch below.
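
As a minimal, hedged illustration (the option names follow the standard IEEEtran class; consult the guide for the full list and their variants), the document type is selected in the \documentclass line:

```latex
% Journal article, the default mode used by most IEEE transactions.
\documentclass[journal]{IEEEtran}
% \documentclass[conference]{IEEEtran}  % conference paper
% \documentclass[technote]{IEEEtran}    % technical note / correspondence

\begin{document}
\title{A Placeholder Title}
\author{A.~Author}
\maketitle
Body text goes here.
\end{document}
```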

Template Design and Intent

The IEEEtran templates are meant to approximate the final presentation and length of articles submitted to IEEE publications, but they are not the final layout that appears in print or in digital libraries. Their main purposes are to give authors a reliable page-length estimate and to ease conversion to XML, from which the publisher composes the final versions for the various output formats, including IEEE Xplore®. In short, the templates serve as a structural guide rather than the final layout.

Template and LaTeX Distribution Sources

Users are directed to multiple sources for obtaining the IEEEtran templates and LaTeX distributions. The IEEE Template Selector is highlighted as the primary source for the most current templates. For LaTeX distributions, the TeX Users Group (TUG) at tug.org is recommended, as it provides comprehensive resources for various operating systems.

Usage and Customization

The guide covers usage scenarios by specifying the appropriate documentclass options for different types of publications. It breaks down the template variants for journals and conferences affiliated with different IEEE societies, including the Computer Society and the Communications Society, each variant being tailored to that society's submission standards; see the sketch below.
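
A hedged sketch of how the society-specific modes are typically selected (compsoc and comsoc are the standard IEEEtran option names for the Computer Society and the Communications Society; verify against the current guide):

```latex
% IEEE Computer Society journal formatting.
\documentclass[journal,compsoc]{IEEEtran}
% \documentclass[journal,comsoc]{IEEEtran}  % IEEE Communications Society formatting

\begin{document}
\title{A Placeholder Title}
\author{A.~Author}
\maketitle
The class option alone switches the society-specific formatting.
\end{document}
```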

Practical Guides and Examples

The guide provides coding examples for common formatting needs, covering front matter (title, author details, running heads, index terms) as well as body elements such as section headings, figures, and tables. It stresses consistent formatting, especially for special content such as mathematical equations and complex tables. A condensed front-matter and body skeleton is sketched below.
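
The following is a condensed, hedged sketch of the kind of skeleton the guide walks through; the commands are the standard IEEEtran ones, while the title, author names, and table contents are placeholders:

```latex
\documentclass[journal]{IEEEtran}
\usepackage{graphicx}

\begin{document}

% Front matter: title, authors, running heads, abstract, index terms.
\title{A Placeholder Title}
\author{First~Author and Second~Author%
  \thanks{Placeholder acknowledgment and contact information.}}
\markboth{Journal Name}{Author \MakeLowercase{\textit{et al.}}: A Placeholder Title}
\maketitle

\begin{abstract}
One or two sentences summarizing the work.
\end{abstract}

\begin{IEEEkeywords}
First keyword, second keyword.
\end{IEEEkeywords}

% Body elements: sections, figures, tables.
\section{Introduction}
\IEEEPARstart{T}{his} is the opening paragraph of the first section.

\begin{figure}[!t]
  \centering
  % \includegraphics[width=\columnwidth]{example-figure}
  \caption{A placeholder figure.}
  \label{fig:example}
\end{figure}

\begin{table}[!t]
  \caption{A Placeholder Table}
  \centering
  \begin{tabular}{cc}
    \hline
    Column A & Column B \\
    \hline
    1 & 2 \\
    \hline
  \end{tabular}
\end{table}

\end{document}
```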

Support and Additional Resources

Rather than leaving users to troubleshoot alone, the guide points to LaTeX user groups and forums where both novice and experienced users can seek advice and find solutions to common and more complex problems.

Implications and Future Directions

By standardizing the approach to manuscript preparation for IEEE publications, the IEEEtran LaTeX class file assists in maintaining consistency and high quality in scholarly publications. Looking ahead, as digital publishing evolves, templates like IEEEtran will need continual updates to accommodate new typesetting technologies and publication standards. Future updates might include enhanced integration with digital tools that automatically check for adherence to IEEE styling guidelines or more sophisticated XML conversion tools that streamline the publication process further.

Conclusion

The IEEEtran LaTeX templates are an essential resource for authors targeting IEEE journals and conferences, encapsulating the formatting requirements in a practical tool that helps produce compliant, professionally formatted manuscripts. The usage guide is equally important for applying the class effectively and ensuring that submissions meet the scholarly standards set by IEEE.
