DaGAN++: Depth-Aware Generative Adversarial Network for Talking Head Video Generation (2305.06225v2)
Abstract: Predominant techniques for talking head generation largely depend on 2D information, including facial appearances and motions from input face images. Nevertheless, dense 3D facial geometry, such as pixel-wise depth, plays a critical role in constructing accurate 3D facial structures and suppressing complex background noise for generation. However, dense 3D annotations for facial videos are prohibitively costly to obtain. In this work, we first present a novel self-supervised method for learning dense 3D facial geometry (i.e., depth) from face videos, without requiring camera parameters or 3D geometry annotations during training. We further propose a strategy to learn pixel-level uncertainties to identify more reliable rigid-motion pixels for geometry learning. Second, we design an effective geometry-guided facial keypoint estimation module, providing accurate keypoints for generating motion fields. Lastly, we develop a 3D-aware cross-modal (i.e., appearance and depth) attention mechanism, which can be applied to each generation layer to capture facial geometries in a coarse-to-fine manner. Extensive experiments are conducted on three challenging benchmarks (i.e., VoxCeleb1, VoxCeleb2, and HDTF). The results demonstrate that our proposed framework can generate highly realistic reenacted talking videos, establishing new state-of-the-art performance on these benchmarks. The code and trained models are publicly available on the GitHub project page at https://github.com/harlanhong/CVPR2022-DaGAN
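The cross-modal attention mechanism described above fuses per-pixel depth cues with appearance features at each generation layer. The PyTorch sketch below illustrates one plausible form of such a layer; the query/key/value assignment (depth as queries, appearance as keys/values), the 1x1 convolutional projections, and the residual fusion are assumptions made for illustration, not the paper's exact released implementation.

```python
import torch
import torch.nn as nn


class CrossModalAttention(nn.Module):
    """Sketch of a depth/appearance cross-modal attention layer.

    Depth features supply the queries and appearance features the
    keys/values; the attended context is added back to the appearance
    features as a residual. Channel sizes and projection choices are
    illustrative assumptions, not the released DaGAN++ code.
    """

    def __init__(self, channels: int):
        super().__init__()
        self.to_q = nn.Conv2d(channels, channels, kernel_size=1)  # project depth features
        self.to_k = nn.Conv2d(channels, channels, kernel_size=1)  # project appearance features
        self.to_v = nn.Conv2d(channels, channels, kernel_size=1)
        self.scale = channels ** -0.5

    def forward(self, feat_app: torch.Tensor, feat_depth: torch.Tensor) -> torch.Tensor:
        b, c, h, w = feat_app.shape
        q = self.to_q(feat_depth).flatten(2).transpose(1, 2)  # (B, HW, C)
        k = self.to_k(feat_app).flatten(2)                     # (B, C, HW)
        v = self.to_v(feat_app).flatten(2).transpose(1, 2)     # (B, HW, C)
        attn = torch.softmax(q @ k * self.scale, dim=-1)       # (B, HW, HW) spatial attention
        out = (attn @ v).transpose(1, 2).reshape(b, c, h, w)
        return feat_app + out  # residual fusion of geometry-aware context
```

In practice such a layer would be instantiated once per decoder resolution so that the depth guidance is applied coarse-to-fine, with the full spatial attention restricted to the lower-resolution layers to keep memory manageable.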