Interactive Conversational Head Generation (2307.02090v1)

Published 5 Jul 2023 in cs.CV

Abstract: We introduce a new conversational head generation benchmark for synthesizing the behaviors of a single interlocutor in a face-to-face conversation. The capability to automatically synthesize interlocutors that can participate in long, multi-turn conversations is vital and offers benefits for various applications, including digital humans, virtual agents, and social robots. Existing research, however, primarily focuses on talking head generation (one-way interaction), which hinders the creation of a digital human for conversational (two-way) interaction because the listening and interaction components are absent. In this work, we construct two datasets to address this issue: "ViCo" for independent talking and listening head generation tasks at the sentence level, and "ViCo-X" for synthesizing interlocutors in multi-turn conversational scenarios. Based on ViCo and ViCo-X, we define three novel tasks targeting interaction modeling during face-to-face conversation: 1) responsive listening head generation, which makes listeners respond actively to the speaker with non-verbal signals; 2) expressive talking head generation, which guides speakers to be aware of listeners' behaviors; and 3) conversational head generation, which integrates talking and listening abilities in a single interlocutor. Along with the datasets, we propose corresponding baseline solutions to the three tasks. Experimental results show that our baseline methods can generate responsive and vivid agents that collaborate with a real person to carry out a whole conversation. Project page: https://vico.solutions/.
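To make the three tasks concrete, the sketch below (Python, with hypothetical names; the paper does not specify this interface) shows how a turn-based conversational head pipeline might route each sentence-level turn to either a talking-head or a listening-head generator, conditioned on the real partner's observed behavior.

```python
# Illustrative only: hypothetical names, not the authors' ViCo codebase.
from dataclasses import dataclass
from typing import List, Literal

import numpy as np


@dataclass
class Turn:
    """One sentence-level turn in a multi-turn conversation."""
    role: Literal["speaker", "listener"]  # role of the synthesized agent this turn
    audio: np.ndarray                     # waveform of the current utterance
    partner_frames: np.ndarray            # observed frames/coefficients of the real partner


def generate_conversational_head(turns: List[Turn], talking_model, listening_model):
    """Sketch of the unified conversational head task: per turn, the agent either
    talks (aware of the listener's reactions) or listens (responding to the speaker)."""
    rendered = []
    for turn in turns:
        if turn.role == "speaker":
            # expressive talking head generation: conditioned on the agent's own
            # speech and the partner's (listener's) observed behavior
            coeffs = talking_model(turn.audio, turn.partner_frames)
        else:
            # responsive listening head generation: non-verbal feedback conditioned
            # on the partner's (speaker's) audio and visual behavior
            coeffs = listening_model(turn.audio, turn.partner_frames)
        rendered.append(coeffs)
    return rendered
```

In this reading, the third task (conversational head generation) is simply the composition of the first two across turns, with the same identity maintained whether the agent is speaking or listening.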

Authors (5)
  1. Mohan Zhou (9 papers)
  2. Yalong Bai (23 papers)
  3. Wei Zhang (1492 papers)
  4. Ting Yao (127 papers)
  5. Tiejun Zhao (70 papers)
Citations (2)
