Freetalker: Controllable Speech and Text-Driven Gesture Generation Based on Diffusion Models for Enhanced Speaker Naturalness (2401.03476v1)

Published 7 Jan 2024 in cs.MM, cs.AI, cs.HC, cs.SD, and eess.AS

Abstract: Current talking avatars mostly generate co-speech gestures from the audio and text of an utterance, without considering the non-speaking motion of the speaker. Furthermore, previous works on co-speech gesture generation have designed network structures around individual gesture datasets, which results in limited data volume, compromised generalizability, and restricted speaker movements. To tackle these issues, we introduce FreeTalker, which, to the best of our knowledge, is the first framework for generating both spontaneous (e.g., co-speech gestures) and non-spontaneous (e.g., moving around the podium) speaker motions. Specifically, we train a diffusion-based model for speaker motion generation that employs unified representations of both speech-driven gestures and text-driven motions, utilizing heterogeneous data sourced from various motion datasets. During inference, we use classifier-free guidance to flexibly control the style of the generated clips. Additionally, to create smooth transitions between clips, we apply DoubleTake, a method that leverages a generative prior to ensure seamless motion blending. Extensive experiments show that our method generates natural and controllable speaker movements. Our code, model, and demo are available at \url{https://youngseng.github.io/FreeTalker/}.
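The abstract names classifier-free guidance as the mechanism for controlling style at inference time. The sketch below illustrates the core of that technique, a guided denoising step that blends conditional and unconditional predictions; the `toy_denoiser`, tensor shapes, and guidance weight `w` are illustrative assumptions, not the paper's actual model or hyperparameters.

```python
import torch

def cfg_denoise(denoiser, x_t, t, cond, w=2.5):
    # Classifier-free guidance: run the denoiser with and without the
    # condition, then extrapolate toward the conditional prediction.
    # w > 1 strengthens conditioning; w = 1 recovers the plain
    # conditional model; w = 0 is fully unconditional.
    pred_cond = denoiser(x_t, t, cond)
    pred_uncond = denoiser(x_t, t, None)  # None stands in for the null token
    return pred_uncond + w * (pred_cond - pred_uncond)

# Toy stand-in for a motion diffusion denoiser (hypothetical): it just
# mixes the noisy motion with the optional conditioning features.
def toy_denoiser(x_t, t, cond):
    out = 0.9 * x_t
    if cond is not None:
        out = out + 0.1 * cond
    return out

x_t = torch.randn(2, 80, 135)   # (batch, frames, pose dims) -- illustrative
cond = torch.randn(2, 80, 135)  # speech/text features projected to pose dims
guided = cfg_denoise(toy_denoiser, x_t, t=500, cond=cond, w=2.5)
print(guided.shape)  # torch.Size([2, 80, 135])
```

In a setup like this, the guidance weight `w` would act as the style-control knob the abstract alludes to: larger values push the generated motion harder toward the speech- or text-derived condition.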
