
Unified speech and gesture synthesis using flow matching (2310.05181v2)

Published 8 Oct 2023 in eess.AS, cs.GR, cs.HC, cs.LG, and cs.SD

Abstract: As text-to-speech technologies achieve remarkable naturalness in read-aloud tasks, there is growing interest in multimodal synthesis of verbal and non-verbal communicative behaviour, such as spontaneous speech and associated body gestures. This paper presents a novel, unified architecture for jointly synthesising speech acoustics and skeleton-based 3D gesture motion from text, trained using optimal-transport conditional flow matching (OT-CFM). The proposed architecture is simpler than the previous state of the art, has a smaller memory footprint, and can capture the joint distribution of speech and gestures, generating both modalities together in a single process. The new training regime, meanwhile, enables better synthesis quality in far fewer steps (network evaluations) than before. Uni- and multimodal subjective tests demonstrate improved speech naturalness, gesture human-likeness, and cross-modal appropriateness compared to existing benchmarks. Please see https://shivammehta25.github.io/Match-TTSG/ for video examples and code.
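The training objective named in the abstract, optimal-transport conditional flow matching (OT-CFM, Lipman et al., 2023), and the few-step ODE sampling it enables are compact enough to sketch. The PyTorch sketch below is illustrative only, not the authors' implementation: the toy `VectorFieldNet`, the feature and conditioning dimensions, and the `sigma_min` value are assumptions, but the loss and the Euler sampler follow the standard OT-CFM recipe.

```python
# Minimal sketch of OT-CFM training and few-step Euler sampling.
# NOT the authors' code: the network, the feature/conditioning dimensions,
# and sigma_min are illustrative assumptions; see the project page above
# for the actual Match-TTSG model.

import torch
import torch.nn as nn


class VectorFieldNet(nn.Module):
    """Hypothetical stand-in for a decoder that predicts the flow's vector
    field over concatenated speech-acoustic + gesture-motion features,
    conditioned on text-derived features."""

    def __init__(self, feat_dim: int, cond_dim: int, hidden: int = 256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(feat_dim + cond_dim + 1, hidden),
            nn.SiLU(),
            nn.Linear(hidden, hidden),
            nn.SiLU(),
            nn.Linear(hidden, feat_dim),
        )

    def forward(self, x_t, t, cond):
        # Broadcast the scalar time t as one extra feature per frame.
        t_feat = t.expand(x_t.shape[:-1] + (1,))
        return self.net(torch.cat([x_t, cond, t_feat], dim=-1))


def ot_cfm_loss(model, x1, cond, sigma_min: float = 1e-4):
    """OT-CFM objective: regress the straight-line conditional vector field
    u_t(x | x1) = x1 - (1 - sigma_min) * x0 along the interpolation path
    x_t = (1 - (1 - sigma_min) * t) * x0 + t * x1."""
    x0 = torch.randn_like(x1)                              # noise endpoint
    t = torch.rand(x1.shape[0], 1, 1, device=x1.device)    # one t per sequence
    x_t = (1 - (1 - sigma_min) * t) * x0 + t * x1
    target = x1 - (1 - sigma_min) * x0
    return ((model(x_t, t, cond) - target) ** 2).mean()


@torch.no_grad()
def euler_sample(model, cond, feat_dim: int, n_steps: int = 10):
    """Few-step Euler integration of dx/dt = v_theta(x, t, cond), t: 0 -> 1."""
    batch, frames = cond.shape[0], cond.shape[1]
    x = torch.randn(batch, frames, feat_dim, device=cond.device)
    dt = 1.0 / n_steps
    for i in range(n_steps):
        t = torch.full((batch, 1, 1), i * dt, device=cond.device)
        x = x + dt * model(x, t, cond)
    return x  # joint acoustic + motion features, split downstream
```

Because the OT conditional paths are (near-)straight lines from noise to data, Euler integration remains accurate with only a handful of network evaluations, which is what underlies the abstract's claim of good synthesis quality in far fewer steps.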
