On the Semantic Latent Space of Diffusion-Based Text-to-Speech Models (2402.12423v2)

Published 19 Feb 2024 in cs.SD, cs.CL, cs.LG, and eess.AS

Abstract: The incorporation of Denoising Diffusion Models (DDMs) in the Text-to-Speech (TTS) domain is rising, providing great value in synthesizing high quality speech. Although they exhibit impressive audio quality, the extent of their semantic capabilities is unknown, and controlling their synthesized speech's vocal properties remains a challenge. Inspired by recent advances in image synthesis, we explore the latent space of frozen TTS models, which is composed of the latent bottleneck activations of the DDM's denoiser. We identify that this space contains rich semantic information, and outline several novel methods for finding semantic directions within it, both supervised and unsupervised. We then demonstrate how these enable off-the-shelf audio editing, without any further training, architectural changes or data requirements. We present evidence of the semantic and acoustic qualities of the edited audio, and provide supplemental samples: https://latent-analysis-grad-tts.github.io/speech-samples/.
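The abstract's core idea — finding semantic directions in the denoiser's latent bottleneck ("h-space") and shifting activations along them to edit vocal attributes — can be sketched in a few lines. The snippet below is a hedged illustration, not the paper's actual code: the array shapes, group labels, and `alpha` strength parameter are all assumptions, and random vectors stand in for bottleneck activations that would really be collected from a frozen TTS denoiser.

```python
import numpy as np

# Illustrative sketch of supervised semantic-direction finding: estimate a
# direction for some vocal attribute as the normalized difference between the
# mean bottleneck activations of two labeled groups of utterances, then edit
# an activation by moving it along that direction. Shapes and data are
# stand-ins, not the paper's real setup.

rng = np.random.default_rng(0)
dim = 16  # assumed flattened bottleneck dimension

# Stand-ins for activations gathered from the denoiser's latent bottleneck.
h_group_a = rng.normal(loc=0.0, scale=1.0, size=(32, dim))  # e.g. "low pitch"
h_group_b = rng.normal(loc=0.5, scale=1.0, size=(32, dim))  # e.g. "high pitch"

def semantic_direction(h_a: np.ndarray, h_b: np.ndarray) -> np.ndarray:
    """Supervised direction: normalized difference of the two group means."""
    d = h_b.mean(axis=0) - h_a.mean(axis=0)
    return d / np.linalg.norm(d)

def edit(h: np.ndarray, direction: np.ndarray, alpha: float) -> np.ndarray:
    """Shift an activation along the direction; alpha controls edit strength."""
    return h + alpha * direction

d = semantic_direction(h_group_a, h_group_b)
h = h_group_a[0]
h_edited = edit(h, d, alpha=3.0)

# Moving along +d increases the attribute score (projection onto d).
print(float(h @ d), float(h_edited @ d))
```

In the paper's setting the edited activation would be fed back into the frozen denoiser during sampling, requiring no retraining; the unsupervised variants mentioned in the abstract would instead derive `d` from the activations' own statistics (e.g., principal components) rather than from labels.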

