
ConsistencyTTA: Accelerating Diffusion-Based Text-to-Audio Generation with Consistency Distillation (2309.10740v3)

Published 19 Sep 2023 in cs.SD, cs.LG, cs.MM, and eess.AS

Abstract: Diffusion models are instrumental in text-to-audio (TTA) generation. Unfortunately, they suffer from slow inference due to an excessive number of queries to the underlying denoising network per generation. To address this bottleneck, we introduce ConsistencyTTA, a framework requiring only a single non-autoregressive network query, thereby accelerating TTA by hundreds of times. We achieve this by proposing the "CFG-aware latent consistency model," which adapts consistency generation to a latent space and incorporates classifier-free guidance (CFG) into model training. Moreover, unlike diffusion models, ConsistencyTTA can be finetuned closed-loop with audio-space text-aware metrics, such as CLAP score, to further enhance the generations. Our objective and subjective evaluations on the AudioCaps dataset show that, compared to diffusion-based counterparts, ConsistencyTTA reduces inference computation by 400x while retaining generation quality and diversity.
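The single-query idea in the abstract can be sketched as follows. A distilled consistency model maps an initial noise latent directly to a clean latent in one network call, and because the CFG weight is passed to the network as an input during distillation, guided sampling no longer requires two (conditional and unconditional) queries per step. All function names and signatures below are hypothetical illustrations, not the paper's actual implementation:

```python
import numpy as np

def consistency_single_step(model, z_T, text_emb, guidance_w, sigma_max=80.0):
    """One non-autoregressive network query: map the noise latent z_T
    directly to a clean latent. The guidance weight guidance_w is a model
    *input*, so classifier-free guidance is baked into the distilled
    network instead of doubling the number of queries at inference.
    (Hypothetical interface for illustration only.)"""
    return model(z_T, sigma_max, text_emb, guidance_w)

def toy_model(z, sigma, text_emb, w):
    # Toy stand-in for the distilled network: shrinks the noisy latent
    # toward zero; a real model would predict the clean audio latent.
    return z / (1.0 + sigma)

rng = np.random.default_rng(0)
z_T = rng.standard_normal(4)                      # initial noise latent
latent = consistency_single_step(toy_model, z_T,
                                 text_emb=None, guidance_w=3.0)
```

In contrast, a standard guided diffusion sampler would call the denoising network twice per step over tens or hundreds of steps, which is the 400x computation gap the abstract refers to.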

Authors (5)
  1. Yatong Bai
  2. Trung Dang
  3. Dung Tran
  4. Kazuhito Koishida
  5. Somayeh Sojoudi
Citations (14)
