FastDiff: A Fast Conditional Diffusion Model for High-Quality Speech Synthesis

Published 21 Apr 2022 in eess.AS, cs.LG, and cs.SD | (2204.09934v1)

Abstract: Denoising diffusion probabilistic models (DDPMs) have recently achieved leading performances in many generative tasks. However, the inherited iterative sampling process costs hindered their applications to speech synthesis. This paper proposes FastDiff, a fast conditional diffusion model for high-quality speech synthesis. FastDiff employs a stack of time-aware location-variable convolutions of diverse receptive field patterns to efficiently model long-term time dependencies with adaptive conditions. A noise schedule predictor is also adopted to reduce the sampling steps without sacrificing the generation quality. Based on FastDiff, we design an end-to-end text-to-speech synthesizer, FastDiff-TTS, which generates high-fidelity speech waveforms without any intermediate feature (e.g., Mel-spectrogram). Our evaluation of FastDiff demonstrates the state-of-the-art results with higher-quality (MOS 4.28) speech samples. Also, FastDiff enables a sampling speed of 58x faster than real-time on a V100 GPU, making diffusion models practically applicable to speech synthesis deployment for the first time. We further show that FastDiff generalized well to the mel-spectrogram inversion of unseen speakers, and FastDiff-TTS outperformed other competing methods in end-to-end text-to-speech synthesis. Audio samples are available at https://FastDiff.github.io/.


Authors (7)

Citations (144)


Summary

The paper introduces FastDiff, a conditional diffusion model that uses time-aware location-variable convolutions and a noise schedule predictor to reduce the number of sampling steps while maintaining quality.
The FastDiff-TTS synthesizer directly generates high-fidelity speech waveforms without relying on intermediate representations like Mel-spectrograms.
The model reaches a MOS of 4.28, samples 58x faster than real time on an NVIDIA V100 GPU, and generalizes well to mel-spectrogram inversion for unseen speakers.
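
To make the 58x figure concrete, the arithmetic below converts a speedup over real time into wall-clock synthesis time and a real-time factor. The function names are illustrative and not from the paper; the only number taken from the source is the 58x speedup.

```python
def synthesis_time_seconds(audio_seconds: float, speedup_over_real_time: float) -> float:
    """Wall-clock time needed to synthesize `audio_seconds` of audio."""
    if speedup_over_real_time <= 0:
        raise ValueError("speedup must be positive")
    return audio_seconds / speedup_over_real_time


def real_time_factor(speedup_over_real_time: float) -> float:
    """Seconds of compute per second of audio (lower is faster)."""
    if speedup_over_real_time <= 0:
        raise ValueError("speedup must be positive")
    return 1.0 / speedup_over_real_time


if __name__ == "__main__":
    # 10 seconds of speech at the reported 58x speedup.
    seconds = synthesis_time_seconds(10.0, 58.0)
    print(f"{seconds * 1000:.0f} ms of compute, RTF = {real_time_factor(58.0):.4f}")
```

At 58x, ten seconds of speech would take roughly 0.17 seconds to generate on the reported hardware.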

The paper "FastDiff: A Fast Conditional Diffusion Model for High-Quality Speech Synthesis" addresses the main obstacle to using denoising diffusion probabilistic models (DDPMs) for speech synthesis: sampling cost. DDPMs have been successful in many generative tasks, but they produce a sample iteratively, running the denoising network once per sampling step. That cost limits their practicality for latency-sensitive applications such as speech synthesis.
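
The cost described above can be seen in a minimal sketch of the standard DDPM ancestral sampler, in which the network is called once per step. This is the generic textbook procedure, not FastDiff's implementation. `CountingDenoiser` is a hypothetical placeholder for a trained network, the linear beta schedule is an assumption, and the 4-step schedule is an arbitrary short example rather than a predicted one.

```python
import math
import random


def linear_beta_schedule(num_steps, beta_start=1e-4, beta_end=0.02):
    """Evenly spaced noise levels (an assumed schedule, for illustration)."""
    if num_steps == 1:
        return [beta_end]
    step = (beta_end - beta_start) / (num_steps - 1)
    return [beta_start + i * step for i in range(num_steps)]


class CountingDenoiser:
    """Hypothetical stand-in for a trained noise-prediction network."""

    def __init__(self):
        self.calls = 0

    def __call__(self, x, t):
        self.calls += 1
        return [0.0 for _ in x]  # dummy noise estimate


def ddpm_sample(denoiser, betas, length, rng):
    """Standard DDPM reverse process: one denoiser call per schedule entry."""
    alphas = [1.0 - b for b in betas]
    alpha_bars, running = [], 1.0
    for a in alphas:
        running *= a
        alpha_bars.append(running)

    x = [rng.gauss(0.0, 1.0) for _ in range(length)]  # start from pure noise
    for t in reversed(range(len(betas))):
        eps = denoiser(x, t)
        coef = betas[t] / math.sqrt(1.0 - alpha_bars[t])
        mean = [(xi - coef * ei) / math.sqrt(alphas[t]) for xi, ei in zip(x, eps)]
        if t > 0:
            sigma = math.sqrt(betas[t])
            x = [m + sigma * rng.gauss(0.0, 1.0) for m in mean]
        else:
            x = mean  # no noise is added at the final step
    return x


if __name__ == "__main__":
    for steps in (1000, 4):
        net = CountingDenoiser()
        ddpm_sample(net, linear_beta_schedule(steps), length=8, rng=random.Random(0))
        print(steps, "steps ->", net.calls, "network calls")
```

Because network calls scale linearly with schedule length, shortening the schedule is the most direct lever on sampling speed, provided quality can be preserved with fewer steps.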

Key Contributions:

FastDiff Architecture:
- The authors introduce FastDiff, a model built from a stack of time-aware location-variable convolutions. These convolutions use diverse receptive field patterns to model long-term time dependencies efficiently while adapting to the conditioning input.
Noise Schedule Predictor:
- FastDiff also adopts a noise schedule predictor, which reduces the number of sampling steps without sacrificing the quality of the generated audio.
FastDiff-TTS:
- Based on FastDiff, the authors design FastDiff-TTS, an end-to-end text-to-speech synthesizer. Unlike cascaded pipelines that first predict a Mel-spectrogram and then convert it to audio, it generates high-fidelity speech waveforms directly, without any intermediate feature.
Performance and Evaluation:
- FastDiff reaches a Mean Opinion Score (MOS) of 4.28 for speech quality and samples 58 times faster than real time on an NVIDIA V100 GPU. According to the authors, this makes diffusion models practically applicable to speech synthesis deployment for the first time.
Generalization and Competitiveness:
- FastDiff generalizes well to Mel-spectrogram inversion for unseen speakers. In end-to-end text-to-speech synthesis, FastDiff-TTS outperforms the competing methods it was evaluated against.
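
The idea behind a time-aware location-variable convolution can be sketched in a few lines. This is a single-channel toy, not the paper's architecture. It assumes that "location-variable" means each conditioning frame gets its own kernel for the audio segment it covers, and that "time-aware" means the kernel also depends on the diffusion step. `predict_kernel` is a hypothetical stand-in for a learned kernel predictor, and all names and parameters are illustrative.

```python
import math


def predict_kernel(cond_value, step, kernel_size):
    """Hypothetical kernel predictor: a fixed function of the conditioning
    value and the diffusion step, standing in for a learned network."""
    raw = [math.sin(cond_value * (k + 1) + 0.1 * step) for k in range(kernel_size)]
    norm = sum(abs(r) for r in raw) or 1.0
    return [r / norm for r in raw]


def location_variable_conv(x, cond, step, hop, kernel_size=3, dilation=1):
    """Convolve each `hop`-sample segment of `x` with its own kernel.

    x    : waveform samples; length must equal len(cond) * hop
    cond : one conditioning value per segment (e.g. one per mel frame)
    step : diffusion step index, which makes the kernels time-aware
    """
    if kernel_size % 2 == 0:
        raise ValueError("kernel_size must be odd in this sketch")
    if len(x) != len(cond) * hop:
        raise ValueError("len(x) must equal len(cond) * hop")

    pad = dilation * (kernel_size - 1) // 2  # zero-pad to keep the length
    padded = [0.0] * pad + list(x) + [0.0] * pad

    out = []
    for seg, c in enumerate(cond):
        kernel = predict_kernel(c, step, kernel_size)  # differs per segment
        for i in range(seg * hop, (seg + 1) * hop):
            out.append(sum(kernel[k] * padded[i + k * dilation]
                           for k in range(kernel_size)))
    return out


if __name__ == "__main__":
    x = [float(i) for i in range(8)]
    print(location_variable_conv(x, cond=[0.5, 2.0], step=0, hop=4))
    print(location_variable_conv(x, cond=[0.5, 2.0], step=5, hop=4))
```

Stacking such layers with different `dilation` values is one way to obtain varied receptive fields. How the actual model arranges its layers is not specified in this summary.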

Overall, the paper removes the main speed barrier that has kept diffusion models out of practical speech synthesis, opening the way to their deployment. Audio samples are available at https://FastDiff.github.io/.
