DiffWave: A Versatile Diffusion Model for Audio Synthesis

Published 21 Sep 2020 in eess.AS, cs.CL, cs.LG, cs.SD, and stat.ML | (2009.09761v3)

Abstract: In this work, we propose DiffWave, a versatile diffusion probabilistic model for conditional and unconditional waveform generation. The model is non-autoregressive, and converts the white noise signal into structured waveform through a Markov chain with a constant number of steps at synthesis. It is efficiently trained by optimizing a variant of variational bound on the data likelihood. DiffWave produces high-fidelity audios in different waveform generation tasks, including neural vocoding conditioned on mel spectrogram, class-conditional generation, and unconditional generation. We demonstrate that DiffWave matches a strong WaveNet vocoder in terms of speech quality (MOS: 4.44 versus 4.43), while synthesizing orders of magnitude faster. In particular, it significantly outperforms autoregressive and GAN-based waveform models in the challenging unconditional generation task in terms of audio quality and sample diversity from various automatic and human evaluations.

Summary

The paper introduces DiffWave, a non-autoregressive diffusion model that synthesizes high-quality audio efficiently and outperforms autoregressive and GAN-based models in unconditional generation.
It leverages a bidirectional dilated convolution architecture with diffusion-step embeddings to transform noise into structured waveforms.
Experimental results demonstrate competitive Mean Opinion Scores and improved diversity in both neural vocoding and unconditional generation tasks.

DiffWave: A Versatile Diffusion Model for Audio Synthesis

This essay provides an examination of "DiffWave," a novel diffusion probabilistic model for audio synthesis presented by Zhifeng Kong, Wei Ping, Jiaji Huang, Kexin Zhao, and Bryan Catanzaro. The paper introduces a versatile approach for both conditional and unconditional waveform generation, leveraging diffusion models to address inherent challenges in speech synthesis.

Introduction

The paper situates DiffWave within the broader context of deep generative models for high-fidelity audio synthesis. Prior work relied mainly on likelihood-based models, such as autoregressive models (e.g., WaveNet) and flow-based models (e.g., WaveGlow, FloWaveNet). These approaches struggle in particular with unconditional audio generation, where autoregressive models often produce low-quality samples.

Diffusion probabilistic models, which use a Markov chain to iteratively transform a simple Gaussian distribution into a complex data distribution, present a promising alternative. The authors propose DiffWave, a non-autoregressive model trained by optimizing a variant of the variational bound on the data likelihood, to achieve efficient and high-quality waveform generation.

Methodology

DiffWave converts a white noise signal into a structured waveform over a fixed number of synthesis steps. Two processes are involved: a diffusion process, which progressively adds noise to the data, and a reverse process, which removes noise step by step to reconstruct the waveform. Because each reverse step updates every sample of the waveform at once, synthesis can be parallelized across time, unlike in autoregressive models, which generate one sample at a time.
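
The two processes can be sketched in a few lines of NumPy. This is a minimal illustration in the standard diffusion-model formulation, not the authors' code: the function names, the linear schedule endpoints, and the `eps_model` callable (a stand-in for the trained network) are all assumptions made for the example.

```python
import numpy as np

def linear_beta_schedule(num_steps, beta_start=1e-4, beta_end=0.05):
    """Noise variances beta_1..beta_T; the endpoints here are illustrative."""
    return np.linspace(beta_start, beta_end, num_steps)

def q_sample(x0, t, alpha_bar, eps):
    """Diffusion process in closed form: noisy x_t given the clean x_0."""
    return np.sqrt(alpha_bar[t]) * x0 + np.sqrt(1.0 - alpha_bar[t]) * eps

def p_sample_step(x_t, t, betas, alpha_bar, eps_hat, rng):
    """One reverse step x_t -> x_{t-1}, given a noise estimate eps_hat."""
    alpha_t = 1.0 - betas[t]
    mean = (x_t - betas[t] / np.sqrt(1.0 - alpha_bar[t]) * eps_hat) / np.sqrt(alpha_t)
    if t == 0:
        return mean  # the final step adds no noise
    # Variance of x_{t-1} given x_t and x_0 under the diffusion process.
    var = (1.0 - alpha_bar[t - 1]) / (1.0 - alpha_bar[t]) * betas[t]
    return mean + np.sqrt(var) * rng.standard_normal(x_t.shape)

def sample(eps_model, length, betas, rng):
    """Run the full reverse chain; eps_model(x, t) is a hypothetical denoiser."""
    alpha_bar = np.cumprod(1.0 - betas)
    x = rng.standard_normal(length)            # start from white noise
    for t in reversed(range(len(betas))):      # constant number of steps
        x = p_sample_step(x, t, betas, alpha_bar, eps_model(x, t), rng)
    return x
```

Note that every call to `p_sample_step` operates on the whole array, which is where the parallelism over time comes from; only the loop over diffusion steps is sequential.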

Diffusion Probabilistic Models

The authors detail the theoretical framework underpinning diffusion probabilistic models. The diffusion process is fixed and has no learnable parameters, so training needs no auxiliary network such as a GAN discriminator or a VAE encoder, which avoids the instability of jointly training two networks. The reverse process maps noise back to the data distribution with a parameterized network, trained by maximizing a variant of the evidence lower bound (ELBO).
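
For reference, the framework can be written out in the notation commonly used for diffusion models. The symbols below (the noise schedule \beta_t, its cumulative product \bar{\alpha}_t, and the noise-prediction network \epsilon_\theta) are not defined in this summary and are introduced here as assumptions; the last line is the unweighted form of the training objective.

```latex
% Diffusion (forward) process: fixed, no learnable parameters
q(x_t \mid x_{t-1}) = \mathcal{N}\!\left(x_t;\ \sqrt{1-\beta_t}\,x_{t-1},\ \beta_t I\right),
\qquad
q(x_t \mid x_0) = \mathcal{N}\!\left(x_t;\ \sqrt{\bar{\alpha}_t}\,x_0,\ (1-\bar{\alpha}_t) I\right),
\qquad
\bar{\alpha}_t = \prod_{s=1}^{t} (1-\beta_s)

% Reverse process: parameterized by a network \epsilon_\theta
p_\theta(x_{t-1} \mid x_t) = \mathcal{N}\!\left(x_{t-1};\ \mu_\theta(x_t, t),\ \sigma_t^2 I\right),
\qquad
\mu_\theta(x_t, t) = \frac{1}{\sqrt{1-\beta_t}}
  \left(x_t - \frac{\beta_t}{\sqrt{1-\bar{\alpha}_t}}\,\epsilon_\theta(x_t, t)\right)

% Training objective (unweighted variant of the variational bound)
\min_\theta\ \mathbb{E}_{x_0,\,\epsilon \sim \mathcal{N}(0, I),\,t}
  \left\| \epsilon - \epsilon_\theta\!\left(\sqrt{\bar{\alpha}_t}\,x_0 + \sqrt{1-\bar{\alpha}_t}\,\epsilon,\ t\right) \right\|_2^2
```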

Architecture

DiffWave adopts a feed-forward, bidirectional dilated convolution architecture inspired by WaveNet but without its autoregressive constraints. The model is composed of multiple residual layers, each incorporating diffusion-step embeddings to ensure the network can adaptively process varying levels of noise.
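
A rough NumPy sketch of one such residual layer follows. It is a simplified illustration rather than the paper's exact layer: the embedding size, the single linear projection `w_t`, the kernel size of 3, and all parameter names are assumptions. The point to notice is that the dilated convolution reads both past and future samples, which is what "bidirectional" means here.

```python
import numpy as np

def step_embedding(t, dim=128):
    """Sinusoidal encoding of the diffusion step t (dim is an assumed size)."""
    half = dim // 2
    freqs = 10.0 ** (np.arange(half) * 4.0 / (half - 1))
    return np.concatenate([np.sin(freqs * t), np.cos(freqs * t)])

def dilated_conv1d(x, w, dilation):
    """Non-causal dilated convolution. x: (C_in, L), w: (C_out, C_in, K)."""
    c_out, _, k = w.shape
    pad = dilation * (k // 2)
    xp = np.pad(x, ((0, 0), (pad, pad)))       # zero-pad both ends
    length = x.shape[1]
    out = np.zeros((c_out, length))
    for j in range(k):                          # taps at i-d, i, i+d for K=3
        out += w[:, :, j] @ xp[:, j * dilation : j * dilation + length]
    return out

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def residual_layer(x, t_emb, p, dilation):
    """Returns (input to the next layer, skip connection). x: (C, L)."""
    c = x.shape[0]
    h = x + (p["w_t"] @ t_emb)[:, None]         # inject the diffusion step
    h = dilated_conv1d(h, p["w_dil"], dilation) # (2C, L)
    z = np.tanh(h[:c]) * sigmoid(h[c:])         # gated activation unit
    out = p["w_out"] @ z                        # 1x1 convolution, (2C, L)
    return x + out[:c], out[c:]

def init_params(c, dim, rng):
    """Small random weights, for demonstration only."""
    return {
        "w_t": 0.1 * rng.standard_normal((c, dim)),
        "w_dil": 0.1 * rng.standard_normal((2 * c, c, 3)),
        "w_out": 0.1 * rng.standard_normal((2 * c, c)),
    }
```

Stacking such layers with increasing dilation widens the receptive field in both directions, and summing the skip outputs gives the features from which the noise estimate is predicted.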

For conditional generation, such as neural vocoding, DiffWave employs upsampled mel spectrograms as local conditioners and global discrete labels when necessary. This flexibility in handling different types of conditional information underpins much of DiffWave’s versatility.
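
The two kinds of conditioning can be illustrated as bias terms added to a layer's hidden features. The sketch below is an assumption-laden simplification: frame repetition stands in for the learned upsampling of the mel spectrogram, and the hop length of 256, the function names, and the weight shapes are chosen only for the example.

```python
import numpy as np

def upsample_mel(mel, hop_length=256):
    """Repeat each frame hop_length times; a stand-in for learned upsampling."""
    return np.repeat(mel, hop_length, axis=1)

def add_local_condition(h, mel_up, w_cond):
    """Add a per-sample bias: a 1x1 projection of the aligned mel features.

    h: (C, L) hidden features, mel_up: (n_mels, L), w_cond: (C, n_mels).
    """
    return h + w_cond @ mel_up

def add_global_condition(h, label, label_table):
    """Add a per-channel bias looked up from a discrete label embedding.

    label_table: (num_classes, C); the same bias applies at every time step.
    """
    return h + label_table[label][:, None]
```

A local conditioner varies along the time axis and steers the content sample by sample, while a global conditioner such as a class label shifts the features uniformly over the whole utterance.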

Experimental Evaluation

The paper benchmarks DiffWave against state-of-the-art models on several tasks:

Neural Vocoding

Using the LJ Speech dataset, the authors compare DiffWave with WaveNet, ClariNet, WaveFlow, and WaveGlow. Evaluated by Mean Opinion Score (MOS), DiffWave matches a strong WaveNet vocoder in speech quality (4.44 versus 4.43) while synthesizing orders of magnitude faster.

Unconditional Generation

On the SC09 dataset, DiffWave significantly outperforms autoregressive models (e.g., WaveNet) and GAN-based models (e.g., WaveGAN) in both sample diversity and audio quality. Automatic evaluation metrics such as the Fréchet Inception Distance (FID), Inception Score (IS), and AM score corroborate the human evaluations, highlighting DiffWave's ability to capture complex data variations without conditional inputs.
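
As an example of how such automatic metrics work, the Inception Score can be computed from the class probabilities that a pretrained classifier assigns to generated samples. The sketch below uses the standard definition of the score; the function name and the `eps` smoothing constant are assumptions, and the classifier itself is outside the scope of the example.

```python
import numpy as np

def inception_score(probs, eps=1e-12):
    """Inception Score from classifier outputs.

    probs: (num_samples, num_classes), each row a distribution p(y | x).
    The score is exp of the mean KL divergence between p(y | x) and the
    marginal p(y); it is high when each sample is classified confidently
    and the classes are covered evenly.
    """
    probs = np.asarray(probs, dtype=float)
    marginal = probs.mean(axis=0, keepdims=True)
    kl = np.sum(probs * (np.log(probs + eps) - np.log(marginal + eps)), axis=1)
    return float(np.exp(kl.mean()))
```

With ten classes the score ranges from 1, when every sample receives the same prediction, to 10, when samples are classified with full confidence and spread evenly over the classes.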

Class-Conditional Generation

DiffWave also excels in class-conditional generation on the SC09 dataset. Compared with autoregressive models, it achieves higher classification accuracy and greater within-class diversity, as measured by the modified Inception Score (mIS).

Additional Experiments

Further experiments illustrate DiffWave's potential in zero-shot speech denoising and latent space interpolation, underscoring the model's robustness and adaptability to diverse audio synthesis tasks.

Implications and Future Work

The paper articulates several key implications:

Parallel Synthesis: DiffWave’s non-autoregressive nature enables efficient parallel synthesis, making it viable for real-time applications.
Flexibility: The model’s ability to handle both conditional and unconditional tasks without architectural changes positions it as a versatile tool in the field of audio synthesis.
Scalability: Future work could improve inference speed, for example by using fewer diffusion steps or by exploring hardware-specific optimizations such as persistent kernels on GPUs.

Conclusion

DiffWave shows that diffusion probabilistic models, combined with an efficient non-autoregressive architecture, can deliver high-quality audio across neural vocoding, class-conditional generation, and unconditional generation. Its combination of a principled training objective and strong empirical performance points to a promising direction for future research in speech synthesis and related domains.
