BDDM: Bilateral Denoising Diffusion Models for Fast and High-Quality Speech Synthesis

Published 25 Mar 2022 in eess.AS, cs.AI, cs.LG, cs.SD, and eess.SP | (2203.13508v1)

Abstract: Diffusion probabilistic models (DPMs) and their extensions have emerged as competitive generative models yet confront challenges of efficient sampling. We propose a new bilateral denoising diffusion model (BDDM) that parameterizes both the forward and reverse processes with a schedule network and a score network, which can train with a novel bilateral modeling objective. We show that the new surrogate objective can achieve a lower bound of the log marginal likelihood tighter than a conventional surrogate. We also find that BDDM allows inheriting pre-trained score network parameters from any DPMs and consequently enables speedy and stable learning of the schedule network and optimization of a noise schedule for sampling. Our experiments demonstrate that BDDMs can generate high-fidelity audio samples with as few as three sampling steps. Moreover, compared to other state-of-the-art diffusion-based neural vocoders, BDDMs produce comparable or higher quality samples indistinguishable from human speech, notably with only seven sampling steps (143x faster than WaveGrad and 28.6x faster than DiffWave). We release our code at https://github.com/tencent-ailab/bddm.

Summary

The paper introduces Bilateral Denoising Diffusion Models (BDDMs), which parameterize the forward and reverse processes with a schedule network and a score network and train them with a bilateral modeling objective.
BDDMs generate high-fidelity audio with as few as three sampling steps; at seven steps they are reported to be 143x faster than WaveGrad and 28.6x faster than DiffWave at comparable or higher quality.
The framework can inherit pre-trained score network parameters from any DPM, which makes learning the schedule network fast and stable, and the code is publicly released for reproducibility.

An Expert Review of the BDDM Framework for Speech Synthesis

The paper "BDDM: Bilateral Denoising Diffusion Models for Fast and High-Quality Speech Synthesis" presents a novel approach to generative models with a focus on improving speech synthesis through diffusion probabilistic models (DPMs). This work introduces Bilateral Denoising Diffusion Models (BDDMs), which promise enhanced sampling efficiency and quality in audio generation, specifically targeting neural vocoding tasks.

Core Contributions

BDDMs distinguish themselves by parameterizing both the forward and reverse processes using a schedule network and a score network. This bilateral modeling enables high-quality audio synthesis with significantly reduced sampling steps compared to traditional diffusion models. The main contributions include:

Bilateral Objective and Model Architecture: The paper proposes a bilateral framework in which the forward process (the noise schedule) and the reverse process (denoising) are both parameterized, by a schedule network and a score network respectively. It introduces a new bilateral modeling objective that can achieve a tighter lower bound on the log marginal likelihood than a conventional surrogate objective.
Innovative Training Approach: The model can inherit the pre-trained parameters of a score network from any existing DPM. This enables fast and stable learning of the schedule network and optimization of the noise schedule used for sampling, which is crucial for reducing generation latency.
Efficient Sampling: The experiments show that the model can produce high-fidelity audio samples with as few as three sampling steps. With seven sampling steps, BDDMs match or exceed the sound quality of state-of-the-art diffusion-based neural vocoders while sampling far faster (143x faster than WaveGrad and 28.6x faster than DiffWave).
Public Code Release: The authors have released their implementation at https://github.com/tencent-ailab/bddm, a commendable step towards reproducibility and further research.
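
To make the role of a short noise schedule concrete, the following is a minimal sketch of generic DDPM-style ancestral sampling driven by a list of per-step noise levels. It is not the BDDM algorithm or the authors' code: the schedule values in EXAMPLE_BETAS, the function names, and the zero-output stand-in for the score network are all hypothetical, chosen only to show that the number of score-network calls equals the length of the schedule.

```python
import math
import random

# Hypothetical 7-step noise schedule (illustrative values, not from the paper).
EXAMPLE_BETAS = [1e-4, 1e-3, 1e-2, 5e-2, 0.2, 0.5, 0.9]


def alpha_bars(betas):
    """Return the cumulative products of (1 - beta_t), one per step."""
    out, prod = [], 1.0
    for b in betas:
        prod *= 1.0 - b
        out.append(prod)
    return out


def forward_noise(x0, abar, rng):
    """Sample x_t ~ N(sqrt(abar) * x0, (1 - abar) * I); also return the noise used."""
    eps = [rng.gauss(0.0, 1.0) for _ in x0]
    xt = [math.sqrt(abar) * x + math.sqrt(1.0 - abar) * e for x, e in zip(x0, eps)]
    return xt, eps


def reverse_sample(predict_noise, betas, length, rng):
    """DDPM ancestral sampling: one predict_noise call per entry in betas."""
    abars = alpha_bars(betas)
    x = [rng.gauss(0.0, 1.0) for _ in range(length)]  # start from white noise
    for t in reversed(range(len(betas))):
        beta, abar = betas[t], abars[t]
        eps_hat = predict_noise(x, t)  # stand-in for a trained score network
        coef = beta / math.sqrt(1.0 - abar)
        mean = [(xi - coef * ei) / math.sqrt(1.0 - beta) for xi, ei in zip(x, eps_hat)]
        if t > 0:
            # Posterior standard deviation of the DDPM reverse step.
            sigma = math.sqrt(beta * (1.0 - abars[t - 1]) / (1.0 - abar))
            x = [m + sigma * rng.gauss(0.0, 1.0) for m in mean]
        else:
            x = mean  # no noise is added at the final step
    return x


def zero_predictor(x, t):
    """Placeholder predictor that always returns zeros (assumption for the demo)."""
    return [0.0] * len(x)
```

Calling reverse_sample(zero_predictor, EXAMPLE_BETAS, 16, random.Random(0)) runs exactly seven denoising steps. In BDDM the schedule is produced by the schedule network rather than fixed by hand, and the predictor is a trained score network conditioned on acoustic features.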

Numerical Results and Implications

The paper claims substantial improvements in sampling efficiency without compromising on sample quality. Specifically, BDDMs achieve high-fidelity outputs with only three sampling steps. In a broader comparison, BDDMs render audio indistinguishable from human speech using just seven sampling steps. These advancements underpin the potential for BDDMs to be utilized in real-time applications, addressing the common criticism of diffusion models regarding their speed.

Because each sampling step requires one pass through the score network, reducing the number of steps directly lowers the real-time factor, which makes BDDMs candidates for time-sensitive settings such as streaming services and live broadcasting. Beyond speed, the quality of the synthesized audio invites exploration in contexts other than speech, such as music and ambient sound generation.
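
The quoted speedups can be read as ratios of sampling-step counts. The short calculation below is a sketch under two assumptions that this review does not state: that the WaveGrad and DiffWave baselines use 1000 and 200 sampling steps, and that every step costs about the same. Those step counts are chosen because they reproduce the reported 143x and 28.6x figures at seven steps.

```python
def step_speedup(baseline_steps, fast_steps):
    """Speedup if runtime is proportional to the number of sampling steps."""
    return baseline_steps / fast_steps


WAVEGRAD_STEPS = 1000  # assumption: not stated in this review
DIFFWAVE_STEPS = 200   # assumption: not stated in this review
BDDM_STEPS = 7         # from the abstract

print(round(step_speedup(WAVEGRAD_STEPS, BDDM_STEPS)))     # 143
print(round(step_speedup(DIFFWAVE_STEPS, BDDM_STEPS), 1))  # 28.6
```

Measured wall-clock speedups can differ from these ratios when the compared models have different per-step costs.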

Theoretical and Practical Implications

On a theoretical level, a tighter lower bound on the log marginal likelihood means that maximizing the bilateral objective is a closer proxy for maximizing the likelihood itself than maximizing the conventional surrogate. This may open avenues for further refinement in probabilistic modeling.
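
The claim can be written schematically as a chain of inequalities. The notation below is an assumption made for illustration and is not copied from the paper: \mathcal{F}_{\text{bilateral}} stands for the new surrogate objective, and the conventional surrogate is taken here to be the standard evidence lower bound (ELBO) of a diffusion model with latents x_{1:T}.

```latex
\log p_\theta(x_0)
  \;\ge\; \mathcal{F}_{\text{bilateral}}
  \;\ge\; \mathcal{F}_{\text{elbo}}
  := \mathbb{E}_{q(x_{1:T}\mid x_0)}
     \left[ \log \frac{p_\theta(x_{0:T})}{q(x_{1:T}\mid x_0)} \right]
```

The outer inequality is the usual variational bound; the middle one is what "tighter" means, and the exact conditions under which it holds are given in the paper.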

Practically, the ability to reuse a pre-trained score network benefits the scalability and adaptability of BDDMs: training can concentrate on the schedule network, which the authors report to be speedy and stable. This shortens the path to new applications and may allow BDDMs to be adapted to diverse datasets with minimal retraining.

Prospective Developments

The exploration of BDDMs paves the way for further research into adaptive noise scheduling methods and their applications across different generative tasks. Future developments might focus on enhancing the versatility and scalability of BDDMs by integrating diverse architectures and exploring multi-modal data synthesis. Additionally, given the success in speech synthesis, extending the framework to other forms of sequential data could be a significant area of future investigation.

In conclusion, BDDMs stand as a robust solution to the prevailing bottlenecks of current diffusion models, particularly in speed and quality, marking a significant step forward in high-efficiency generative modeling for speech synthesis.
