ClariNet: Parallel Wave Generation in End-to-End Text-to-Speech (1807.07281v3)

Published 19 Jul 2018 in cs.CL, cs.AI, cs.LG, cs.SD, and eess.AS

Abstract: In this work, we propose a new solution for parallel wave generation by WaveNet. In contrast to parallel WaveNet (van den Oord et al., 2018), we distill a Gaussian inverse autoregressive flow from the autoregressive WaveNet by minimizing a regularized KL divergence between their highly-peaked output distributions. Our method computes the KL divergence in closed-form, which simplifies the training algorithm and provides very efficient distillation. In addition, we introduce the first text-to-wave neural architecture for speech synthesis, which is fully convolutional and enables fast end-to-end training from scratch. It significantly outperforms the previous pipeline that connects a text-to-spectrogram model to a separately trained WaveNet (Ping et al., 2018). We also successfully distill a parallel waveform synthesizer conditioned on the hidden representation in this end-to-end model.

Summary

The paper introduces a Gaussian output distribution that models raw waveforms in WaveNet through maximum likelihood training.
It presents a distillation framework that minimizes a regularized KL divergence, computed in closed form, so that a Gaussian IAF student can be trained stably and efficiently.
The end-to-end parallel architecture enables high-quality text-to-speech synthesis with performance comparable to autoregressive vocoders.

Overview of ClariNet: Parallel Wave Generation for Text-to-Speech

ClariNet proposes a new approach to parallel waveform generation for text-to-speech (TTS) synthesis that differs from its predecessor, Parallel WaveNet. A Gaussian inverse autoregressive flow (IAF) student is distilled from an autoregressive WaveNet teacher by minimizing a regularized KL divergence between their output distributions. Because both distributions are Gaussian, the KL divergence can be computed in closed form, which simplifies training and makes distillation efficient.
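To make the parallel-generation idea concrete, the following is a minimal sketch of a single Gaussian IAF step in NumPy. It is not the paper's network: `toy_shift_and_scale` is a hypothetical stand-in for the student model, and its coefficients are arbitrary. The only property it preserves is the one that matters here: the shift and log-scale at step t depend on noise from earlier steps only, so every output sample can be computed at once.

```python
import numpy as np


def toy_shift_and_scale(z):
    """Hypothetical stand-in for the student network.

    Returns mu_t and log_sigma_t that depend only on z_{<t}
    (here, only on z_{t-1}, via a causal shift by one step).
    """
    z_prev = np.concatenate([[0.0], z[:-1]])  # position t sees z_{t-1}
    mu = 0.5 * z_prev                          # arbitrary toy coefficients
    log_sigma = -1.0 + 0.1 * np.tanh(z_prev)
    return mu, log_sigma


def gaussian_iaf_sample(z):
    """Transform white noise z into samples x in one parallel pass.

    x_t = z_t * sigma_t + mu_t, so x_t given z_{<t} is Gaussian with
    mean mu_t and standard deviation sigma_t.
    """
    mu, log_sigma = toy_shift_and_scale(z)
    x = z * np.exp(log_sigma) + mu
    return x, mu, log_sigma


rng = np.random.default_rng(0)
z = rng.standard_normal(1000)          # 1000 noise samples (arbitrary length)
x, mu, log_sigma = gaussian_iaf_sample(z)
```

No loop over time steps is needed at synthesis, which is what distinguishes this from sampling an autoregressive model one step at a time. Because each x_t is conditionally Gaussian, its distribution can be compared with a Gaussian teacher's prediction analytically.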

Contributions and Architecture

The paper makes four main contributions:

Gaussian Output Distribution in WaveNet: The paper shows that a single variance-bounded Gaussian is sufficient for modeling raw waveforms in WaveNet, in contrast to the commonly used mixture models. The Gaussian WaveNet is trained directly by maximum likelihood estimation (MLE) rather than with a quantized surrogate loss.
Efficient Distillation Process: The Gaussian IAF is distilled from the autoregressive WaveNet by minimizing a regularized KL divergence that is computed in closed form, which makes distillation stable and efficient. This avoids the instabilities often encountered with the Monte Carlo approximations used in previous models.
End-to-End Text-to-Wave Architecture: ClariNet introduces the first text-to-wave neural architecture for TTS, removing the need for a separately trained waveform synthesis stage. The fully convolutional design enables fast end-to-end training from scratch and synthesizes waveforms directly from text.
Parallel Neural Vocoder: A parallel waveform synthesizer, conditioned on the hidden representation of the end-to-end model, is also distilled successfully. This parallel vocoder performs on par with an autoregressive vocoder.
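The first two contributions can be illustrated with a short NumPy sketch. The negative log-likelihood and the KL divergence below are the standard formulas for univariate Gaussians. The clipping bound `LOG_SCALE_MIN`, the squared log-scale penalty, and its weight `lam` are illustrative assumptions, since this summary does not state the exact values or the form of the regularizer.

```python
import math

import numpy as np

LOG_SCALE_MIN = -7.0  # hypothetical lower bound on log(sigma); keeps variance bounded


def gaussian_nll(x, mu, log_sigma, log_scale_min=LOG_SCALE_MIN):
    """Per-sample negative log-likelihood of x under N(mu, sigma^2).

    The log-scale is clipped from below so the loss cannot diverge
    when the predicted variance collapses toward zero.
    """
    log_sigma = np.maximum(log_sigma, log_scale_min)
    z = (x - mu) / np.exp(log_sigma)
    return 0.5 * math.log(2.0 * math.pi) + log_sigma + 0.5 * z ** 2


def gaussian_kl(mu_q, log_sigma_q, mu_p, log_sigma_p):
    """Closed-form KL(q || p) between two univariate Gaussians."""
    var_q = np.exp(2.0 * log_sigma_q)
    var_p = np.exp(2.0 * log_sigma_p)
    return (log_sigma_p - log_sigma_q
            + (var_q - var_p + (mu_p - mu_q) ** 2) / (2.0 * var_p))


def regularized_kl(mu_q, log_sigma_q, mu_p, log_sigma_p, lam=4.0):
    """KL plus an assumed penalty on the log-scale mismatch.

    lam is a hypothetical weight chosen for illustration.
    """
    penalty = lam * (log_sigma_p - log_sigma_q) ** 2
    return penalty + gaussian_kl(mu_q, log_sigma_q, mu_p, log_sigma_p)
```

In a distillation loop, `mu_q` and `log_sigma_q` would come from the student and `mu_p` and `log_sigma_p` from the teacher at each time step, and the loss would be averaged over time steps. No sampling is required to evaluate it, which is the practical advantage over a Monte Carlo estimate of the KL divergence.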

Strong Numerical Results

Empirical evaluations support these claims. The single-Gaussian WaveNet synthesizes speech with fidelity comparable to models that use more complex or larger output spaces, such as mixture models or high-dimensional softmax layers. In subjective Mean Opinion Score (MOS) tests, it receives ratings close to those of natural human speech. The end-to-end text-to-wave model also significantly outperforms the earlier pipeline that connects a text-to-spectrogram model to a separately trained WaveNet.

Implications and Future Directions

Practically, the work reduces the computational cost of TTS systems: parallel generation makes efficient real-time speech synthesis possible without compromising audio quality. Theoretically, it challenges the assumption that high-fidelity waveform generation requires a complex output distribution.

Future work could integrate perceptual losses into the framework to further improve audio realism. Applying the parallel architecture to other languages or dialects could broaden its applicability. The results may also inform non-autoregressive models for other sequential data tasks.

In conclusion, ClariNet shows that high-quality speech synthesis and computational efficiency can be combined, and it opens the way to both further theoretical study and practical implementations.
