Parallel WaveGAN: A fast waveform generation model based on generative adversarial networks with multi-resolution spectrogram

Published 25 Oct 2019 in eess.AS, cs.LG, cs.SD, and eess.SP | (1910.11480v2)

Abstract: We propose Parallel WaveGAN, a distillation-free, fast, and small-footprint waveform generation method using a generative adversarial network. In the proposed method, a non-autoregressive WaveNet is trained by jointly optimizing multi-resolution spectrogram and adversarial loss functions, which can effectively capture the time-frequency distribution of the realistic speech waveform. As our method does not require density distillation used in the conventional teacher-student framework, the entire model can be easily trained. Furthermore, our model is able to generate high-fidelity speech even with its compact architecture. In particular, the proposed Parallel WaveGAN has only 1.44 M parameters and can generate 24 kHz speech waveform 28.68 times faster than real-time on a single GPU environment. Perceptual listening test results verify that our proposed method achieves 4.16 mean opinion score within a Transformer-based text-to-speech framework, which is comparative to the best distillation-based Parallel WaveNet system.

Summary

The paper introduces a distillation-free training method that jointly optimizes multi-resolution STFT and adversarial losses, which simplifies training.
It employs a compact architecture with only 1.44 million parameters, achieving 28.68x real-time synthesis speed on a single high-end GPU.
The model delivers competitive perceptual quality with a MOS of 4.16, demonstrating its promise for resource-constrained speech synthesis applications.

Parallel WaveGAN: A Fast Waveform Generation Model Based on GANs with Multi-resolution Spectrogram

The paper presents Parallel WaveGAN, a waveform generation method that combines a Generative Adversarial Network (GAN) with a multi-resolution spectrogram loss. The method targets fast, small-footprint, high-fidelity speech synthesis without the density distillation used in teacher-student frameworks such as Parallel WaveNet.

Key Contributions and Methodology

The authors introduce several technical innovations and improvements over existing methods:

Distillation-Free Training: Parallel WaveGAN avoids the probability density distillation required in teacher-student frameworks. The generator is trained directly, with no autoregressive teacher to train first, which simplifies the training pipeline and reduces training time.
Joint Training Approach: The proposed method combines a multi-resolution Short-Time Fourier Transform (STFT) loss with an adversarial loss. Optimizing both jointly helps the model capture the time-frequency distribution of speech waveforms. A non-autoregressive WaveNet serves as the generator and is trained against a discriminator network, so the generator learns to produce realistic speech by matching the distribution of real speech signals.
Efficient Architecture: The model has only 1.44 million parameters, far fewer than larger, more computationally intensive models. It generates 24 kHz speech 28.68 times faster than real time on a single NVIDIA V100 GPU.
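The multi-resolution STFT loss named in the joint training approach can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' implementation. It assumes each per-resolution loss is a spectral-convergence term plus a log-magnitude L1 term, averaged over resolutions. The Hann window, the magnitude floor, and the three (FFT size, hop, window length) settings are assumptions not given in this summary.

```python
import numpy as np

# Assumed analysis settings: (fft_size, hop, win_length) in samples.
RESOLUTIONS = [(1024, 120, 600), (2048, 240, 1200), (512, 50, 240)]

def stft_magnitude(x, fft_size, hop, win_length):
    """Return the magnitude spectrogram, shape (frames, fft_size // 2 + 1)."""
    window = np.hanning(win_length)  # assumed Hann window
    n_frames = 1 + (len(x) - win_length) // hop
    frames = np.stack([x[i * hop : i * hop + win_length] * window
                       for i in range(n_frames)])
    # rfft zero-pads each frame from win_length up to fft_size.
    mag = np.abs(np.fft.rfft(frames, n=fft_size, axis=-1))
    return np.maximum(mag, 1e-7)  # floor avoids log(0) below

def stft_loss(x, x_hat, fft_size, hop, win_length):
    """Spectral convergence + log-magnitude L1 at one resolution."""
    m = stft_magnitude(x, fft_size, hop, win_length)
    m_hat = stft_magnitude(x_hat, fft_size, hop, win_length)
    spectral_convergence = np.linalg.norm(m - m_hat) / np.linalg.norm(m)
    log_magnitude = np.mean(np.abs(np.log(m) - np.log(m_hat)))
    return spectral_convergence + log_magnitude

def multi_resolution_stft_loss(x, x_hat, resolutions=RESOLUTIONS):
    """Average the single-resolution losses over all analysis settings."""
    return float(np.mean([stft_loss(x, x_hat, *r) for r in resolutions]))

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    t = np.arange(4800) / 24000            # 0.2 s at 24 kHz
    real = np.sin(2 * np.pi * 220 * t)     # stand-in for real speech
    fake = real + 0.1 * rng.standard_normal(len(t))
    print(multi_resolution_stft_loss(real, real))  # identical signals give zero loss
    print(multi_resolution_stft_loss(real, fake))  # positive for a mismatch
```

In training, this value would be added to a weighted adversarial term to form the generator objective. The weighting is not stated in this summary.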

Experimental Setup and Results

Dataset and Model Details

The experiments employed a dataset consisting of 23.09 hours of speech from a single Japanese female speaker, with additional data for validation and evaluation. The speech signals were resampled at 24 kHz, and 80-band log-mel spectrograms were used as auxiliary features for conditioning.

The Parallel WaveGAN generator was configured with 30 layers of dilated residual convolution blocks. Training used the RAdam optimizer with an initial learning rate of 0.0001 for the generator and 0.00005 for the discriminator, with the multi-resolution STFT loss serving as an auxiliary loss alongside the adversarial loss.
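To give a feel for what 30 dilated layers cover, the sketch below computes the receptive field of such a stack. It assumes the layers are arranged in three cycles of exponentially increasing dilation (1, 2, 4, ..., 512) with kernel size 3. Neither detail appears in this summary, so treat both as illustrative.

```python
def receptive_field(num_layers=30, num_cycles=3, kernel_size=3):
    """Receptive field, in samples, of a stack of dilated 1-D convolutions.

    Assumes the dilation doubles each layer and resets at the start of every
    cycle (hypothetical layout, not confirmed by the summary).
    """
    layers_per_cycle = num_layers // num_cycles
    dilations = [2 ** (i % layers_per_cycle) for i in range(num_layers)]
    # Each layer widens the field by (kernel_size - 1) * dilation samples.
    return 1 + sum((kernel_size - 1) * d for d in dilations)

if __name__ == "__main__":
    rf = receptive_field()
    print(rf)                                  # 6139 samples
    print(f"{rf / 24_000 * 1000:.1f} ms at 24 kHz")
```

Under these assumptions each output sample depends on roughly a quarter of a second of context, which is why a stack this small can still model local speech structure.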

Performance Metrics

The evaluation covered both perceptual quality, measured by Mean Opinion Score (MOS), and computational efficiency. Parallel WaveGAN achieved a MOS of 4.16 within a Transformer-based TTS framework, which is competitive with the best distillation-based system, ClariNet (MOS of 4.21).

Theoretical and Practical Implications

Implications for Speech Synthesis

The most notable practical implication of Parallel WaveGAN is that it synthesizes high-fidelity speech faster than real time without the cumbersome training procedure required by distillation-based methods. Its compact architecture makes it a candidate for deployment in resource-constrained environments, such as mobile or embedded systems, although the reported speed was measured on a GPU.
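The reported figures translate into concrete throughput and footprint numbers with simple arithmetic. The sketch below uses only the values stated in the text (24 kHz, 28.68 times real time, 1.44 M parameters). The 4-bytes-per-parameter storage figure assumes float32 weights, which the summary does not state.

```python
SAMPLE_RATE_HZ = 24_000
SPEEDUP = 28.68          # times faster than real time, on a single GPU
NUM_PARAMS = 1.44e6

def samples_per_second(sample_rate_hz, speedup):
    """Waveform samples generated per wall-clock second."""
    return sample_rate_hz * speedup

def synthesis_time(duration_s, speedup):
    """Wall-clock seconds needed to generate duration_s seconds of audio."""
    return duration_s / speedup

def float32_size_mb(num_params):
    """Weight storage in megabytes, assuming 4 bytes per parameter."""
    return num_params * 4 / 1e6

if __name__ == "__main__":
    print(f"{samples_per_second(SAMPLE_RATE_HZ, SPEEDUP):,.0f} samples/s")
    print(f"{synthesis_time(10.0, SPEEDUP):.3f} s to generate 10 s of speech")
    print(f"{float32_size_mb(NUM_PARAMS):.2f} MB of float32 weights")
```

These numbers describe the GPU measurement only. Throughput on a mobile or embedded processor would need to be measured separately.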

On a theoretical level, combining a multi-resolution STFT loss with adversarial training gives the generator a training signal that reflects the time-frequency structure of speech. Because the loss sums STFT losses computed with different analysis parameters, it keeps the generator from overfitting to a single fixed STFT representation and improves the fidelity of the generated waveforms.

Future Directions

Potential future work could explore the enhancement of the multi-resolution STFT auxiliary loss by incorporating phase-related loss components to better capture the nuanced characteristics of speech signals. Additionally, broadening the scope to include diverse and expressive speech corpora could further validate the robustness and generality of the Parallel WaveGAN architecture.

Conclusion

Parallel WaveGAN is a compact neural vocoder for speech synthesis that avoids distillation by jointly training with a multi-resolution STFT loss and an adversarial loss. Its reported speed (28.68 times faster than real time) and perceptual quality (MOS of 4.16) make it a practical option for real-world speech synthesis applications.
