WaveGlow: A Flow-based Generative Network for Speech Synthesis

Published 31 Oct 2018 in cs.SD, cs.AI, cs.LG, eess.AS, and stat.ML | (1811.00002v1)

Abstract: In this paper we propose WaveGlow: a flow-based network capable of generating high quality speech from mel-spectrograms. WaveGlow combines insights from Glow and WaveNet in order to provide fast, efficient and high-quality audio synthesis, without the need for auto-regression. WaveGlow is implemented using only a single network, trained using only a single cost function: maximizing the likelihood of the training data, which makes the training procedure simple and stable. Our PyTorch implementation produces audio samples at a rate of more than 500 kHz on an NVIDIA V100 GPU. Mean Opinion Scores show that it delivers audio quality as good as the best publicly available WaveNet implementation. All code will be made publicly available online.


Authors (3)

Citations (991)


Summary

- The paper presents a flow-based model that combines ideas from Glow and WaveNet to synthesize high-quality speech with a single network and a single training objective.
- The paper reports mean opinion scores for WaveGlow on par with a WaveNet baseline, at synthesis speeds far above real time and several thousand times faster than that auto-regressive baseline.
- The model may also suit broader applications, which leaves room for future research in multilingual and non-speech high-fidelity generative modeling.


The paper "WaveGlow: A Flow-based Generative Network for Speech Synthesis" presents an approach to generating high-quality speech from mel-spectrograms. The proposed flow-based network, WaveGlow, combines insights from Glow and WaveNet to provide fast, efficient, and high-quality audio synthesis. Because WaveGlow is not auto-regressive, audio samples can be generated in parallel rather than one at a time, and the use of a single network with a single loss simplifies both training and inference.

Technical Contributions

The primary contribution of WaveGlow is its architecture, which combines the flow-based generative framework of Glow with the high-quality audio modeling of WaveNet. WaveGlow is a single network trained with a single objective: minimizing the negative log-likelihood of the training data (equivalently, maximizing its likelihood). This keeps the training procedure simple and stable. Key architectural components include:

Flow-based Generative Model:
- The model draws samples from a zero-mean spherical Gaussian distribution and transforms them through a sequence of invertible layers into the desired audio distribution, conditioned on the mel-spectrogram.
- The architecture ensures invertibility at each layer, thus allowing the likelihood to be computed directly using a change of variables.
Affine Coupling Layers:
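- The network uses affine coupling layers, in which half the channels (together with the mel-spectrogram) are fed to a WaveNet-like network that produces multiplicative and additive terms; these scale and translate the remaining channels.
- Because the first half passes through unchanged, the same scale and translation terms can be recomputed from it and undone exactly. The layer is therefore invertible regardless of how complex the transforming network is, and its log-determinant is simply the sum of the log scale terms.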
1x1 Invertible Convolutions:
- To mix information across channels, the authors incorporate invertible 1x1 convolution layers before each affine coupling layer.
- The weights are initialized to be orthonormal, and therefore invertible, and the log-determinant of each convolution enters the loss function as part of the change of variables.
Early Outputs:
- Rather than passing every channel through every layer, WaveGlow sends a portion of the channels to the loss at intermediate points in the network. According to the authors, this helps gradients propagate to earlier layers and lets the network add information at multiple time scales.
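
The coupling and change-of-variables ideas in the list above can be illustrated with a minimal, standard-library Python sketch. It is not the authors' PyTorch implementation: `toy_conditioner` is a hypothetical stand-in for the WaveNet-like network, the vectors are tiny, and the invertible 1x1 convolution and early outputs are omitted for brevity. It only shows that one affine coupling step can be inverted exactly and that its log-determinant is the sum of the log scale terms.

```python
import math

def toy_conditioner(x_a, mel):
    """Hypothetical stand-in for the WaveNet-like network (an assumption).

    Maps the untouched half x_a and the conditioning vector mel to a
    log-scale and a translation for each element of the other half.
    """
    log_s = [0.5 * math.tanh(a + m) for a, m in zip(x_a, mel)]
    t = [0.1 * a - 0.2 * m for a, m in zip(x_a, mel)]
    return log_s, t

def coupling_forward(x, mel):
    """Audio-to-latent direction: returns the output and log|det Jacobian|."""
    half = len(x) // 2
    x_a, x_b = x[:half], x[half:]
    log_s, t = toy_conditioner(x_a, mel)
    y_b = [math.exp(ls) * b + ti for ls, b, ti in zip(log_s, x_b, t)]
    # x_a passes through unchanged, so the Jacobian is triangular and its
    # log-determinant is the sum of the log scale terms.
    return x_a + y_b, sum(log_s)

def coupling_inverse(y, mel):
    """Latent-to-audio direction: undoes coupling_forward exactly."""
    half = len(y) // 2
    y_a, y_b = y[:half], y[half:]
    log_s, t = toy_conditioner(y_a, mel)  # y_a equals x_a, so same terms
    x_b = [(b - ti) * math.exp(-ls) for ls, b, ti in zip(log_s, y_b, t)]
    return y_a + x_b

def log_likelihood(x, mel, sigma=1.0):
    """Change of variables: log p(x) = log N(z; 0, sigma^2 I) + log|det J|."""
    z, log_det = coupling_forward(x, mel)
    log_pz = (-sum(v * v for v in z) / (2.0 * sigma ** 2)
              - len(z) * math.log(sigma * math.sqrt(2.0 * math.pi)))
    return log_pz + log_det

x = [0.3, -1.2, 0.7, 0.05]   # toy "audio" vector (made-up values)
mel = [0.4, -0.6]            # toy conditioning vector (made-up values)
z, log_det = coupling_forward(x, mel)
x_rec = coupling_inverse(z, mel)
print("max reconstruction error:", max(abs(a - b) for a, b in zip(x, x_rec)))
print("log-likelihood of x:", log_likelihood(x, mel))
```

In a full model, several such layers would be stacked, each preceded by an invertible channel-mixing step whose log-determinant is added to the same sum; training minimizes the negative of `log_likelihood`, and synthesis runs the inverse direction on Gaussian noise.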

Experiments and Results

The authors conducted experiments using the LJ Speech dataset, which comprises around 24 hours of high-quality speech data. Two baseline models were employed for comparison: Griffin-Lim and a standard implementation of WaveNet. The performance evaluation included Mean Opinion Score (MOS) tests and synthesis speed assessments.

Mean Opinion Scores (MOS):
- Griffin-Lim: 3.823 ± 0.1349
- WaveNet: 3.885 ± 0.1238
- WaveGlow: 3.961 ± 0.1343
- Ground Truth: 4.274 ± 0.1340

WaveGlow obtained the highest MOS of the three synthesis methods, but its margin over the WaveNet baseline is small relative to the reported ± ranges, and all three methods remain below the Ground Truth score.
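
As a rough reading aid for the scores listed above, the following Python sketch checks which reported ranges overlap. It assumes the ± values are half-widths of confidence intervals, which the list does not state explicitly, and interval overlap is only a heuristic, not a formal significance test.

```python
# Scores copied from the list above: (mean, reported +/- value).
# Assumption: the +/- value is the half-width of a confidence interval.
mos = {
    "Griffin-Lim": (3.823, 0.1349),
    "WaveNet": (3.885, 0.1238),
    "WaveGlow": (3.961, 0.1343),
    "Ground Truth": (4.274, 0.1340),
}

def interval(name):
    """Return the (low, high) range implied by mean +/- half-width."""
    mean, half_width = mos[name]
    return mean - half_width, mean + half_width

def overlaps(a, b):
    """True if the ranges of systems a and b share at least one point."""
    lo_a, hi_a = interval(a)
    lo_b, hi_b = interval(b)
    return lo_a <= hi_b and lo_b <= hi_a

for name in mos:
    if name != "WaveGlow":
        print(f"WaveGlow vs {name}: overlap = {overlaps('WaveGlow', name)}")
```

Under that assumption, the WaveGlow range overlaps those of WaveNet and Griffin-Lim but not that of the Ground Truth recordings.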

Inference Speed:
- Griffin-Lim: 507 kHz
- WaveNet: 0.11 kHz
- WaveGlow: 520 kHz on an NVIDIA V100 GPU

WaveGlow synthesized audio at approximately 520 kHz, several thousand times faster than the 0.11 kHz of the auto-regressive WaveNet baseline and slightly faster than Griffin-Lim. This rate is far above what real-time playback requires.
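
To put these rates in perspective, the short Python sketch below converts them into real-time factors. The 22.05 kHz playback rate is an assumption (it is the sampling rate of the LJ Speech recordings and is not stated in the figures above); the speeds are copied from the list.

```python
# Assumption: audio is played back at 22.05 kHz (the LJ Speech sampling rate).
SAMPLE_RATE_KHZ = 22.05

# Synthesis speeds from the list above, in thousands of samples per second.
speeds_khz = {"Griffin-Lim": 507.0, "WaveNet": 0.11, "WaveGlow": 520.0}

def real_time_factor(speed_khz, sample_rate_khz=SAMPLE_RATE_KHZ):
    """Seconds of audio produced per second of compute."""
    return speed_khz / sample_rate_khz

for name, speed in speeds_khz.items():
    print(f"{name}: {real_time_factor(speed):.3f}x real time")

# Ratio between the two neural vocoders; independent of the sampling rate.
speedup = speeds_khz["WaveGlow"] / speeds_khz["WaveNet"]
print(f"WaveGlow vs WaveNet speedup: {speedup:.0f}x")
```

Under this assumption WaveGlow runs at more than 20 times real time, while the WaveNet baseline is well below real time; the ratio between the two does not depend on the assumed sampling rate.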

Discussion and Implications

The paper contrasts auto-regressive and non-auto-regressive models for speech synthesis. Auto-regressive models such as WaveNet produce excellent audio, but inference is slow because each sample depends on the samples generated before it. Non-auto-regressive models, such as Parallel WaveNet, ClariNet, and the proposed WaveGlow, can generate samples in parallel, which drastically accelerates inference.

WaveGlow's architecture, which follows the flow-based approach of Glow while borrowing structure from WaveNet, avoids the more complex training procedures of Parallel WaveNet and ClariNet, which rely on a teacher network and a student network. This simplification could make high-quality audio synthesis systems easier to train and deploy. The demonstrated synthesis speed and quality make WaveGlow a valuable addition to the field of speech synthesis.

Future Directions

The impact of this work extends beyond speech synthesis. Future research could explore:

- Enhancing the model's robustness across diverse datasets and multilingual capabilities.
- Further optimizing the synthesis speed through hardware-specific advancements and refined software implementations.
- Investigating the applicability of WaveGlow to other domains requiring high-fidelity and efficient generative modeling, such as music synthesis or real-time audio processing in virtual environments.

In conclusion, WaveGlow marks significant progress in achieving high-quality, fast, and efficient speech synthesis. Its simplicity in training and impressive performance metrics underscore the promising potential for broader applications and further explorations in generative audio models.
