Natural TTS Synthesis by Conditioning WaveNet on Mel Spectrogram Predictions

Published 16 Dec 2017 in cs.CL | (1712.05884v2)

Abstract: This paper describes Tacotron 2, a neural network architecture for speech synthesis directly from text. The system is composed of a recurrent sequence-to-sequence feature prediction network that maps character embeddings to mel-scale spectrograms, followed by a modified WaveNet model acting as a vocoder to synthesize time-domain waveforms from those spectrograms. Our model achieves a mean opinion score (MOS) of $4.53$, comparable to a MOS of $4.58$ for professionally recorded speech. To validate our design choices, we present ablation studies of key components of our system and evaluate the impact of using mel spectrograms as the input to WaveNet instead of linguistic, duration, and $F_0$ features. We further demonstrate that using a compact acoustic intermediate representation enables significant simplification of the WaveNet architecture.

Summary

The paper introduces Tacotron 2, which conditions WaveNet on mel spectrogram predictions to produce natural, high-quality synthesized speech.
It employs a sequence-to-sequence model for text-to-mel conversion and a modified WaveNet vocoder; the full system achieves a mean opinion score of 4.53, close to the 4.58 measured for professional recordings.
Ablation experiments highlight that using mel spectrograms simplifies training and enhances audio fidelity compared to traditional vocoding methods.

Overview of "Natural TTS Synthesis By Conditioning WaveNet On Mel Spectrogram Predictions"

This paper presents Tacotron 2, a neural network architecture for text-to-speech (TTS) synthesis. The system combines a sequence-to-sequence model that predicts mel spectrograms from text with a modified WaveNet vocoder that turns those spectrograms into highly natural speech. The combined system improves audio quality and naturalness over prior TTS systems.

Introduction and Background

The paper positions Tacotron 2 within the historical context of TTS systems, which have evolved from concatenative synthesis methods to statistical parametric approaches, and most recently, to deep learning models like WaveNet. Both concatenative and parametric methods have limitations, such as unnatural transitions and muffled audio quality, respectively. The introduction of WaveNet, a probabilistic model generating time-domain waveforms, marked substantial progress in TTS audio quality. However, WaveNet's dependency on hand-crafted linguistic and acoustic features poses challenges for scalability and adaptability.

Tacotron, another relevant work, introduced a sequence-to-sequence architecture generating spectrograms from text, simplifying the traditional speech synthesis pipeline. However, the use of the Griffin-Lim algorithm for vocoding in Tacotron resulted in lower audio quality. Tacotron 2 integrates these advancements by employing a sequence-to-sequence model to predict mel spectrograms, which are then synthesized into waveforms by a WaveNet vocoder, merging the benefits of both approaches.

Model Architecture

Tacotron 2 consists of two primary components:

Spectrogram Prediction Network: A recurrent sequence-to-sequence model with attention that converts input character sequences to mel spectrograms. This network comprises an encoder, which transforms input text into hidden feature representations, and an autoregressive decoder, which uses these representations to predict mel spectrograms.
WaveNet Vocoder: A modified WaveNet model that generates time-domain waveform samples conditioned on the mel spectrogram frames produced by the first component. Compared with the original WaveNet, it models each output sample with a mixture of logistic distributions rather than a softmax over discretized values, and it uses fewer upsampling layers in its conditioning stack.
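
The mixture-of-logistics output mentioned above can be illustrated with a minimal sampling sketch. It assumes the vocoder has already produced, for one time step, a set of mixture weights, means, and scales; the function name, its parameters, and the clipping of samples to [-1, 1] are illustrative assumptions, not details taken from the paper's implementation.

```python
import math
import random


def sample_mixture_of_logistics(weights, means, scales, rng):
    """Draw one sample from a mixture of logistic distributions.

    weights: mixture probabilities (non-negative, summing to 1)
    means, scales: per-component location and scale parameters
    rng: a random.Random instance
    Hypothetical helper, for illustration only.
    """
    # Choose a mixture component according to its weight.
    k = rng.choices(range(len(weights)), weights=weights, k=1)[0]

    # Inverse-CDF sampling: the logistic quantile function is
    # mu + s * log(u / (1 - u)) for u in (0, 1).
    u = min(max(rng.random(), 1e-7), 1.0 - 1e-7)
    x = means[k] + scales[k] * math.log(u / (1.0 - u))

    # Assumption: waveform samples are normalised to [-1, 1].
    return max(-1.0, min(1.0, x))


rng = random.Random(0)
sample = sample_mixture_of_logistics(
    weights=[0.7, 0.3], means=[-0.2, 0.4], scales=[0.05, 0.1], rng=rng
)
```

In an autoregressive vocoder, a draw like this would be fed back as input for the next time step. A continuous mixture avoids having to predict a very large softmax when samples have high bit depth.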

Using mel spectrograms as an intermediate representation between these components is a pivotal design choice. Mel spectrograms are smoother than waveform samples and are invariant to phase within each frame, which makes them easier to fit with a squared-error loss and lets the two components be trained separately.
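
A small experiment can make the phase argument concrete. The sketch below compares two frames holding the same tone with different phases: their waveform samples differ substantially, while their magnitude spectra (the quantity a mel spectrogram is built from, before the mel filterbank and log compression) are essentially identical. The frame length, the tone, and the helper names are illustrative assumptions, and the naive DFT is used only to keep the example dependency-free.

```python
import cmath
import math


def dft_magnitudes(frame):
    """Naive O(N^2) DFT magnitudes for the non-negative frequency bins."""
    n = len(frame)
    return [
        abs(sum(x * cmath.exp(-2j * math.pi * k * t / n)
                for t, x in enumerate(frame)))
        for k in range(n // 2 + 1)
    ]


def mean_squared_error(a, b):
    """Mean squared difference between two equal-length sequences."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def sine_frame(cycles, phase, n=64):
    """One frame holding a whole number of sine cycles (illustrative signal)."""
    return [math.sin(2 * math.pi * cycles * t / n + phase) for t in range(n)]


frame_a = sine_frame(cycles=4, phase=0.0)
frame_b = sine_frame(cycles=4, phase=math.pi / 2)  # same tone, shifted phase

# Large: the raw samples disagree even though the two frames sound the same.
waveform_mse = mean_squared_error(frame_a, frame_b)

# Near zero: the magnitude spectrum ignores the phase shift.
spectrum_mse = mean_squared_error(dft_magnitudes(frame_a),
                                  dft_magnitudes(frame_b))
```

A squared-error loss on waveforms would penalize the second frame heavily, whereas a loss on spectral magnitudes treats the two as equivalent, which is the property that makes the spectrogram target easier to regress.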

Experimental Results

The authors evaluate Tacotron 2 through subjective listening tests, a side-by-side comparison with ground-truth recordings, and ablation studies.

Mean Opinion Score (MOS): The main subjective evaluation metric, MOS, demonstrated that Tacotron 2 achieves a score of 4.53, closely approaching the MOS of 4.58 for professionally recorded speech. This surpasses the scores of other TTS systems significantly, including the original Tacotron with Griffin-Lim and even previous versions of WaveNet conditioned on linguistic features.
Comparison Studies: In a side-by-side evaluation against ground-truth recordings, raters showed a small but statistically significant preference for the ground truth, with a mean score of -0.270 on a scale where negative values favor the recording. The gap is small, indicating that the synthesized speech comes close to, but does not fully match, human speech.
Ablation Studies: Various ablation experiments underscored the importance of the model's design choices. Notably, predicting linear spectrograms instead of mel spectrograms, or removing the post-processing network, led to reduced performance. Additionally, a simplified WaveNet configuration with fewer layers and a smaller receptive field still maintained high audio quality, showcasing the efficiency of the mel spectrogram representation.
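
To give a sense of what "fewer layers and a smaller receptive field" means in practice, the following sketch computes the receptive field of a stack of dilated convolutions whose dilation doubles within each cycle. The layer counts, the kernel size of 3, and the 24 kHz sample rate are assumptions chosen for illustration; they are not stated in this summary, and the function names are hypothetical.

```python
def wavenet_receptive_field(num_layers, num_cycles, kernel_size):
    """Receptive field, in samples, of a stack of dilated convolutions.

    Layers are split evenly into `num_cycles` cycles; within each cycle
    the dilation doubles from layer to layer: 1, 2, 4, ...
    """
    if num_layers % num_cycles != 0:
        raise ValueError("num_layers must be divisible by num_cycles")
    layers_per_cycle = num_layers // num_cycles
    dilations = [2 ** (i % layers_per_cycle) for i in range(num_layers)]
    # Each layer widens the receptive field by (kernel_size - 1) * dilation.
    return 1 + (kernel_size - 1) * sum(dilations)


def samples_to_ms(samples, sample_rate_hz):
    """Convert a span in samples to milliseconds."""
    return 1000.0 * samples / sample_rate_hz


# Assumed configurations: a deep stack and a much shallower one.
large = wavenet_receptive_field(num_layers=30, num_cycles=3, kernel_size=3)
small = wavenet_receptive_field(num_layers=12, num_cycles=2, kernel_size=3)

print(large, round(samples_to_ms(large, 24000), 1))  # 6139 255.8
print(small, round(samples_to_ms(small, 24000), 1))  # 253 10.5
```

Under these assumptions the shallow stack sees only about a hundredth of a second of audio. That such a vocoder can still sound good is consistent with the idea that the mel spectrogram, rather than the vocoder's own long-range context, carries most of the long-term structure.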

Implications and Future Directions

Tacotron 2 represents a significant advancement in neural TTS systems due to its integration of sequence-to-sequence models and high-fidelity neural vocoders. Practically, this system offers potential applications in various domains requiring natural speech synthesis, such as virtual assistants, audiobooks, and accessibility tools for individuals with disabilities.

Theoretically, the success of Tacotron 2 highlights the effectiveness of end-to-end neural approaches in TTS tasks, suggesting further exploration in minimizing reliance on feature engineering. Future research could investigate improvements in prosody modeling and better generalization to out-of-domain text, addressing observed limitations like occasional mispronunciations and unnatural prosody.

Further developments may include:

Enhancing the robustness of speech synthesis in the presence of diverse and complex input text.
Expanding the model to support multiple languages and dialects.
Exploring alternatives to mel spectrograms for even more compact or efficient intermediate representations.

Conclusion

Tacotron 2 is a sophisticated TTS system that bridges innovations in sequence-to-sequence modeling and neural vocoding, achieving near-human quality in synthesized speech. The paper's thorough experimentation and validation affirm the model's capability to produce natural, intelligible, and high-quality audio, marking a noteworthy step in the field of speech synthesis.
