Tacotron: Towards End-to-End Speech Synthesis

Published 29 Mar 2017 in cs.CL, cs.LG, and cs.SD | (1703.10135v2)

Abstract: A text-to-speech synthesis system typically consists of multiple stages, such as a text analysis frontend, an acoustic model and an audio synthesis module. Building these components often requires extensive domain expertise and may contain brittle design choices. In this paper, we present Tacotron, an end-to-end generative text-to-speech model that synthesizes speech directly from characters. Given <text, audio> pairs, the model can be trained completely from scratch with random initialization. We present several key techniques to make the sequence-to-sequence framework perform well for this challenging task. Tacotron achieves a 3.82 subjective 5-scale mean opinion score on US English, outperforming a production parametric system in terms of naturalness. In addition, since Tacotron generates speech at the frame level, it's substantially faster than sample-level autoregressive methods.


Authors (14)

Citations (1,749)


Summary

The paper introduces an end-to-end TTS model that consolidates multi-stage processes into a single, trainable framework.
Tacotron employs a novel CBHG module with an attention-based seq2seq approach to capture both local and global context in text.
Evaluations show the model achieves a MOS of 3.82, outperforming a production parametric system while reducing engineering complexity.

Tacotron: Towards End-to-End Speech Synthesis

The paper introduces Tacotron, an end-to-end generative text-to-speech (TTS) model. Traditional TTS systems typically chain distinct stages such as text analysis, acoustic modeling, and audio synthesis. Each stage requires extensive domain expertise and significant engineering effort, and the design choices involved can be brittle. Tacotron sidesteps these challenges by synthesizing speech directly from characters: it is trained from paired text and audio alone, with no need for phoneme-level alignment.

Key Contributions and Techniques

The primary contribution of Tacotron is its ability to generate speech from text in a single, integrated model. The model architecture is informed by the sequence-to-sequence (seq2seq) framework with attention mechanisms, which have been successful in other domains such as machine translation and speech recognition. Tacotron adapts these concepts to the unique challenges of TTS, where output sequences are significantly longer and more variable than input sequences.
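
As a rough illustration of the attention mechanism at the heart of such a seq2seq model, the following is a minimal NumPy sketch of content-based (additive, tanh) attention. The function name, weight matrices, and dimensions are assumptions chosen for the example, with random weights standing in for learned ones; it is not the paper's implementation.

```python
import numpy as np

def additive_attention(query, memory, W_q, W_m, v):
    """Content-based (additive, tanh) attention.

    query:  (d_q,)    current decoder state
    memory: (T, d_m)  encoder outputs, one row per input character
    Returns the context vector (d_m,) and the attention weights (T,).
    """
    # Score every encoder position: v^T tanh(W_q q + W_m h_t)
    hidden = np.tanh(query @ W_q + memory @ W_m)   # (T, d_a)
    scores = hidden @ v                            # (T,)
    # Softmax over input positions, shifted for numerical stability
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    # Context is the attention-weighted average of the encoder outputs
    context = weights @ memory                     # (d_m,)
    return context, weights

# Hypothetical sizes and random weights, for illustration only
rng = np.random.default_rng(0)
T, d_q, d_m, d_a = 6, 4, 5, 8
memory = rng.normal(size=(T, d_m))
query = rng.normal(size=d_q)
W_q = rng.normal(size=(d_q, d_a))
W_m = rng.normal(size=(d_m, d_a))
v = rng.normal(size=d_a)

context, weights = additive_attention(query, memory, W_q, W_m, v)
print(weights.round(3), context.shape)
```

At each decoder step the weights form a distribution over input characters, so the decoder can attend to the part of the text it is currently voicing.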

Tacotron introduces a module called CBHG (Convolutional Bank + Highway Network + Bidirectional GRU). It passes the input sequence through a bank of 1-D convolutional filters of different widths, then a stack of highway networks, and finally a bidirectional GRU. The convolutional bank captures local context, much like modeling n-grams of varying length, while the bidirectional GRU captures context across the whole sequence.
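
The first two stages of CBHG can be sketched in NumPy as follows. This is a simplified illustration under several assumptions: the helper names, sizes, and random weights are hypothetical, a plain linear projection stands in for the module's projection layers, and the bidirectional GRU is omitted to keep the sketch short.

```python
import numpy as np

def conv1d_same(x, kernel):
    """1-D convolution over time with 'same' padding.
    x: (T, C_in), kernel: (k, C_in, C_out) -> (T, C_out)"""
    k = kernel.shape[0]
    left = (k - 1) // 2
    right = k - 1 - left
    padded = np.pad(x, ((left, right), (0, 0)))
    out = np.zeros((x.shape[0], kernel.shape[2]))
    for t in range(x.shape[0]):
        window = padded[t:t + k]                       # (k, C_in)
        out[t] = np.einsum("kc,kco->o", window, kernel)
    return out

def conv_bank(x, kernels):
    """Apply filters of widths 1..K (with ReLU) and stack along channels."""
    return np.concatenate(
        [np.maximum(conv1d_same(x, w), 0.0) for w in kernels], axis=1)

def highway(x, W_h, b_h, W_t, b_t):
    """Highway layer: a sigmoid gate mixes a transformed input with the input."""
    h = np.maximum(x @ W_h + b_h, 0.0)                 # candidate (ReLU)
    t = 1.0 / (1.0 + np.exp(-(x @ W_t + b_t)))         # transform gate
    return t * h + (1.0 - t) * x

# Hypothetical sizes and random weights, for illustration only
rng = np.random.default_rng(0)
T, C, K = 10, 4, 3
x = rng.normal(size=(T, C))
kernels = [rng.normal(size=(k, C, C)) * 0.1 for k in range(1, K + 1)]

bank = conv_bank(x, kernels)                           # (T, K * C)
projection = rng.normal(size=(K * C, C)) * 0.1         # simplified projection
residual = bank @ projection + x                       # residual connection
W_h, W_t = rng.normal(size=(C, C)), rng.normal(size=(C, C))
b_h, b_t = np.zeros(C), np.full(C, -1.0)
out = highway(residual, W_h, b_h, W_t, b_t)            # (T, C)
print(bank.shape, out.shape)
```

Each filter width in the bank looks at a different amount of neighbouring context, and the highway gate lets the layer pass its input through unchanged wherever the transformation is not useful.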

Model Architecture

The architecture consists of three main components:

Encoder: The encoder converts the character sequence into a sequential representation using character embeddings, a pre-net with dropout, and the CBHG module. According to the paper, the CBHG-based encoder reduces overfitting and produces fewer mispronunciations than a standard RNN encoder.
Decoder: The decoder generates spectrogram frames from the encoded text using a content-based tanh attention mechanism. A distinctive design choice is that it predicts multiple non-overlapping frames at each step, which reduces the number of decoder steps and speeds up convergence.
Post-Processing Network: A CBHG-based post-processing network refines the decoder's spectrogram prediction, sharpening features such as harmonics and formants. The Griffin-Lim algorithm then converts the refined spectrogram into a waveform.
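
The final waveform synthesis step can be illustrated with a bare-bones Griffin-Lim loop in NumPy. This is a sketch under assumptions rather than the paper's configuration: the STFT helpers, window, frame size, hop, iteration counts, and the sine-wave test signal are all chosen for the example. The algorithm starts from random phase and alternates between inverting the spectrogram and re-estimating phase from the result, keeping the given magnitudes fixed.

```python
import numpy as np

def stft(x, n_fft, hop):
    """Short-time Fourier transform with a Hann window: (frames, n_fft//2 + 1)."""
    window = np.hanning(n_fft)
    n_frames = 1 + (len(x) - n_fft) // hop
    frames = np.stack(
        [x[i * hop:i * hop + n_fft] * window for i in range(n_frames)])
    return np.fft.rfft(frames, axis=1)

def istft(spec, n_fft, hop):
    """Least-squares inverse STFT via windowed overlap-add."""
    window = np.hanning(n_fft)
    n_frames = spec.shape[0]
    length = n_fft + hop * (n_frames - 1)
    signal = np.zeros(length)
    norm = np.zeros(length)
    frames = np.fft.irfft(spec, n=n_fft, axis=1)
    for i in range(n_frames):
        start = i * hop
        signal[start:start + n_fft] += frames[i] * window
        norm[start:start + n_fft] += window ** 2
    return signal / np.maximum(norm, 1e-8)

def griffin_lim(magnitude, n_fft, hop, n_iter=50, seed=0):
    """Estimate a waveform whose STFT magnitude approximates `magnitude`."""
    rng = np.random.default_rng(seed)
    phase = np.exp(2j * np.pi * rng.random(magnitude.shape))  # random start
    for _ in range(n_iter):
        signal = istft(magnitude * phase, n_fft, hop)
        # Keep the target magnitudes, adopt the phase of the re-analysed signal
        phase = np.exp(1j * np.angle(stft(signal, n_fft, hop)))
    return istft(magnitude * phase, n_fft, hop)

def spectral_error(signal, magnitude, n_fft, hop):
    """Distance between the signal's STFT magnitude and the target magnitude."""
    return np.linalg.norm(np.abs(stft(signal, n_fft, hop)) - magnitude)

# Illustrative input: the magnitude spectrogram of a synthetic sine wave
n_fft, hop = 64, 16
target = np.sin(2 * np.pi * 0.05 * np.arange(1024))
magnitude = np.abs(stft(target, n_fft, hop))

rough = griffin_lim(magnitude, n_fft, hop, n_iter=1)
refined = griffin_lim(magnitude, n_fft, hop, n_iter=50)
print(spectral_error(rough, magnitude, n_fft, hop),
      spectral_error(refined, magnitude, n_fft, hop))
```

Because the phase is only estimated, the reconstruction matches the target magnitudes approximately, which is one reason this kind of inversion can leave audible artifacts.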

Evaluation and Results

Tacotron was evaluated on a dataset of 24.6 hours of speech from a professional female speaker. It achieved a mean opinion score (MOS) of 3.82 on US English, surpassing a production parametric TTS system, which scored 3.69. A production concatenative system still scored higher, at 4.09 in the paper's comparison. The result is nonetheless notable because Tacotron avoids the hand-engineered features and complex preprocessing steps that conventional systems require.
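
For readers unfamiliar with the metric, a MOS is simply the mean of listener ratings on a 1 to 5 scale. The sketch below computes it together with a normal-approximation 95% confidence half-width, a common way to report such scores. The ratings are made up for illustration and are not the paper's data, and the function name is hypothetical.

```python
import math
import statistics

def mos_with_ci(ratings, z=1.96):
    """Mean opinion score and a normal-approximation 95% confidence half-width."""
    mean = statistics.fmean(ratings)
    sem = statistics.stdev(ratings) / math.sqrt(len(ratings))
    return mean, z * sem

# Hypothetical listener ratings on the 1-5 scale (not the paper's data)
ratings = [4, 3, 4, 5, 3, 4, 4, 3, 5, 4]
mos, half_width = mos_with_ci(ratings)
print(f"MOS = {mos:.2f} +/- {half_width:.2f}")
```

With only a handful of ratings the interval is wide, which is why listening tests typically collect many ratings per system before comparing scores.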

The authors also ran ablation studies to assess the contribution of individual components. These showed that the CBHG encoder improves on a standard GRU encoder and that the post-processing network improves the quality of the predicted spectrogram.

Implications and Future Directions

The successful implementation of Tacotron demonstrates the viability of end-to-end approaches in TTS, offering several practical advantages over traditional systems:

Reduction in Engineering Effort: By eliminating the need for handcrafted features and modular components, Tacotron simplifies the development and deployment of TTS systems.
Adaptability: Because it is trained end-to-end, Tacotron learns directly from data, which could make it easier to train on the diverse and often noisy datasets found in real-world applications.
Improved Naturalness and Speed: Tacotron reaches a MOS of 3.82, and because it generates speech at the frame level, it is substantially faster than sample-level autoregressive methods.
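
The speed argument can be made concrete with a back-of-the-envelope count of autoregressive steps. The frame shift, number of frames per decoder step (r), and sample rate below are illustrative assumptions, not figures stated in this summary, and the count ignores the cost of each step and of waveform synthesis.

```python
import math

def decoder_steps(duration_s, frame_shift_ms=12.5, r=2):
    """Decoder steps when each step emits r non-overlapping frames."""
    n_frames = math.ceil(duration_s * 1000 / frame_shift_ms)
    return math.ceil(n_frames / r)

def sample_level_steps(duration_s, sample_rate_hz=24000):
    """Steps for a model that emits one audio sample per step."""
    return int(duration_s * sample_rate_hz)

duration = 3.0  # seconds of speech, chosen for illustration
frame_steps = decoder_steps(duration)
sample_steps = sample_level_steps(duration)
print(frame_steps, sample_steps, sample_steps // frame_steps)
```

Under these assumptions, a frame-level decoder takes hundreds of times fewer sequential steps than a sample-level model for the same utterance, and predicting more frames per step shrinks the count further.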

Future research could focus on improving specific components like the Griffin-Lim waveform synthesis, which, while effective, can introduce artifacts. The exploration of neural-network-based spectrogram inversion methods could further enhance audio quality. Additionally, enhancements in the attention module and the development of more sophisticated loss functions might yield further improvements in performance and naturalness.

In conclusion, Tacotron represents a significant advancement in end-to-end TTS models, demonstrating the potential of seq2seq frameworks with attention mechanisms in generating high-quality speech directly from text. Its architecture alleviates many of the challenges associated with traditional TTS systems, providing a foundation for future innovations in the field.
