Efficiently Trainable Text-to-Speech System Based on Deep Convolutional Networks with Guided Attention

Published 24 Oct 2017 in cs.SD, cs.AI, cs.LG, and eess.AS | (1710.08969v2)

Abstract: This paper describes a novel text-to-speech (TTS) technique based on deep convolutional neural networks (CNN), without use of any recurrent units. Recurrent neural networks (RNN) have become a standard technique to model sequential data recently, and this technique has been used in some cutting-edge neural TTS techniques. However, training RNN components often requires a very powerful computer, or a very long time, typically several days or weeks. Other recent studies, on the other hand, have shown that CNN-based sequence synthesis can be much faster than RNN-based techniques, because of high parallelizability. The objective of this paper is to show that an alternative neural TTS based only on CNN can alleviate these economic costs of training. In our experiment, the proposed Deep Convolutional TTS was sufficiently trained overnight (15 hours), using an ordinary gaming PC equipped with two GPUs, while the quality of the synthesized speech was almost acceptable.

Summary

The paper presents a CNN-based TTS model with guided attention that drastically reduces training time while maintaining competitive audio quality.
It introduces a dual-module architecture—Text2Mel and SSRN—that converts text to mel-spectrograms and refines them into full spectrograms for waveform synthesis.
Experimental results on the LJ Speech Dataset show that DCTTS reaches a MOS comparable to that of open Tacotron implementations after just 15 hours of training on a dual-GPU setup.

The paper under discussion presents an innovative approach to text-to-speech (TTS) synthesis based solely on deep convolutional neural networks (CNNs), without involving any recurrent neural network (RNN) components. Traditionally, RNNs have been the preferred choice for sequential data modeling given their ability to process sequences over time. However, their computational demands are significant due to limited parallelization capabilities, often requiring extensive hardware resources or longer training times. This study proposes the Deep Convolutional TTS (DCTTS) method as an alternative to address these training inefficiencies.

Core Contributions and Methodology

The primary contribution of this paper is twofold: First, it introduces a fully CNN-based TTS model that offers competitive quality of synthesized speech while substantially reducing training time compared to RNN-based models like Tacotron. Second, the study puts forward a novel technique termed "guided attention," which expedites the training of the attention mechanism by guiding alignment in a more effective manner.
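To make the idea of "guiding" the alignment concrete, here is a minimal sketch of a guided-attention penalty. The summary above does not give the formula, so the weight W[n][t] = 1 - exp(-((n/N - t/T)^2) / (2 g^2)) and the default g = 0.2 are assumptions made for illustration, as are the function names. The intuition is that text position n and audio frame t should advance roughly in step, so attention mass far from the diagonal is penalized.

```python
import math


def guided_attention_weights(n_text, n_frames, g=0.2):
    """Penalty matrix W[n][t]: 0 on the diagonal n/N == t/T, approaching 1 far from it.

    The functional form and g=0.2 are assumptions for this sketch.
    """
    return [
        [
            1.0 - math.exp(-((n / n_text - t / n_frames) ** 2) / (2.0 * g * g))
            for t in range(n_frames)
        ]
        for n in range(n_text)
    ]


def guided_attention_loss(attention, g=0.2):
    """Mean of attention[n][t] * W[n][t] over all text positions n and frames t."""
    n_text, n_frames = len(attention), len(attention[0])
    weights = guided_attention_weights(n_text, n_frames, g)
    total = sum(
        attention[n][t] * weights[n][t]
        for n in range(n_text)
        for t in range(n_frames)
    )
    return total / (n_text * n_frames)


# Toy 4x4 attention matrices: one perfectly diagonal, one running the wrong way.
N = 4
diagonal = [[1.0 if n == t else 0.0 for t in range(N)] for n in range(N)]
anti_diagonal = [[1.0 if n == N - 1 - t else 0.0 for t in range(N)] for n in range(N)]

print(guided_attention_loss(diagonal))       # no penalty on the diagonal
print(guided_attention_loss(anti_diagonal))  # strictly larger penalty
```

In training, a term like this would be added to the main reconstruction loss, so the attention matrix is nudged toward a near-diagonal shape from the first iterations rather than having to discover it unaided.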

The proposed architecture consists of two interconnected modules: the Text-to-Mel Network (Text2Mel) and the Spectrogram Super-Resolution Network (SSRN). Text2Mel synthesizes mel-spectrograms from textual inputs, exploiting guided attention to align the text sequence with the audio frames. The SSRN then refines these mel-spectrograms into full spectrograms suitable for vocoder-based waveform synthesis. Notably, the authors leverage dilated convolutions to encapsulate long-term dependencies in the sequence without relying on RNNs, enabling parallel processing and thus faster training.
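The role of dilation can be illustrated with a short, dependency-free sketch. It shows how the receptive field of a stack of stride-1 convolutions grows with the dilation factors, and what a single causal dilated convolution computes. The kernel size, the dilation schedule, and the function names are hypothetical choices for this example, not the paper's actual configuration, and the causal variant is used only for simplicity.

```python
def receptive_field(kernel_size, dilations):
    """Number of input positions that can influence one output of the stack."""
    return 1 + sum((kernel_size - 1) * d for d in dilations)


def causal_dilated_conv1d(x, kernel, dilation):
    """y[t] = sum_k kernel[k] * x[t - k * dilation], with zeros before the start."""
    out = []
    for t in range(len(x)):
        acc = 0.0
        for k, w in enumerate(kernel):
            idx = t - k * dilation
            if idx >= 0:
                acc += w * x[idx]
        out.append(acc)
    return out


# Hypothetical schedule: four layers, kernel size 3, dilation tripling per layer.
print(receptive_field(3, [1, 3, 9, 27]))  # 81 input steps
print(receptive_field(3, [1, 1, 1, 1]))   # 9 input steps without dilation

# An impulse at t=0 reappears at t=3 and t=6 when the dilation is 3.
impulse = [1.0] + [0.0] * 9
print(causal_dilated_conv1d(impulse, [1.0, 1.0, 1.0], dilation=3))
```

Each output y[t] depends only on the input, never on another output y[t'], so during training all time steps can be computed in parallel. This is the property that a recurrent layer lacks.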

Experimental Results

The empirical evaluation uses the LJ Speech Dataset. On a typical gaming PC with two GPUs, the DCTTS model was sufficiently trained in approximately 15 hours, reaching Mean Opinion Scores (MOS) comparable to or exceeding those of open implementations of Tacotron, which require significantly longer training periods. The model reached a MOS of 2.71 (on the standard 1-to-5 scale) after 15 hours of training, a quality level the abstract describes as "almost acceptable". This suggests that useful quality is reachable at a small fraction of the usual training cost.
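For readers unfamiliar with MOS, the following sketch shows how such a score is typically aggregated: listeners rate each sample on a 1-to-5 scale and the ratings are averaged. The ratings below are made up for illustration and are not data from the paper. The confidence interval uses a normal approximation, which is an assumption of this sketch.

```python
import math
import statistics


def mean_opinion_score(ratings):
    """Return (mean rating, half-width of an approximate 95% confidence interval)."""
    mean = statistics.mean(ratings)
    if len(ratings) < 2:
        return mean, float("nan")
    # Normal approximation: 1.96 standard errors on either side of the mean.
    half_width = 1.96 * statistics.stdev(ratings) / math.sqrt(len(ratings))
    return mean, half_width


# Hypothetical listener ratings on a 1-5 scale (illustrative only).
ratings = [3, 2, 3, 4, 2, 3, 3, 2]
mos, ci = mean_opinion_score(ratings)
print(f"MOS = {mos:.2f} +/- {ci:.2f}")
```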

Implications and Future Directions

The implications of this research are noteworthy for the field of TTS, particularly in lowering the barrier to entry for smaller teams and individuals who lack access to extensive computational resources. By introducing a CNN-only architecture for TTS, the work also opens up possibilities for real-time, on-device applications, owing to better computational efficiency and reduced memory requirements.

The paper forecasts several avenues for future research. These include exploring hyperparameter optimizations and integrating recent advances in deep learning to further enhance the synthesized audio quality. Moreover, the adaptability of the CNN-based TTS approach can be extended beyond standard speech synthesis to more personalized or affective speech synthesis tasks. There is also potential for further integration into multimodal systems, leveraging the reduced computational overhead.

In conclusion, this work advocates for a shift towards convolutional architectures in TTS systems, providing a robust framework that balances performance and resource utilization. The proposed methods hold promise for broader application and innovation within the AI and natural language processing communities.
