ESPnet-TTS: Unified, Reproducible, and Integratable Open Source End-to-End Text-to-Speech Toolkit (1910.10909v2)

Published 24 Oct 2019 in cs.CL and eess.AS

Abstract: This paper introduces a new end-to-end text-to-speech (E2E-TTS) toolkit named ESPnet-TTS, which is an extension of the open-source speech processing toolkit ESPnet. The toolkit supports state-of-the-art E2E-TTS models, including Tacotron 2, Transformer TTS, and FastSpeech, and also provides recipes inspired by the Kaldi automatic speech recognition (ASR) toolkit. The recipes are based on the design unified with the ESPnet ASR recipe, providing high reproducibility. The toolkit also provides pre-trained models and samples of all of the recipes so that users can use it as a baseline. Furthermore, the unified design enables the integration of ASR functions with TTS, e.g., ASR-based objective evaluation and semi-supervised learning with both ASR and TTS models. This paper describes the design of the toolkit and experimental evaluation in comparison with other toolkits. The experimental results show that our models can achieve state-of-the-art performance comparable to the other latest toolkits, resulting in a mean opinion score (MOS) of 4.25 on the LJSpeech dataset. The toolkit is publicly available at https://github.com/espnet/espnet.

Summary

The paper presents ESPnet-TTS, a comprehensive, open-source toolkit within the ESPnet framework designed for unified, reproducible, and integratable end-to-end text-to-speech research.
The toolkit supports E2E-TTS models such as Tacotron 2, Transformer TTS, and FastSpeech, enables integrated ASR-TTS research, and ensures high reproducibility through Kaldi-style recipes and pre-trained models.
Experimental results show that the implemented models reach a competitive MOS of 4.25 on LJSpeech and that the FastSpeech model is computationally efficient at generation time. The unified design is expected to support future work on adaptive and robust TTS systems.

Overview of ESPnet-TTS: An Open Source Toolkit for End-to-End Text-to-Speech

The paper presents ESPnet-TTS, a comprehensive open-source toolkit designed to facilitate end-to-end text-to-speech (E2E-TTS) research. The toolkit is a substantial addition to the ESPnet framework, providing tools for developing state-of-the-art E2E-TTS models, including Tacotron 2, Transformer TTS, and FastSpeech.

Key Features of ESPnet-TTS

ESPnet-TTS offers several notable features:

Model Support: The toolkit includes implementations for various prominent E2E-TTS models such as Tacotron 2, Transformer TTS, and FastSpeech. This variety allows researchers to compare and contrast performance across different architectures within a unified framework.
Integrated Design with ASR: Of particular interest is the integrated design shared with ESPnet's automatic speech recognition (ASR) recipe system. This integration not only promotes methodological consistency between TTS and ASR tasks but also enables advanced research, such as ASR-based objective evaluation and semi-supervised learning, combining both ASR and TTS models.
High Reproducibility: The toolkit ships with recipes that follow the structure of Kaldi ASR recipes and share a unified design with the ESPnet ASR recipes. These recipes make experiments highly reproducible and come with pre-trained models and samples for multiple languages, so users can test baselines and run demonstrations easily.
Practical Implementation Provisions: ESPnet-TTS is designed to be user-friendly, providing ease of experimentation while maintaining a high level of technical sophistication required by the E2E approach.
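
The ASR-based objective evaluation mentioned in the list above can be illustrated with a small sketch. The general idea is to synthesize speech from text, transcribe it with an ASR model, and compare the transcript with the input text using character error rate (CER). The code below covers only the scoring step. The function names and the example strings are hypothetical and are not part of ESPnet; the transcript would come from whatever ASR model is used.

```python
def edit_distance(ref: str, hyp: str) -> int:
    """Levenshtein distance between two character sequences."""
    # prev[j] holds the distance between the processed prefix of ref and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(
                min(
                    prev[j] + 1,         # deletion of a reference character
                    curr[j - 1] + 1,     # insertion of a hypothesis character
                    prev[j - 1] + cost,  # substitution, or match when cost is 0
                )
            )
        prev = curr
    return prev[-1]


def character_error_rate(ref: str, hyp: str) -> float:
    """CER = edit distance divided by the reference length."""
    if not ref:
        raise ValueError("reference text must be non-empty")
    return edit_distance(ref, hyp) / len(ref)


if __name__ == "__main__":
    # Made-up example: input text to TTS vs. a hypothetical ASR transcript.
    input_text = "the toolkit is publicly available"
    asr_transcript = "the toolkit is publically available"
    print(f"CER: {character_error_rate(input_text, asr_transcript):.3f}")
```

A lower CER suggests that the synthesized speech is more intelligible to the recognizer, which is why such a score can serve as an automatic complement to listening tests.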

Experimental Evaluation

The toolkit's efficacy is validated through a series of experiments comparing several E2E-TTS systems, with evaluations based on both objective and subjective metrics. The models implemented in ESPnet-TTS achieved a mean opinion score (MOS) of 4.25 on the LJSpeech dataset, indicating performance comparable to that of existing leading toolkits. The FastSpeech model was also notably efficient: when generating speech features on GPUs, it significantly outperformed the other models in terms of real-time factor (RTF), the ratio of processing time to the duration of the generated audio, where lower is better.
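
As a rough illustration of the two metrics in this section, the sketch below computes RTF from a processing time and an audio duration, and a MOS with a confidence interval from a list of listener ratings. The function names, the normal-approximation 95% interval, and all numbers in the usage example are assumptions for illustration; the source does not describe how its intervals or timings were computed.

```python
import math
import statistics


def real_time_factor(processing_seconds: float, audio_seconds: float) -> float:
    """RTF = time spent generating / duration of the generated audio."""
    if audio_seconds <= 0:
        raise ValueError("audio duration must be positive")
    return processing_seconds / audio_seconds


def mos_with_ci(ratings, z: float = 1.96):
    """Return (mean rating, half-width of a normal-approximation interval).

    z = 1.96 corresponds to a 95% interval (an assumption, see lead-in).
    """
    mean = statistics.mean(ratings)
    if len(ratings) < 2:
        return mean, 0.0
    # Standard error of the mean, using the sample standard deviation.
    half_width = z * statistics.stdev(ratings) / math.sqrt(len(ratings))
    return mean, half_width


if __name__ == "__main__":
    # Made-up numbers: 0.5 s of compute for 2.0 s of audio.
    print(f"RTF: {real_time_factor(0.5, 2.0):.2f}")

    # Made-up listener ratings on a 1-5 scale.
    mean, half = mos_with_ci([5, 4, 4, 5, 3])
    print(f"MOS: {mean:.2f} +/- {half:.2f}")
```

An RTF below 1 means that speech is generated faster than real time. Reporting a MOS together with its interval makes it easier to judge whether a difference between two systems is meaningful.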

Comparative Analysis

The manuscript provides a comprehensive comparative analysis of ESPnet-TTS against other popular E2E-TTS toolkits. Key factors of comparison included model support, multi-speaker capabilities, and pre-trained model availability. ESPnet-TTS showed advantages in supporting a broader range of models and providing extensive pre-trained resources.

Implications for Research and Future Developments

The introduction of ESPnet-TTS has implications for both practical and theoretical work on TTS. Unifying TTS and ASR in a single framework opens the way to research on multi-task and transfer learning across the two tasks. It could also accelerate work on adaptive TTS systems that cater to diverse speaker attributes and on robust models that deliver high-quality synthesis across varied linguistic contexts.

Future development plans for ESPnet-TTS include enhancements such as incorporating knowledge distillation techniques, expanding support for emotional and accent embeddings, and fine-tuning model architectures to achieve superior speech synthesis quality. These advancements could further position ESPnet-TTS as a cornerstone toolkit for cutting-edge research and development in speech processing domains.

Conclusion

ESPnet-TTS represents a significant step forward in the pursuit of efficient, reproducible, and integrable TTS systems. By building upon the strengths of existing speech processing frameworks and expanding functionality to encompass both ASR and TTS tasks within a unified architecture, this toolkit stands to greatly enhance both the scope and quality of future text-to-speech research.
