Deep Voice 2: Multi-Speaker Neural Text-to-Speech

Published 24 May 2017 in cs.CL | (1705.08947v2)

Abstract: We introduce a technique for augmenting neural text-to-speech (TTS) with low-dimensional trainable speaker embeddings to generate different voices from a single model. As a starting point, we show improvements over the two state-of-the-art approaches for single-speaker neural TTS: Deep Voice 1 and Tacotron. We introduce Deep Voice 2, which is based on a similar pipeline with Deep Voice 1, but constructed with higher performance building blocks and demonstrates a significant audio quality improvement over Deep Voice 1. We improve Tacotron by introducing a post-processing neural vocoder, and demonstrate a significant audio quality improvement. We then demonstrate our technique for multi-speaker speech synthesis for both Deep Voice 2 and Tacotron on two multi-speaker TTS datasets. We show that a single neural TTS system can learn hundreds of unique voices from less than half an hour of data per speaker, while achieving high audio quality synthesis and preserving the speaker identities almost perfectly.


Authors (8)

Citations (483)


Summary

The paper introduces low-dimensional trainable speaker embeddings to enable a single model to generate hundreds of distinct, high-fidelity voices.
It shows that adding a WaveNet-based neural vocoder to Tacotron raises its Mean Opinion Score (MOS) from 2.57 to 4.17.
The research achieves near-perfect speaker identity preservation with less than half an hour of data per speaker, making a single neural TTS system scale to many voices.


The paper "Deep Voice 2: Multi-Speaker Neural Text-to-Speech" augments neural text-to-speech (TTS) systems with low-dimensional trainable speaker embeddings so that a single model can generate multiple voices. It first improves on two existing single-speaker systems, Deep Voice 1 and Tacotron, and then extends both to hundreds of distinct voices using less than half an hour of data per speaker.

Methodology and Contributions

Deep Voice 2 Architecture: Deep Voice 2 keeps a pipeline similar to that of its predecessor, Deep Voice 1, but is built from higher-performance building blocks, which yields a significant improvement in audio quality.
Improved Tacotron with Neural Vocoder: The integration of a WaveNet-based spectrogram-to-audio neural vocoder in Tacotron replaces the traditional Griffin-Lim algorithm, enhancing overall audio output quality. This demonstrates the feasibility of using neural vocoders in TTS for more natural-sounding speech.
Multi-Speaker Training: Introducing trainable speaker embeddings into the Deep Voice 2 and Tacotron models allows a single neural TTS framework to learn and produce a wide variety of voices. The embedding method enables extensive parameter sharing among different voices within the model, significantly reducing data requirements for each speaker compared to single-speaker models.
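The parameter sharing described above can be sketched with a toy layer. This is a minimal illustration under stated assumptions, not the architecture from the paper: the class name `SpeakerConditionedLayer`, the layer sizes, and the choice to inject the embedding as a projected bias before a tanh are all hypothetical. It only shows the general idea that most weights are shared while each speaker owns one small trainable vector.

```python
import math
import random


class SpeakerConditionedLayer:
    """Toy layer: shared weights plus a per-speaker bias derived from an embedding.

    Hypothetical illustration only; sizes and conditioning scheme are assumptions.
    """

    def __init__(self, n_speakers, embed_dim, in_dim, out_dim, seed=0):
        rng = random.Random(seed)

        def matrix(rows, cols):
            return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]

        # One low-dimensional trainable vector per speaker.
        self.embeddings = matrix(n_speakers, embed_dim)
        # Weights shared by every speaker.
        self.weight = matrix(out_dim, in_dim)
        # Shared projection from embedding space to the layer's output size.
        self.proj = matrix(out_dim, embed_dim)

    def forward(self, x, speaker_id):
        e = self.embeddings[speaker_id]
        out = []
        for w_row, p_row in zip(self.weight, self.proj):
            shared = sum(w * xi for w, xi in zip(w_row, x))
            speaker_bias = sum(p * ei for p, ei in zip(p_row, e))
            out.append(math.tanh(shared + speaker_bias))
        return out

    def parameter_counts(self):
        """Return (number of shared parameters, parameters owned by one speaker)."""
        shared = sum(len(r) for r in self.weight) + sum(len(r) for r in self.proj)
        per_speaker = len(self.embeddings[0])
        return shared, per_speaker


layer = SpeakerConditionedLayer(n_speakers=100, embed_dim=16, in_dim=8, out_dim=32)
x = [0.5] * 8
out_a = layer.forward(x, speaker_id=0)  # same input, speaker 0
out_b = layer.forward(x, speaker_id=1)  # same input, speaker 1
shared, per_speaker = layer.parameter_counts()
```

In this sketch, adding a speaker adds only `embed_dim` parameters, while the weight and projection matrices are reused by every voice. That is the sense in which an embedding approach can lower the amount of data needed per speaker.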

Results

The paper reports experimental results showing that Deep Voice 2 outperforms Deep Voice 1 and that Tacotron improves when paired with a neural vocoder. Notably:

Deep Voice 2 reached a Mean Opinion Score (MOS) of 2.96, compared with 2.05 for Deep Voice 1, confirming the improvement in audio quality.
Tacotron paired with the WaveNet-based neural vocoder achieved an MOS of 4.17, compared with 2.57 when using Griffin-Lim.
Multi-speaker evaluations show that Deep Voice 2 can generate high-quality multi-speaker outputs with near-perfect speaker identity preservation, achieving classification accuracies comparable to ground truth samples.
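To make the size of these gains explicit, the short sketch below computes the absolute and relative MOS differences from the four values reported above. The dictionary keys and the helper `mos_gain` are names chosen for this illustration. Because MOS is a bounded rating scale, the relative percentages are only a rough indication.

```python
# MOS values as reported in the results above.
reported_mos = {
    "Deep Voice 1": 2.05,
    "Deep Voice 2": 2.96,
    "Tacotron (Griffin-Lim)": 2.57,
    "Tacotron (WaveNet)": 4.17,
}


def mos_gain(baseline, improved):
    """Return (absolute gain, relative gain in percent) between two systems."""
    base = reported_mos[baseline]
    new = reported_mos[improved]
    absolute = new - base
    return absolute, 100.0 * absolute / base


dv_abs, dv_rel = mos_gain("Deep Voice 1", "Deep Voice 2")
taco_abs, taco_rel = mos_gain("Tacotron (Griffin-Lim)", "Tacotron (WaveNet)")

print(f"Deep Voice 1 -> Deep Voice 2: +{dv_abs:.2f} MOS ({dv_rel:.1f}%)")
print(f"Tacotron, Griffin-Lim -> WaveNet: +{taco_abs:.2f} MOS ({taco_rel:.1f}%)")
```

The absolute gains are 0.91 points for Deep Voice 2 over Deep Voice 1 and 1.60 points for the vocoder swap in Tacotron, so the vocoder change accounts for the larger of the two improvements.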

Implications and Future Work

The implications of this research are both practical and theoretical. Practically, the ability to generate high-fidelity multi-speaker TTS with minimal data per speaker has significant potential across various applications such as accessibility tools, virtual assistants, and media production. Theoretically, this work advances understanding in the domain of neural TTS systems, particularly in efficient speaker representation and model scalability.

Future investigations could explore the scalability limits of these methods, examining how many speakers can be effectively incorporated and the minimal data requirements for high-quality synthesis. Additionally, research could focus on the adaptability of trained models to new speakers, potentially allowing for dynamic updating of speaker embeddings without retraining the entire system. There is also potential to leverage the learned embeddings for other tasks, such as speaker conversion or voice cloning, expanding the utility of the embeddings beyond TTS.
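The idea of updating a speaker embedding without retraining the whole system can be illustrated with a deliberately small toy problem. This sketch is an assumption-laden illustration, not a method from the paper: the model is a single frozen linear map, the function `adapt_embedding` and all values are hypothetical, and plain gradient descent on a squared error stands in for whatever training objective a real system would use.

```python
def adapt_embedding(shared_weight, target, steps=200, lr=0.1):
    """Fit a new speaker embedding by gradient descent; shared weights stay frozen."""
    rows, cols = len(shared_weight), len(shared_weight[0])
    embedding = [0.0] * cols
    losses = []
    for _ in range(steps):
        # Forward pass of the toy model: output = W @ e.
        output = [
            sum(shared_weight[r][c] * embedding[c] for c in range(cols))
            for r in range(rows)
        ]
        error = [o - t for o, t in zip(output, target)]
        losses.append(sum(err * err for err in error))
        # Gradient of the squared error with respect to e is 2 * W^T @ error.
        grad = [
            2.0 * sum(shared_weight[r][c] * error[r] for r in range(rows))
            for c in range(cols)
        ]
        # Only the embedding is updated; shared_weight is never modified.
        embedding = [e - lr * g for e, g in zip(embedding, grad)]
    return embedding, losses


# Hypothetical frozen weights and a target produced by the embedding [0.5, -0.5].
frozen_weight = [[1.0, 0.5], [0.0, 1.0], [0.5, 0.0]]
new_speaker_target = [0.25, -0.5, 0.25]
embedding, losses = adapt_embedding(frozen_weight, new_speaker_target)
```

In this toy setting the optimizer recovers the embedding that generated the target while the shared weights are left untouched. Whether the same holds for a full TTS model with few samples from a new speaker is exactly the open question raised above.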

This study marks a step in the evolution of neural TTS systems toward more versatile and data-efficient multi-speaker models.
