SC-GlowTTS: an Efficient Zero-Shot Multi-Speaker Text-To-Speech Model

Published 2 Apr 2021 in eess.AS and cs.SD | (2104.05557v2)

Abstract: In this paper, we propose SC-GlowTTS: an efficient zero-shot multi-speaker text-to-speech model that improves similarity for speakers unseen during training. We propose a speaker-conditional architecture that explores a flow-based decoder that works in a zero-shot scenario. As text encoders, we explore a dilated residual convolutional-based encoder, gated convolutional-based encoder, and transformer-based encoder. Additionally, we have shown that adjusting a GAN-based vocoder for the spectrograms predicted by the TTS model on the training dataset can significantly improve the similarity and speech quality for new speakers. Our model converges using only 11 speakers, reaching state-of-the-art results for similarity with new speakers, as well as high speech quality.


Authors (9)


Summary

The paper introduces a novel zero-shot TTS model that leverages a speaker-conditional architecture to synthesize personalized voices without additional training.
It compares three text encoder designs (dilated residual convolutional, gated convolutional, and transformer-based) for multi-speaker speech synthesis.
Trained on only 11 speakers and paired with a fine-tuned HiFi-GAN vocoder, the model is reported to outperform Tacotron 2 baselines on both MOS and SECS for unseen speakers.

Overview of SC-GlowTTS: An Efficient Zero-Shot Multi-Speaker Text-To-Speech Model

The paper introduces SC-GlowTTS, a text-to-speech (TTS) model for zero-shot multi-speaker synthesis: it generates speech in the voice of a speaker who was not seen during training, without any further training on that speaker. The research aims to improve how closely the synthesized speech resembles such unseen speakers, which matters for personalized voice synthesis applications.

Key Contributions

Speaker-Conditional Architecture: SC-GlowTTS conditions a flow-based decoder on the target speaker. The contribution is making this decoder work in a zero-shot scenario, so that voices of new speakers can be synthesized without additional training.
Encoder Exploration: The research compares three text encoders (a dilated residual convolutional encoder, a gated convolutional encoder, and a transformer-based encoder) to determine which works best for multi-speaker TTS.
Vocoder Adjustment: The study shows that adjusting a GAN-based vocoder to the spectrograms that the TTS model predicts on the training dataset improves both the similarity and the quality of speech synthesized for new speakers.
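
To make the idea of a speaker-conditional flow concrete, the following is a minimal sketch of a single affine coupling layer whose scale and shift depend on a speaker embedding. It is an illustration only, not the paper's decoder: the function names, the toy linear conditioning network, the `WEIGHTS` values, and the tiny vector sizes are all assumptions chosen to keep the example small. It shows the two properties that matter here: the layer is exactly invertible for a fixed speaker embedding, and changing the embedding changes the output.

```python
import math

# Hypothetical fixed weights for a toy conditioning network.
# Each row maps [x_a (2 values) ; speaker_emb (3 values)] to one output.
WEIGHTS = {
    "log_s": [[0.1, -0.2, 0.3, 0.0, 0.2], [0.05, 0.1, -0.1, 0.2, -0.3]],
    "t": [[0.5, 0.1, -0.4, 0.3, 0.0], [-0.2, 0.4, 0.1, -0.1, 0.2]],
}


def scale_and_shift(x_a, speaker_emb, weights):
    """Toy stand-in for the conditioning network (an assumption).

    Returns a log-scale and a shift for each element of the second half
    of the input, computed from the first half and the speaker embedding.
    """
    inputs = x_a + speaker_emb  # list concatenation
    log_s = [math.tanh(sum(w * v for w, v in zip(row, inputs)))
             for row in weights["log_s"]]
    t = [sum(w * v for w, v in zip(row, inputs)) for row in weights["t"]]
    return log_s, t


def coupling_forward(x, speaker_emb, weights):
    """Transform x; the first half passes through unchanged."""
    half = len(x) // 2
    x_a, x_b = x[:half], x[half:]
    log_s, t = scale_and_shift(x_a, speaker_emb, weights)
    y_b = [xb * math.exp(ls) + ti for xb, ls, ti in zip(x_b, log_s, t)]
    log_det = sum(log_s)  # log-determinant of the Jacobian
    return x_a + y_b, log_det


def coupling_inverse(y, speaker_emb, weights):
    """Exactly undo coupling_forward for the same speaker embedding."""
    half = len(y) // 2
    y_a, y_b = y[:half], y[half:]
    log_s, t = scale_and_shift(y_a, speaker_emb, weights)
    x_b = [(yb - ti) * math.exp(-ls) for yb, ls, ti in zip(y_b, log_s, t)]
    return y_a + x_b


# Round trip with one (made-up) speaker embedding.
frame = [0.5, -1.0, 2.0, 0.3]
speaker = [0.2, -0.1, 0.7]
out, log_det = coupling_forward(frame, speaker, WEIGHTS)
recovered = coupling_inverse(out, speaker, WEIGHTS)
print("max round-trip error:", max(abs(a - b) for a, b in zip(frame, recovered)))
```

Because the first half of the input is passed through unchanged, the inverse can recompute the same scale and shift and undo the transformation. In a zero-shot setting the embedding would come from a separately trained speaker encoder applied to reference audio of the new speaker; that encoder is outside the scope of this sketch.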

Experimental Results

SC-GlowTTS converged when trained on only 11 speakers. It was evaluated with the Mean Opinion Score (MOS), a subjective rating of speech quality, and the Speaker Encoder Cosine Similarity (SECS), the cosine similarity between speaker-encoder embeddings of synthesized and reference speech. On both metrics the results indicate high-quality speech that closely resembles speakers unseen during training. In particular, SC-GlowTTS with the HiFi-GAN vocoder is reported to outperform Tacotron 2 baselines on SECS and MOS for unseen speakers.
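
As a rough illustration of how a SECS-style score can be computed, the sketch below takes pairs of speaker embeddings (one from reference audio, one from synthesized audio) and averages their cosine similarity. The speaker encoder that would produce these embeddings is assumed to exist elsewhere and is not shown; the function names and the choice to average over pairs are assumptions for this example, not details taken from the paper.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors, in [-1, 1]."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


def secs(embedding_pairs):
    """Mean cosine similarity over (reference, synthesized) embedding pairs.

    Each embedding is assumed to come from the same speaker encoder.
    """
    if not embedding_pairs:
        raise ValueError("at least one embedding pair is required")
    scores = [cosine_similarity(ref, syn) for ref, syn in embedding_pairs]
    return sum(scores) / len(scores)


# Made-up 3-dimensional embeddings purely for demonstration.
pairs = [
    ([0.9, 0.1, 0.3], [0.8, 0.2, 0.35]),
    ([0.1, 0.7, 0.6], [0.2, 0.6, 0.7]),
]
print("SECS-style score:", secs(pairs))
```

A score near 1 means the synthesized speech sits close to the reference speaker in the encoder's embedding space, while a score near 0 means the two are unrelated.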

Among the encoder variants, the transformer-based SC-GlowTTS-Trans achieved the highest SECS, ahead of SC-GlowTTS-Res and SC-GlowTTS-Gated. Fine-tuning the HiFi-GAN vocoder further improved the results for all tested models.

Implications and Future Work

The findings are relevant to TTS systems that must adapt to new speakers with minimal data, such as personalized voice assistants and other applications that need to switch speakers quickly. Because the model converges with few training speakers, the approach may also suit low-resource languages and other audio synthesis tasks where multi-speaker data is scarce.

Future work, as proposed by the authors, includes extending SC-GlowTTS to few-shot learning, in which a small amount of speech from the target speaker is used for adaptation. Exploring additional encoder architectures and optimizing vocoder integration could further refine the model's performance. Additionally, potential applications in cross-lingual TTS could expand the model's utility beyond monolingual contexts.

In summary, SC-GlowTTS offers a promising avenue for zero-shot multi-speaker TTS, showcasing advancements in model architecture and training efficiency. The comprehensive experimental evaluations provide a robust foundation for future research endeavors in adaptive and high-fidelity speech synthesis.
