Zero-Shot Multi-Speaker Text-To-Speech with State-of-the-art Neural Speaker Embeddings

Published 23 Oct 2019 in eess.AS | (1910.10838v2)

Abstract: While speaker adaptation for end-to-end speech synthesis using speaker embeddings can produce good speaker similarity for speakers seen during training, there remains a gap for zero-shot adaptation to unseen speakers. We investigate multi-speaker modeling for end-to-end text-to-speech synthesis and study the effects of different types of state-of-the-art neural speaker embeddings on speaker similarity for unseen speakers. Learnable dictionary encoding-based speaker embeddings with angular softmax loss can improve equal error rates over x-vectors in a speaker verification task; these embeddings also improve speaker similarity and naturalness for unseen speakers when used for zero-shot adaptation to new speakers in end-to-end speech synthesis.

Summary

The paper shows that state-of-the-art neural speaker embeddings can improve zero-shot adaptation to unseen speakers in multi-speaker TTS systems.
It compares x-vector and learnable dictionary encoding (LDE) embeddings across network architectures, pooling methods, and classifier configurations.
Listening tests show improved naturalness and speaker similarity for unseen speakers, suggesting that advances in automatic speaker verification (ASV) can carry over to TTS.

The paper examines zero-shot speaker adaptation in end-to-end text-to-speech (TTS) synthesis. In particular, it studies how well various state-of-the-art neural speaker embeddings, such as learnable dictionary encoding (LDE) trained with angular softmax loss, serve multi-speaker TTS. The work belongs to a broader line of research that reuses speaker embeddings, originally developed for speaker verification, for speaker adaptation in TTS systems.

Key Highlights

Speaker Adaptation Challenges: Multi-speaker TTS models reproduce speakers seen during training with good similarity. Adapting robustly to "unseen" speakers without further training, known as zero-shot adaptation, remains an open problem. This study addresses that gap by using state-of-the-art neural speaker embeddings in TTS.
Neural Speaker Embeddings: The experiments centered on x-vectors and LDE embeddings, which represent different approaches to speaker recognition. The primary goal was to determine whether embeddings known to perform well in automatic speaker verification (ASV) also improve speaker similarity and naturalness in TTS for unseen speakers.
Experimentation with Embeddings: A series of embedding configurations was tested. The embeddings varied in network architecture (e.g., TDNN versus ResNet34), pooling technique (e.g., statistics pooling versus LDE), and classifier configuration (e.g., softmax versus angular softmax). Each configuration was assessed on how well it generalized to unseen speakers.
TTS Model Architecture: The multi-speaker TTS system was built on an extended version of the Tacotron model and integrated self-attention mechanisms to better capture long-range dependencies. Speaker embeddings were concatenated with features at different points in the TTS pipeline to condition the output on speaker identity.
Evaluation: The authors evaluated the TTS systems with mean opinion score (MOS) and DMOS listening tests, which quantified the naturalness and speaker similarity achieved for both seen and unseen speakers. The results suggested that the best LDE configurations outperformed x-vectors in both naturalness and speaker similarity for unseen speakers.
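
Two of the mechanisms named in the highlights, statistics pooling and conditioning by concatenation, can be illustrated with a short NumPy sketch. This is not the authors' implementation: the function names, array shapes, and random inputs are all assumptions chosen for illustration. In a real x-vector network the pooled vector would pass through further layers before serving as the embedding; here it stands in for the embedding directly.

```python
import numpy as np


def statistics_pooling(frame_features: np.ndarray) -> np.ndarray:
    """Collapse (num_frames, feat_dim) frame-level features into one
    fixed-length vector by concatenating the per-dimension mean and
    standard deviation over time. Output shape: (2 * feat_dim,)."""
    mean = frame_features.mean(axis=0)
    std = frame_features.std(axis=0)
    return np.concatenate([mean, std])


def condition_on_speaker(encoder_outputs: np.ndarray,
                         speaker_embedding: np.ndarray) -> np.ndarray:
    """Repeat one utterance-level speaker embedding at every time step
    and concatenate it to the (num_steps, enc_dim) encoder outputs.
    Output shape: (num_steps, enc_dim + emb_dim)."""
    num_steps = encoder_outputs.shape[0]
    tiled = np.tile(speaker_embedding, (num_steps, 1))
    return np.concatenate([encoder_outputs, tiled], axis=1)


# Hypothetical inputs: random numbers stand in for real features.
rng = np.random.default_rng(0)
frames = rng.normal(size=(120, 8))           # 120 frames, 8-dim features
embedding = statistics_pooling(frames)       # shape (16,)
encoder_outputs = rng.normal(size=(30, 32))  # 30 text steps, 32-dim states
conditioned = condition_on_speaker(encoder_outputs, embedding)  # shape (30, 48)
```

Because the embedding is computed from audio alone, the same two steps apply to a speaker never seen during TTS training, which is what makes the zero-shot setting possible. The input length (120 frames here) does not affect the embedding size, so reference utterances of any duration yield a vector the TTS model can consume.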

Implications and Future Directions

The results underscore the potential of neural speaker embeddings for overcoming the limitations of zero-shot speaker adaptation in TTS systems. The fact that embeddings that perform better in ASV also perform better in TTS suggests a promising avenue for cross-pollination between the two domains. Practically, the work opens possibilities for deploying TTS in applications that must adapt quickly to many different speakers without extensive fine-tuning or transcribed data.

Future investigations could address overfitting issues identified for seen speakers, possibly through speaker space augmentation techniques. Another promising research direction involves evaluating how these embeddings handle idiosyncratic features of speech, such as dialect nuances and stylistic variations, which remain underexplored in the current study.

This paper contributes to the growing body of work on speaker-adaptive synthesis by providing a systematic evaluation of different speaker embedding methods and by setting the stage for further research on bringing ASV and TTS frameworks closer together for real-world applications.
