
E2 TTS: Embarrassingly Easy Fully Non-Autoregressive Zero-Shot TTS

(arXiv:2406.18009)
Published Jun 26, 2024 in eess.AS and cs.SD

Abstract

This paper introduces Embarrassingly Easy Text-to-Speech (E2 TTS), a fully non-autoregressive zero-shot text-to-speech system that offers human-level naturalness and state-of-the-art speaker similarity and intelligibility. In the E2 TTS framework, the text input is converted into a character sequence with filler tokens. The flow-matching-based mel spectrogram generator is then trained based on the audio infilling task. Unlike many previous works, it does not require additional components (e.g., duration model, grapheme-to-phoneme) or complex techniques (e.g., monotonic alignment search). Despite its simplicity, E2 TTS achieves state-of-the-art zero-shot TTS capabilities that are comparable to or surpass previous works, including Voicebox and NaturalSpeech 3. The simplicity of E2 TTS also allows for flexibility in the input representation. We propose several variants of E2 TTS to improve usability during inference. See https://aka.ms/e2tts/ for demo samples.

Figure: Overview of the E2 TTS training and inference processes.

Overview

  • E2 TTS introduces a non-autoregressive (NAR) framework for zero-shot text-to-speech (TTS), simplifying the pipeline by eliminating traditional complexities like duration models and grapheme-to-phoneme conversions.

  • The model employs a flow-matching-based mel spectrogram generator, leveraging a vanilla Transformer with U-Net style skip connections, and has shown state-of-the-art performance in speaker similarity and intelligibility.

  • Extensions such as E2 TTS X1 and E2 TTS X2 enhance usability by eliminating transcription needs and allowing explicit pronunciation guidance, respectively.


The paper "E2 TTS: Embarrassingly Easy Fully Non-Autoregressive Zero-Shot TTS" introduces a deliberately simple approach to zero-shot text-to-speech (TTS) that nonetheless achieves state-of-the-art performance. The main appeal of E2 TTS lies in its fully non-autoregressive (NAR) architecture and its elimination of traditional complexities such as duration models and grapheme-to-phoneme conversion.

Core Contributions

The primary contributions of the paper can be identified as follows:

  1. Non-Autoregressive Architecture: Unlike autoregressive models, which suffer from increased inference latency due to sequential sampling, E2 TTS leverages a fully NAR framework. This significantly speeds up the inference process.
  2. Flow-Matching-Based Mel Spectrogram Generator: The architecture employs a flow-matching-based mel spectrogram generator, simplifying the entire pipeline. The model is trained on an audio infilling task using a vanilla Transformer with U-Net style skip connections (a hedged sketch of the training objective follows this list).
  3. Zero-Shot Capabilities: E2 TTS exhibits state-of-the-art performance in speaker similarity and intelligibility, often surpassing established models like Voicebox and NaturalSpeech 3.
  4. Usability Variants: The paper introduces extensions such as E2 TTS X1, which eliminates the need for transcription of the audio prompt, and E2 TTS X2, allowing explicit pronunciation guidance for specific words.
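To make the training objective concrete, below is a minimal sketch of a conditional flow-matching loss with an audio-infilling mask, in the style of optimal-transport CFM used by Voicebox-like models. The model signature, tensor layout, and sigma_min value are illustrative assumptions, not the authors' exact configuration.

```python
import torch

def conditional_flow_matching_loss(model, mel, text_emb, mask, sigma_min=1e-5):
    """One training step of a flow-matching infilling objective (a sketch;
    the model signature and sigma_min are assumptions, not the paper's
    exact setup).

    mel:      (B, T, D) target mel spectrogram x1
    text_emb: (B, T, D) embedded character sequence with filler tokens
    mask:     (B, T)    1 where mel frames are masked (to be infilled)
    """
    x1 = mel
    x0 = torch.randn_like(x1)                        # noise sample
    t = torch.rand(x1.size(0), 1, 1, device=x1.device)

    # Linear interpolation path between noise and data (OT-CFM).
    xt = (1 - (1 - sigma_min) * t) * x0 + t * x1
    target = x1 - (1 - sigma_min) * x0               # target vector field

    # Condition on the unmasked mel frames plus the text embedding.
    cond = torch.where(mask.unsqueeze(-1).bool(), torch.zeros_like(x1), x1)
    pred = model(xt, cond, text_emb, t.squeeze())    # predicted vector field

    # The loss is computed only over the masked (infilled) region.
    m = mask.unsqueeze(-1).float()
    return ((pred - target) ** 2 * m).sum() / m.sum().clamp(min=1.0)
```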

Methodology

Training Process

The training phase converts the text input into a character sequence padded with filler tokens so that its length matches the mel spectrogram sequence (a sketch of this padding follows below). The mel spectrogram generator, the key component, is trained with a flow-matching objective to map this input to the corresponding audio. The simplicity of the architecture, which dispenses with traditional components such as phoneme aligners, underscores its effectiveness.
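As a minimal sketch of the input construction: the character sequence is extended with a special filler token up to the mel-frame length. The filler token name here is an illustrative assumption.

```python
# Pad a character sequence with filler tokens until its length matches the
# number of mel frames; the "<F>" token name is an assumption for illustration.
FILLER = "<F>"

def extend_with_fillers(text: str, num_mel_frames: int) -> list[str]:
    chars = list(text)
    if len(chars) > num_mel_frames:
        raise ValueError("mel sequence must be at least as long as the text")
    return chars + [FILLER] * (num_mel_frames - len(chars))

# e.g. a 5-character utterance aligned against 8 mel frames:
print(extend_with_fillers("hello", 8))
# ['h', 'e', 'l', 'l', 'o', '<F>', '<F>', '<F>']
```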

Inference Process

During inference, E2 TTS generates a mel-filterbank sequence by sampling from the distribution learned during training, conditioned on the audio prompt and the character sequence. Because no explicit duration model is involved, the target duration of the output speech can be set arbitrarily at inference time, which adds flexibility (see the sampling sketch below).
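Below is a minimal sketch of flow-matching inference with a fixed-step Euler ODE solver; the model signature, step count, and conditioning layout are assumptions rather than the authors' exact implementation.

```python
import torch

@torch.no_grad()
def sample_mel(model, prompt_mel, text_emb, total_len, n_steps=32):
    """Euler-method ODE sampling for a flow-matching mel generator (a sketch).

    prompt_mel: (1, P, D) mel of the audio prompt, kept fixed as condition
    text_emb:   (1, T, D) embedded prompt + target character sequence
    total_len:  total number of frames (prompt + generated); chosen freely
                at inference time, which is how the output duration is set.
    """
    P, D = prompt_mel.size(1), prompt_mel.size(2)
    x = torch.randn(1, total_len, D)                 # start from pure noise
    # The prompt frames are provided as condition; the rest are masked (zeros).
    cond = torch.cat([prompt_mel, torch.zeros(1, total_len - P, D)], dim=1)

    dt = 1.0 / n_steps
    for i in range(n_steps):
        t = torch.full((1,), i * dt)
        v = model(x, cond, text_emb, t)              # predicted vector field
        x = x + dt * v                               # Euler step along the flow
    return x[:, P:]                                  # generated frames only
```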

Empirical Evaluation

Objective Metrics

The paper provides a comparative analysis of E2 TTS against established baselines, namely VALL-E, NaturalSpeech 3, and Voicebox, using the LibriSpeech-PC dataset for evaluation:

  • Word Error Rate (WER): E2 TTS achieves a WER of 1.9%, outperforming all tested baselines.
  • Speaker Similarity (SIM-o): The model demonstrates high speaker similarity against the original audio (SIM-o = 0.708), showing its robustness in preserving speaker characteristics (a sketch of how such a score is computed follows below).
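For context, SIM-o is typically the cosine similarity between speaker embeddings of the generated audio and the original (not re-vocoded) reference. The choice of speaker encoder, e.g. a WavLM-based verification model as in related work, is an assumption in this sketch.

```python
import torch
import torch.nn.functional as F

def speaker_similarity(emb_generated: torch.Tensor, emb_reference: torch.Tensor) -> float:
    """SIM-o style score: cosine similarity between speaker embeddings of the
    generated speech and the original reference. The speaker encoder that
    produces these embeddings is assumed, not specified here."""
    return F.cosine_similarity(emb_generated, emb_reference, dim=-1).item()
```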

Subjective Metrics

  • CMOS: E2 TTS obtains a comparative mean opinion score (CMOS) of -0.05 relative to ground-truth recordings, indicating naturalness effectively indistinguishable from human speech.
  • SMOS: In similarity mean opinion score (SMOS) evaluations, listeners rate the speaker similarity of E2 TTS on par with or above the ground-truth recordings, showcasing its ability to generate highly natural-sounding audio.

Extensions

E2 TTS X1 and E2 TTS X2 further enhance the model's usability:

  • E2 TTS X1: Eliminates the need for a transcription of the audio prompt, while maintaining WER and SIM-o comparable to the base model.
  • E2 TTS X2: Allows the pronunciation of specific words to be specified explicitly, enabling better handling of unusual terms without retraining the model (see the sketch below).
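As a hedged sketch of the X2 input convention: a word whose pronunciation should be controlled is replaced, in the character sequence, by a phoneme string wrapped in markers. The marker characters and the phoneme inventory used here are illustrative assumptions.

```python
# Replace a word in the input text with an explicit phoneme sequence wrapped
# in parentheses; the parenthesis markers and ARPAbet-style phonemes are
# assumptions for illustration, not the paper's exact convention.
def with_pronunciation(text: str, word: str, phonemes: list[str]) -> str:
    return text.replace(word, "(" + " ".join(phonemes) + ")")

# e.g. forcing a pronunciation for an unusual term:
print(with_pronunciation("say Kubernetes now", "Kubernetes",
                         ["K", "UW2", "B", "ER0", "N", "EH1", "T", "IY0", "Z"]))
# say (K UW2 B ER0 N EH1 T IY0 Z) now
```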

Practical and Theoretical Implications

Practical Implications:

  • Speed: The fully NAR approach ensures quick inference times, making E2 TTS suitable for real-time applications.
  • Simplicity: By eliminating complex dependencies like duration models and phoneme aligners, E2 TTS simplifies deployment and maintenance.

Theoretical Implications:

  • Modeling Simplicity vs. Performance: The success of E2 TTS demonstrates that simpler, well-designed models can achieve or surpass the performance of more complex architectures.
  • Potential for Further Simplification: The model opens avenues for exploring other simplification strategies in TTS and related fields, promoting a trend towards more efficient, scalable AI systems.

Conclusion and Future Directions

The introduction of E2 TTS marks a significant advancement in the field of zero-shot TTS. Its non-autoregressive framework and simplicity do not compromise performance, making it a robust and flexible solution. Future research could explore integrating E2 TTS with other language models, refining training techniques further, and expanding its utility across diverse applications.

In summary, E2 TTS presents a compelling paradigm shift towards simpler, yet high-performing TTS systems, setting a new standard for future developments in the domain.
