Emergent Mind

NaturalSpeech 3: Zero-Shot Speech Synthesis with Factorized Codec and Diffusion Models

(2403.03100)
Published Mar 5, 2024 in eess.AS , cs.AI , cs.CL , cs.LG , and cs.SD

Abstract

While recent large-scale text-to-speech (TTS) models have achieved significant progress, they still fall short in speech quality, similarity, and prosody. Considering speech intricately encompasses various attributes (e.g., content, prosody, timbre, and acoustic details) that pose significant challenges for generation, a natural idea is to factorize speech into individual subspaces representing different attributes and generate them individually. Motivated by this, we propose NaturalSpeech 3, a TTS system with novel factorized diffusion models to generate natural speech in a zero-shot way. Specifically, 1) we design a neural codec with factorized vector quantization (FVQ) to disentangle the speech waveform into subspaces of content, prosody, timbre, and acoustic details; 2) we propose a factorized diffusion model to generate the attributes in each subspace following its corresponding prompt. With this factorization design, NaturalSpeech 3 can effectively and efficiently model intricate speech with disentangled subspaces in a divide-and-conquer way. Experiments show that NaturalSpeech 3 outperforms state-of-the-art TTS systems on quality, similarity, prosody, and intelligibility, and achieves on-par quality with human recordings. Furthermore, we achieve better performance by scaling to 1B parameters and 200K hours of training data.

Figure: Framework of the factorized diffusion model, with a phoneme encoder and four diffusion processes sharing a common formulation.

Overview

  • Introduces NaturalSpeech 3 (NS3), a text-to-speech (TTS) synthesis system leveraging factorized diffusion models and a novel neural codec for high-quality, zero-shot speech synthesis.

  • NS3's FACodec disentangles speech into content, prosody, timbre, and acoustic details, improving control and flexibility in speech synthesis.

  • Empirical tests show NS3 surpasses existing TTS systems in speech quality, voice similarity, and intelligibility on the LibriSpeech test set.

  • The paper discusses NS3's theoretical implications for future natural and controllable TTS systems, suggesting directions for multi-lingual and varied audio synthesis research.

Exploring Zero-Shot Speech Synthesis with NaturalSpeech 3: A Leap Towards Natural and Controllable TTS Systems

Introduction

Text-to-speech (TTS) synthesis, the cornerstone of contemporary voice applications, has experienced remarkable advancements driven by the integration of deep learning. Despite these achievements, current large-scale TTS models still display limitations, particularly in achieving speech of superior quality, similarity, and prosody. To address these challenges, this study introduces NaturalSpeech 3 (NS3), leveraging factorized diffusion models for zero-shot speech synthesis and drawing upon a novel neural codec equipped with factorized vector quantization (FVQ) for speech attribute disentanglement.

Key Contributions

NaturalSpeech 3 centers around two pivotal components: the FACodec for attribute factorization and the factorized diffusion model for efficient speech generation across disentangled subspaces.

  • FACodec: This new codec disentangles speech into distinct subspaces, specifically content, prosody, timbre, and acoustic details, thereby simplifying the modeling process.
  • Factorized Diffusion Model: Extended from FACodec's disentanglement, this diffusion model generates individual speech attributes in their respective subspaces, offering enhanced control and flexibility in speech synthesis.
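To make the factorization idea concrete, the sketch below shows the core mechanism of factorized vector quantization: one latent frame is split into per-attribute subspaces, and each subspace is quantized against its own codebook via nearest-neighbour lookup. The subspace names match the paper, but the dimensions, codebook sizes, and random codebooks are illustrative assumptions, not FACodec's actual configuration (which also uses residual quantization and trained codebooks).

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes: a 256-dim frame latent split into four 64-dim
# attribute subspaces, each with its own small codebook.
SUBSPACES = ["content", "prosody", "timbre", "acoustic_details"]
DIM, CODEBOOK_SIZE = 64, 128

codebooks = {name: rng.normal(size=(CODEBOOK_SIZE, DIM)) for name in SUBSPACES}

def factorized_quantize(latent):
    """Quantize each attribute subspace with its own codebook
    (nearest-neighbour lookup); return per-subspace indices and
    the concatenated quantized latent."""
    assert latent.shape == (len(SUBSPACES) * DIM,)
    indices, parts = {}, []
    for i, name in enumerate(SUBSPACES):
        chunk = latent[i * DIM:(i + 1) * DIM]
        dists = np.linalg.norm(codebooks[name] - chunk, axis=1)
        idx = int(np.argmin(dists))
        indices[name] = idx
        parts.append(codebooks[name][idx])
    return indices, np.concatenate(parts)

frame = rng.normal(size=len(SUBSPACES) * DIM)
indices, quantized = factorized_quantize(frame)
```

Because each attribute is encoded by a separate token stream, a downstream generator (here, the factorized diffusion model) can condition each stream on its own prompt, which is what enables the divide-and-conquer generation described above.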

Empirical Evaluation

Our comprehensive experiments demonstrate NaturalSpeech 3's superiority over existing TTS systems across multiple dimensions:

  • Significantly improved speech quality, mirroring or surpassing ground-truth speech in both qualitative and quantitative measures on the LibriSpeech test set.
  • Improved accuracy in matching the prompt speech's voice and prosody, yielding state-of-the-art similarity scores.
  • Enhanced speech intelligibility, as evidenced by a reduction in word error rate (WER) metrics.
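For readers unfamiliar with the intelligibility metric, WER is the word-level Levenshtein (edit) distance between an ASR transcript of the synthesized speech and the reference text, normalized by the reference length. A minimal sketch (the example sentences are made up for illustration):

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / #reference words,
    computed with word-level dynamic-programming edit distance."""
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j]: edit distance between ref[:i] and hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,        # deletion
                           dp[i][j - 1] + 1,        # insertion
                           dp[i - 1][j - 1] + cost)  # substitution / match
    return dp[len(ref)][len(hyp)] / len(ref)

# One substitution ("the" -> "a") over six reference words: WER = 1/6
wer = word_error_rate("the cat sat on the mat", "the cat sat on a mat")
```

A lower WER on ASR transcripts of generated speech indicates that the synthesized content is easier to recognize, i.e., more intelligible.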

Furthermore, the scalability of NS3 is showcased through experiments that expand the system to 1 billion parameters and 200k hours of training data, presenting a promising avenue for future enhancements.

Theoretical Implications and Future Directions

The introduction of NS3 constitutes a crucial step forward in the quest for highly natural and controllable speech synthesis. By conceptualizing speech as a composition of disentangled attributes and applying a divide-and-conquer strategy to their generation, the model gains finer control over the characteristics of the synthesized speech. This flexibility paves the way for a range of applications, from customizable voice assistants to sophisticated audio content generation.

Future research directions could extend the efficacy of the factorized diffusion model and explore its applicability in multi-lingual contexts or other forms of audio synthesis. Additionally, investigating the semantic integration between textual content and prosodic features could yield further improvements in naturalness and expressiveness.

Conclusion

NaturalSpeech 3 pushes the boundary of what is achievable in text-to-speech synthesis, marking a significant leap towards truly lifelike and customizable synthetic speech. Through its novel approach to speech factorization and generation, NS3 not only achieves state-of-the-art results but also introduces a versatile framework for future innovations in generative AI.
