
FlashSpeech: Efficient Zero-Shot Speech Synthesis

(arXiv:2404.14700)
Published Apr 23, 2024 in eess.AS, cs.AI, cs.CL, cs.LG, and cs.SD

Abstract

Recent progress in large-scale zero-shot speech synthesis has been significantly advanced by language models and diffusion models. However, the generation process of both methods is slow and computationally intensive. Efficient speech synthesis using a lower computing budget to achieve quality on par with previous work remains a significant challenge. In this paper, we present FlashSpeech, a large-scale zero-shot speech synthesis system with approximately 5% of the inference time of previous work. FlashSpeech is built on the latent consistency model and applies a novel adversarial consistency training approach that can train from scratch without the need for a pre-trained diffusion model as the teacher. Furthermore, a new prosody generator module enhances the diversity of prosody, making the rhythm of the speech sound more natural. The generation process of FlashSpeech can be completed efficiently in one or two sampling steps while maintaining high audio quality and high similarity to the audio prompt for zero-shot speech generation. Our experimental results demonstrate the superior performance of FlashSpeech. Notably, FlashSpeech is about 20 times faster than other zero-shot speech synthesis systems while maintaining comparable voice quality and similarity. Furthermore, FlashSpeech demonstrates its versatility by efficiently performing tasks like voice conversion, speech editing, and diverse speech sampling. Audio samples can be found at https://flashspeech.github.io/.

Figure: The architecture of FlashSpeech, featuring a codec encoder/decoder and a latent consistency model.

Overview

  • FlashSpeech presents a novel approach to efficient zero-shot speech synthesis, significantly reducing the time and computational resources needed compared to traditional methods.

  • The system uses a unique adversarial consistency training method alongside a Latent Consistency Model (LCM), eliminating the reliance on pre-trained diffusion models and enabling fast, high-quality audio synthesis.

  • FlashSpeech incorporates a prosody generator that enhances the natural rhythm of speech without affecting stability, adapting well to various tasks like voice conversion and speech editing.

  • With its rapid synthesis, FlashSpeech is well suited to real-time applications, markedly lowering hardware requirements and operational costs, and its training approach offers insights into optimizations for other generative models.

Accelerating Large-Scale Zero-Shot Speech Synthesis with FlashSpeech

Introduction

Speech synthesis technologies have made significant strides but remain hindered by the substantial computation and inference time that most state-of-the-art methods require. To address these issues, this paper introduces FlashSpeech, a new approach to efficient zero-shot speech synthesis. FlashSpeech combines a Latent Consistency Model (LCM) with an innovative adversarial consistency training method, eliminating the need for a pre-trained diffusion model as a teacher. It synthesizes speech in one or two sampling steps with high audio quality and speaker similarity, yielding roughly 20 times faster inference than existing systems.
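
To make the one- or two-step generation concrete, below is a minimal sketch of standard few-step consistency-model sampling, the mechanism FlashSpeech's LCM relies on. The function `f`, the noise levels, and the latent shape are illustrative assumptions; the paper's actual conditioning on phoneme and prompt encodings is omitted for brevity.

```python
import torch

def sample_lcm(f, shape, sigma_max=80.0, sigma_mid=2.0, sigma_min=0.002, steps=2):
    """Generate a latent in `steps` (1 or 2) evaluations of a consistency model f.

    f(x, sigma) is trained to map any point on the noising trajectory
    directly back to the clean latent (the consistency property).
    """
    batch = shape[0]
    # Start from pure noise at the highest noise level.
    x = torch.randn(shape) * sigma_max
    x0 = f(x, torch.full((batch,), sigma_max))  # one-step estimate

    if steps == 2:
        # Re-noise the estimate to an intermediate level, then denoise once more.
        z = torch.randn_like(x0)
        x_mid = x0 + (sigma_mid**2 - sigma_min**2) ** 0.5 * z
        x0 = f(x_mid, torch.full((batch,), sigma_mid))
    return x0
```

Each extra step trades a little latency for quality; the paper reports that one or two steps already match the quality of much slower iterative samplers.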

Key Contributions

  • FlashSpeech System: An efficient zero-shot speech synthesis system with drastically reduced inference time and computational requirements.
  • Adversarial Consistency Training: A training methodology that combines consistency training with adversarial training, using pre-trained speech language models as discriminators and removing the need for a pre-trained diffusion teacher.
  • Prosody Generation: Introduction of a prosody generator module that enhances the diversity of prosody, achieving natural rhythm in synthesized speech without compromising stability (a sketch of this stability/diversity trade-off follows this list).
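
One plausible reading of this design, sketched below under stated assumptions: a deterministic regressor provides a stable prosody baseline, while a sampled residual scaled by a factor `alpha` injects diversity. The names `regressor` and `sampler` and the blending rule are hypothetical illustrations, not the paper's exact formulation.

```python
import torch

def generate_prosody(regressor, sampler, phoneme_emb, alpha=0.3):
    # Stable baseline: deterministic pitch/duration prediction per phoneme.
    base = regressor(phoneme_emb)
    # Stochastic variation, e.g., drawn from a lightweight consistency model.
    residual = sampler(phoneme_emb)
    # alpha trades diversity (higher) against stability (lower).
    return base + alpha * residual
```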

System Architecture and Training

FlashSpeech incorporates a prosody generator alongside a novel consistency model, conditioned on prior vectors obtained from a phoneme encoder and a prompt encoder. The adversarial consistency training employs pre-trained speech models as discriminators, allowing the LCM to be trained efficiently from scratch. At inference, the system needs only a constant number of network evaluations, O(1) (one or two sampling steps), rather than the many iterations of diffusion samplers or the token-by-token decoding of autoregressive models.
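
To illustrate how the two training signals can be combined, here is a hedged sketch of a single generator update under adversarial consistency training. The names (`student`, `ema_student`, `disc_head`, `ssl_backbone`) and the loss weighting are assumptions; `ssl_backbone` stands in for the frozen pre-trained speech-model features the paper uses for its discriminators, and the separate discriminator update on real vs. generated samples is omitted.

```python
import torch
import torch.nn.functional as F

def generator_step(student, ema_student, disc_head, ssl_backbone,
                   x0, sigma_n, sigma_n1, adv_weight=0.1):
    """One update combining a consistency loss with an adversarial term.

    x0: clean latents; sigma_n < sigma_n1 are adjacent noise levels.
    ssl_backbone is a frozen pre-trained speech model (requires_grad=False),
    so gradients still flow to the generator through its input.
    """
    noise = torch.randn_like(x0)
    x_n1 = x0 + sigma_n1 * noise      # sample at the higher noise level
    x_n = x0 + sigma_n * noise        # same noise at the adjacent lower level

    pred = student(x_n1, sigma_n1)    # student denoises from sigma_{n+1}
    with torch.no_grad():
        target = ema_student(x_n, sigma_n)  # EMA target from sigma_n

    # Consistency loss: outputs along the same trajectory should agree.
    loss_consistency = F.mse_loss(pred, target)

    # Adversarial loss: a small head on frozen speech features scores the
    # generated latent (non-saturating GAN loss for the generator).
    logits_fake = disc_head(ssl_backbone(pred))
    loss_adv = F.softplus(-logits_fake).mean()

    return loss_consistency + adv_weight * loss_adv  # weighting is an assumption
```

Training from scratch this way avoids the usual two-stage distillation pipeline in which a diffusion teacher must first be trained to convergence.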

Experimental Results

The experimental evaluations underscore FlashSpeech's superior synthesis speed over other zero-shot speech synthesis systems while maintaining comparable voice quality and speaker similarity. The system also proves robust across varied tasks such as voice conversion, speech editing, and diverse speech sampling, with clear applications in real-world scenarios like virtual assistants and interactive educational content.

Practical and Theoretical Implications

Practically, FlashSpeech's speed and efficiency facilitate real-time speech synthesis applications, reducing hardware demands and operational costs markedly. Theoretically, the introduction of adversarial consistency training provides a novel way of leveraging pre-trained models for speech synthesis tasks, offering potential insights into model training optimizations across other domains of generative modeling.

Future Directions

Continuing this line of research could involve scaling the system to handle more extensive datasets and more languages, which might further refine its capabilities in capturing nuances in speech. Additionally, further exploration into refining the adversarial consistency training could yield even more efficient training methodologies or adaptations to other forms of media like music or environmental sounds.

In conclusion, FlashSpeech sets a new precedent for speed and efficiency in speech synthesis while maintaining high standards for audio quality and speaker similarity, marking a significant step forward in the practical deployment of zero-shot speech synthesis technologies.
