Emilia: An Extensive, Multilingual, and Diverse Speech Dataset for Large-Scale Speech Generation (2407.05361v3)

Published 7 Jul 2024 in eess.AS and cs.CL

Abstract: Recent advancements in speech generation models have been significantly driven by the use of large-scale training data. However, producing highly spontaneous, human-like speech remains a challenge due to the scarcity of large, diverse, and spontaneous speech datasets. In response, we introduce Emilia, the first large-scale, multilingual, and diverse speech generation dataset. Emilia starts with over 101k hours of speech across six languages, covering a wide range of speaking styles to enable more natural and spontaneous speech generation. To facilitate the scale-up of Emilia, we also present Emilia-Pipe, the first open-source preprocessing pipeline designed to efficiently transform raw, in-the-wild speech data into high-quality training data with speech annotations. Experimental results demonstrate the effectiveness of both Emilia and Emilia-Pipe. Demos are available at: https://emilia-dataset.github.io/Emilia-Demo-Page/.

Citations (12)

Summary

  • The paper introduces Emilia and its open-source pipeline to transform raw, in-the-wild audio into annotated multilingual speech data.
  • It details a methodology that integrates source separation, VAD, ASR, and filtering to ensure high-quality, diverse speech suitable for TTS.
  • Experimental results demonstrate that models trained on Emilia perform on par with or superior to those trained on traditional datasets like MLS.

Overview of "Emilia: An Extensive, Multilingual, and Diverse Speech Dataset for Large-Scale Speech Generation"

The paper presents "Emilia," a large-scale, multilingual dataset designed to enhance the capabilities of speech generation models. This dataset is derived from in-the-wild speech data, marking a significant step towards achieving more spontaneous, varied, and human-like speech synthesis. In concert with this dataset, the authors introduce "Emilia-Pipe," an open-source preprocessing pipeline capable of rapidly and efficiently converting raw, unstructured audio data into training-ready datasets with necessary annotations.

Dataset Construction and Preprocessing Pipeline

Emilia addresses a pressing challenge in speech generation: the insufficiency of diverse and spontaneous speech data. Traditional datasets, primarily derived from audiobooks, fail to capture the natural variability and spontaneity found in real-world conversations, making it difficult for generative models to reproduce human-like speech accurately.

The dataset comprises over 101,000 hours of speech across six languages: English, Chinese, German, French, Japanese, and Korean. This diversity is a crucial advantage for training robust multilingual and spontaneous speech generation models. Emilia-Pipe, the preprocessing pipeline, addresses the limitations of existing methods by incorporating efficient standardization, source separation, speaker diarization, voice activity detection (VAD), automated speech recognition (ASR), and filtering processes. These steps refine the raw speech into high-quality, annotated data suitable for model training.
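The stages listed above can be sketched as a simple sequential pipeline. The function and class names below are hypothetical placeholders, not the actual Emilia-Pipe API; each stage is stubbed out to show only the order of operations and the shape of the resulting annotated data:

```python
from dataclasses import dataclass

@dataclass
class Utterance:
    audio: list          # placeholder for waveform samples
    speaker: str = ""
    text: str = ""

def standardize(raw):          # resample / loudness-normalize (stub)
    return raw

def separate_vocals(audio):    # remove background music and noise (stub)
    return audio

def diarize(audio):            # split by speaker (stub: one speaker)
    return [("spk0", audio)]

def vad_segments(audio):       # keep voiced regions (stub: whole clip)
    return [audio]

def transcribe(audio):         # ASR transcript (stub)
    return "hello world"

def passes_filters(utt):       # quality / language checks (stub)
    return len(utt.audio) > 0

def emilia_pipe(raw_audio):
    """Hypothetical sketch of the six preprocessing stages, in order."""
    audio = separate_vocals(standardize(raw_audio))
    utterances = []
    for speaker, track in diarize(audio):
        for seg in vad_segments(track):
            utt = Utterance(audio=seg, speaker=speaker, text=transcribe(seg))
            if passes_filters(utt):
                utterances.append(utt)
    return utterances

print(len(emilia_pipe([0.1, 0.2, 0.3])))  # → 1
```

In the real pipeline each stub would wrap a dedicated model (e.g. a source-separation network, a diarization system, an ASR model), but the control flow stays this simple: refine, segment, annotate, filter.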

Running on a server equipped with NVIDIA RTX 4090 GPUs, Emilia-Pipe processes raw speech at a rate of approximately 2.5 hours of audio per minute of wall-clock time. After processing, the dataset achieves a DNSMOS P.835 OVRL score of 3.26, indicating quality comparable to existing high-grade datasets.
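A back-of-envelope calculation puts the reported throughput in perspective. Assuming the roughly 2.5 hours of audio per minute holds across the full corpus, one such server could process all 101,000 hours in about a month:

```python
# Back-of-envelope: wall-clock time to process the full corpus on one
# server at the reported rate of ~2.5 hours of audio per minute.
corpus_hours = 101_000
rate_audio_hours_per_min = 2.5

minutes = corpus_hours / rate_audio_hours_per_min
days = minutes / 60 / 24
print(f"{minutes:.0f} min ≈ {days:.1f} days")  # 40400 min ≈ 28.1 days
```

This is an idealized estimate (it ignores I/O, scheduling, and failures), but it illustrates why the pipeline's efficiency matters for scaling the dataset further.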

Experimental Validation

To validate Emilia's effectiveness, the authors conducted extensive experiments comparing TTS models trained on Emilia against those trained on the Multilingual LibriSpeech (MLS) dataset. Models trained on Emilia performed comparably, if not better, in generating spontaneous and diverse speech, as measured by word error rate (WER), speaker similarity (SIM-O), Fréchet Speech Distance (FSD), and comparative and similarity mean opinion scores (CMOS and SMOS).
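Of these metrics, WER is the most mechanical: an ASR system transcribes the generated speech, and the transcript is scored against the reference text by word-level edit distance. A minimal, self-contained implementation (not the authors' evaluation code) looks like this:

```python
def wer(ref: str, hyp: str) -> float:
    """Word error rate: word-level edit distance divided by reference length."""
    r, h = ref.split(), hyp.split()
    # dp[i][j] = edits needed to turn the first i ref words into the first j hyp words
    dp = [[0] * (len(h) + 1) for _ in range(len(r) + 1)]
    for i in range(len(r) + 1):
        dp[i][0] = i                      # deletions
    for j in range(len(h) + 1):
        dp[0][j] = j                      # insertions
    for i in range(1, len(r) + 1):
        for j in range(1, len(h) + 1):
            sub = dp[i - 1][j - 1] + (r[i - 1] != h[j - 1])
            dp[i][j] = min(sub, dp[i - 1][j] + 1, dp[i][j - 1] + 1)
    return dp[len(r)][len(h)] / len(r)

print(wer("the cat sat on the mat", "the cat sit on mat"))  # 2 edits / 6 words ≈ 0.33
```

SIM-O, FSD, CMOS, and SMOS each require either a pretrained speaker/feature model or human raters, so they cannot be reduced to a short snippet in the same way.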

In the English-only experiment, models trained on Emilia demonstrated similar levels of intelligibility and speaker similarity compared to those trained on MLS. Notably, the SoundStorm-based TTS model benefited significantly from the diverse speaking styles in Emilia, particularly in terms of FSD and CMOS on spontaneous speech data.

Moreover, the multilingual experiment affirmed that models trained on the full Emilia dataset excel in zero-shot multilingual TTS capabilities, reinforcing the dataset's utility for broad-based, high-quality speech generation.

Implications for Future Research

The implications of this research are multifaceted. Practically, the extensive and diverse Emilia dataset equips the research community with the tools to advance multilingual and spontaneous speech generation. Theoretically, the demonstrated effectiveness of Emilia underscores the importance of training data diversity in model performance, particularly for tasks requiring nuanced generation abilities such as TTS.

Looking forward, the open-source nature of Emilia-Pipe encourages collaborative advancements in preprocessing techniques and dataset expansion. This pipeline provides a scalable solution for transforming vast, raw speech corpora into valuable training data. The potential future applications of such a dataset extend beyond TTS to areas like speech recognition, speaker verification, and emotion detection in speech, opening new avenues for robust AI development.

Conclusion

The paper's contributions, Emilia and Emilia-Pipe, represent significant strides towards enriching the diversity and spontaneity of speech generation datasets. By providing a robust, powerful toolkit for generating high-quality training data from in-the-wild speech, this work lays a critical foundation for future advancements in multilingual and human-like speech synthesis. The outcomes validate the necessity for diverse datasets and innovative preprocessing techniques, promoting a leap forward in the capabilities of speech generation models.
