- The paper introduces Emilia and its open-source pipeline to transform raw, in-the-wild audio into annotated multilingual speech data.
- It details a methodology that integrates source separation, VAD, ASR, and filtering to ensure high-quality, diverse speech suitable for TTS.
- Experimental results demonstrate that models trained on Emilia perform on par with or superior to those trained on traditional datasets like MLS.
Overview of "Emilia: An Extensive, Multilingual, and Diverse Speech Dataset for Large-Scale Speech Generation"
The paper presents "Emilia," a large-scale, multilingual dataset designed to enhance the capabilities of speech generation models. This dataset is derived from in-the-wild speech data, marking a significant step towards achieving more spontaneous, varied, and human-like speech synthesis. Alongside the dataset, the authors introduce "Emilia-Pipe," an open-source preprocessing pipeline capable of efficiently converting raw, unstructured audio into training-ready datasets with the necessary annotations.
Dataset Construction and Preprocessing Pipeline
Emilia addresses a pressing challenge in speech generation: the insufficiency of diverse and spontaneous speech data. Traditional datasets, primarily derived from audiobooks, fail to capture the natural variability and spontaneity found in real-world conversations, making it difficult for generative models to reproduce human-like speech accurately.
The dataset comprises over 101,000 hours of speech across six languages: English, Chinese, German, French, Japanese, and Korean. This diversity is a crucial advantage for training robust multilingual and spontaneous speech generation models. Emilia-Pipe, the preprocessing pipeline, addresses the limitations of existing methods by incorporating efficient standardization, source separation, speaker diarization, voice activity detection (VAD), automated speech recognition (ASR), and filtering processes. These steps refine the raw speech into high-quality, annotated data suitable for model training.
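The six stages above form a sequential pipeline. The following is a minimal structural sketch of that flow; the stage bodies are hypothetical placeholders (the actual Emilia-Pipe relies on dedicated models for each step), and the sample rate and quality threshold shown are assumptions for illustration, not values from the paper:

```python
# Structural sketch of a six-stage preprocessing pipeline in the spirit of
# Emilia-Pipe. Stage internals are placeholders, NOT the paper's actual models.

def standardize(item):
    # Convert audio to a uniform format; the target rate here is an assumption.
    item["sample_rate"] = 24_000
    return item

def separate_sources(item):
    # Remove background music/noise, keeping vocals (placeholder flag).
    item["vocals_only"] = True
    return item

def diarize(item):
    # Attribute speech to speakers (placeholder single-speaker label).
    item["speaker"] = "spk_0"
    return item

def detect_voice_activity(item):
    # Split into speech segments; here we trivially keep the whole clip.
    item["segments"] = [(0.0, item["duration"])]
    return item

def transcribe(item):
    # ASR would produce a real transcript; placeholder string here.
    item["transcript"] = "<asr output>"
    return item

def filter_quality(item, min_score=3.0):
    # Drop clips below a quality threshold (a DNSMOS-style score is assumed).
    return item if item.get("score", 0.0) >= min_score else None

PIPELINE = [standardize, separate_sources, diarize,
            detect_voice_activity, transcribe, filter_quality]

def run_pipeline(item):
    for stage in PIPELINE:
        item = stage(item)
        if item is None:  # clip was filtered out
            return None
    return item

# A clip above the quality threshold survives all six stages...
result = run_pipeline({"duration": 4.2, "score": 3.26})
# ...while a low-quality clip is discarded by the final filtering stage.
rejected = run_pipeline({"duration": 1.0, "score": 2.0})
```

The design point this sketch captures is that each stage enriches the annotation of a clip or rejects it outright, so only high-quality, fully annotated speech reaches the training set.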
An independent server equipped with NVIDIA RTX 4090 GPUs processes raw speech at a rate of approximately 2.5 hours of audio per minute of processing time, demonstrating the efficiency of Emilia-Pipe. After processing, the dataset achieves a DNSMOS P.835 OVRL score of 3.26, indicating quality comparable to existing high-grade datasets.
Experimental Validation
To validate Emilia's effectiveness, the authors conducted extensive experiments comparing TTS models trained on both Emilia and the Multilingual LibriSpeech (MLS) dataset. Models trained on Emilia performed comparably, if not better, in generating spontaneous and diverse speech, as highlighted by metrics such as WER, SIM-O, FSD, CMOS, and SMOS.
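Of these metrics, WER (word error rate) is the standard intelligibility measure: the word-level edit distance between the ASR transcript of generated speech and the reference text, normalized by reference length. A short dynamic-programming routine reproduces it (a generic Levenshtein-over-words sketch, not the paper's exact scoring code):

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / reference word count."""
    ref, hyp = reference.split(), hypothesis.split()
    # Levenshtein distance over words via dynamic programming:
    # d[i][j] = edit distance between ref[:i] and hyp[:j].
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i  # delete all i reference words
    for j in range(len(hyp) + 1):
        d[0][j] = j  # insert all j hypothesis words
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,          # deletion
                          d[i][j - 1] + 1,          # insertion
                          d[i - 1][j - 1] + cost)   # substitution or match
    return d[len(ref)][len(hyp)] / len(ref)
```

For example, `word_error_rate("the cat sat", "the cat sit")` is one substitution over three reference words, i.e. 1/3. The other metrics (SIM-O for speaker similarity, FSD for distributional diversity, CMOS/SMOS for subjective quality and similarity) require embedding models or human raters and are not reducible to a snippet like this.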
In the English-only experiment, models trained on Emilia demonstrated similar levels of intelligibility and speaker similarity compared to those trained on MLS. Notably, the SoundStorm-based TTS model benefited significantly from the diverse speaking styles in Emilia, particularly in terms of FSD and CMOS on spontaneous speech data.
Moreover, the multilingual experiment affirmed that models trained on the full Emilia dataset excel in zero-shot multilingual TTS capabilities, reinforcing the dataset's utility for broad-based, high-quality speech generation.
Implications for Future Research
The implications of this research are multifaceted. Practically, the extensive and diverse Emilia dataset equips the research community with the tools to advance multilingual and spontaneous speech generation. Theoretically, the demonstrated effectiveness of Emilia underscores the importance of training data diversity in model performance, particularly for tasks requiring nuanced generation abilities such as TTS.
Looking forward, the open-source nature of Emilia-Pipe encourages collaborative advancements in preprocessing techniques and dataset expansion. This pipeline provides a scalable solution for transforming vast, raw speech corpora into valuable training data. The potential future applications of such a dataset extend beyond TTS to areas like speech recognition, speaker verification, and emotion detection in speech, opening new avenues for robust AI development.
Conclusion
The paper's contributions, Emilia and Emilia-Pipe, represent significant strides towards enriching the diversity and spontaneity of speech generation datasets. By providing a robust toolkit for producing high-quality training data from in-the-wild speech, this work lays a critical foundation for future advancements in multilingual and human-like speech synthesis. The outcomes confirm the importance of diverse datasets and scalable preprocessing techniques, pushing forward the capabilities of speech generation models.