Abstract

Recently, speech generation models have made significant progress by using large-scale training data. However, the research community struggles to produce highly spontaneous and human-like speech due to the lack of large-scale, diverse, and spontaneous speech data. This paper presents Emilia, the first multilingual speech generation dataset derived from in-the-wild speech data, and Emilia-Pipe, the first open-source preprocessing pipeline designed to transform in-the-wild speech data into high-quality training data with annotations for speech generation. Emilia starts with over 101k hours of speech in six languages and features diverse speech with varied speaking styles. To facilitate the scale-up of Emilia, the open-source Emilia-Pipe can process one hour of raw speech data into training-ready form in a few minutes, enabling the research community to collaborate on large-scale speech generation research. Experimental results validate the effectiveness of Emilia. Demos are available at: https://emilia-dataset.github.io/Emilia-Demo-Page/.

Figure: Acoustic and semantic diversity of the Emilia and MLS datasets.

Overview

  • The paper introduces 'Emilia,' a large-scale multilingual speech dataset derived from in-the-wild speech data, aimed at enhancing speech generation models.

  • 'Emilia-Pipe,' an open-source preprocessing pipeline, is developed to efficiently convert raw audio data into annotated, training-ready datasets.

  • Experimental validation shows that TTS models trained on Emilia perform comparably to, or better than, those trained on the Multilingual LibriSpeech dataset, especially in generating spontaneous and diverse speech.

Overview of "Emilia: An Extensive, Multilingual, and Diverse Speech Dataset for Large-Scale Speech Generation"

The paper presents "Emilia," a large-scale, multilingual dataset designed to enhance the capabilities of speech generation models. This dataset is derived from in-the-wild speech data, marking a significant step towards achieving more spontaneous, varied, and human-like speech synthesis. In concert with this dataset, the authors introduce "Emilia-Pipe," an open-source preprocessing pipeline capable of rapidly and efficiently converting raw, unstructured audio data into training-ready datasets with necessary annotations.

Dataset Construction and Preprocessing Pipeline

Emilia addresses a pressing challenge in speech generation: the insufficiency of diverse and spontaneous speech data. Traditional datasets, primarily derived from audiobooks, fail to capture the natural variability and spontaneity found in real-world conversations, making it difficult for generative models to reproduce human-like speech accurately.

The dataset comprises over 101,000 hours of speech across six languages: English, Chinese, German, French, Japanese, and Korean. This diversity is a crucial advantage for training robust multilingual and spontaneous speech generation models. Emilia-Pipe, the preprocessing pipeline, addresses the limitations of existing methods by incorporating efficient standardization, source separation, speaker diarization, voice activity detection (VAD), automatic speech recognition (ASR), and filtering processes. These steps refine the raw speech into high-quality, annotated data suitable for model training.
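These stages run in a fixed order: separation and diarization yield clean single-speaker audio before VAD segments it and ASR transcribes it. The sketch below illustrates that orchestration in Python; every stage function is a hypothetical stub standing in for one of the pipeline's components, not the actual Emilia-Pipe API.

```python
# Illustrative orchestration of the preprocessing stages described above.
# Each stub stands in for an open-source component (source separation,
# diarization, VAD, ASR, quality filtering); none is the real Emilia-Pipe API.

def standardize(path: str) -> bytes: ...          # resample/normalize to a common format
def separate_sources(audio: bytes) -> bytes: ...  # keep vocals, drop music/background noise
def diarize(vocals: bytes) -> list[bytes]: ...    # split audio by speaker identity
def vad(segment: bytes) -> list[bytes]: ...       # trim silence into utterance-level chunks
def asr(utterance: bytes) -> str: ...             # transcribe one utterance
def passes_filter(item: dict) -> bool: ...        # quality gate, e.g. a DNSMOS threshold

def emilia_pipe(raw_audio_path: str) -> list[dict]:
    """Turn one in-the-wild recording into annotated training segments."""
    audio = standardize(raw_audio_path)
    vocals = separate_sources(audio)
    utterances = [u for seg in diarize(vocals) for u in vad(seg)]
    annotated = [{"audio": u, "text": asr(u)} for u in utterances]
    return [item for item in annotated if passes_filter(item)]
```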

An independent server equipped with NVIDIA RTX 4090 GPUs runs Emilia-Pipe at a rate of approximately 2.5 hours of raw speech per minute, demonstrating the pipeline's efficiency. After processing, the dataset achieves a DNSMOS P.835 OVRL score of 3.26, indicating quality on par with existing high-grade datasets.
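At that throughput, a back-of-envelope calculation shows how long one such server would need to process the entire corpus (a rough estimate assuming the stated rate holds end to end):

```python
# Back-of-envelope: wall-clock time to process the full corpus at the
# stated throughput of ~2.5 hours of speech per minute of processing.
corpus_hours = 101_000                   # total speech in Emilia
throughput = 2.5                         # hours of speech processed per minute
minutes = corpus_hours / throughput      # 40,400 minutes
days = minutes / 60 / 24                 # ~28 days on a single server
print(f"{minutes:,.0f} min ≈ {days:.0f} days")
```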

Experimental Validation

To validate Emilia's effectiveness, the authors conducted extensive experiments comparing TTS models trained on Emilia with those trained on the Multilingual LibriSpeech (MLS) dataset. Models trained on Emilia performed comparably, if not better, in generating spontaneous and diverse speech, as measured by word error rate (WER), objective speaker similarity (SIM-O), Fréchet Speech Distance (FSD), comparative mean opinion score (CMOS), and similarity mean opinion score (SMOS).
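As an illustration of the intelligibility metric, WER is typically computed by transcribing the generated audio with an ASR system and comparing the transcript against the reference text. The snippet below uses the jiwer library for the comparison step; this evaluation stack is an assumption for illustration, not necessarily the paper's exact setup.

```python
# Minimal WER illustration using the jiwer library (pip install jiwer).
# Mirrors the usual TTS evaluation recipe: run ASR on the generated audio,
# then score the transcript against the reference text.
import jiwer

reference = "the quick brown fox jumps over the lazy dog"
hypothesis = "the quick brown fox jumped over a lazy dog"  # ASR output of generated audio

error_rate = jiwer.wer(reference, hypothesis)
print(f"WER: {error_rate:.2%}")  # fraction of word-level edits needed
```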

In the English-only experiment, models trained on Emilia demonstrated similar levels of intelligibility and speaker similarity compared to those trained on MLS. Notably, the SoundStorm-based TTS model benefited significantly from the diverse speaking styles in Emilia, particularly in terms of FSD and CMOS on spontaneous speech data.

Moreover, the multilingual experiment affirmed that models trained on the full Emilia dataset excel in zero-shot multilingual TTS capabilities, reinforcing the dataset's utility for broad-based, high-quality speech generation.

Implications for Future Research

The implications of this research are multifaceted. Practically, the extensive and diverse Emilia dataset equips the research community with the tools to advance multilingual and spontaneous speech generation. Theoretically, the demonstrated effectiveness of Emilia underscores the importance of training data diversity in model performance, particularly for tasks requiring nuanced generation abilities such as TTS.

Looking forward, the open-source nature of Emilia-Pipe encourages collaborative advancements in preprocessing techniques and dataset expansion. This pipeline provides a scalable solution for transforming vast, raw speech corpora into valuable training data. The potential future applications of such a dataset extend beyond TTS to areas like speech recognition, speaker verification, and emotion detection in speech, opening new avenues for robust AI development.

Conclusion

The paper's contributions, Emilia and Emilia-Pipe, represent significant strides towards enriching the diversity and spontaneity of speech generation datasets. By providing a robust, powerful toolkit for generating high-quality training data from in-the-wild speech, this work lays a critical foundation for future advancements in multilingual and human-like speech synthesis. The outcomes validate the necessity for diverse datasets and innovative preprocessing techniques, promoting a leap forward in the capabilities of speech generation models.
