CSS10: A Collection of Single Speaker Speech Datasets for 10 Languages (1903.11269v3)

Published 27 Mar 2019 in cs.CL

Abstract: We describe our development of CSS10, a collection of single speaker speech datasets for ten languages. It is composed of short audio clips from LibriVox audiobooks and their aligned texts. To validate its quality we train two neural text-to-speech models on each dataset. Subsequently, we conduct Mean Opinion Score tests on the synthesized speech samples. We make our datasets, pre-trained models, and test resources publicly available. We hope they will be used for future speech tasks.

Authors (2)
  1. Kyubyong Park (12 papers)
  2. Thomas Mulc (4 papers)
Citations (98)

Summary

  • The paper introduces CSS10, offering publicly available single-speaker speech datasets in 10 diverse languages for TTS research.
  • It employs a meticulous methodology using LibriVox audiobook segmentation and precise text alignment to ensure data quality.
  • The evaluation with Tacotron and DCTTS models via MOS tests validates the datasets’ usability for multilingual TTS.

Overview of the CSS10 Speech Dataset Paper

The paper introduces CSS10, a diverse collection of single-speaker speech datasets covering ten languages. This work addresses the scarcity of freely available, high-quality non-English speech datasets, which poses a significant challenge to the broader research community, particularly in Text-to-Speech (TTS) research. The datasets are derived from LibriVox audiobooks and meticulously aligned with their texts to ensure accuracy.

Motivations and Contributions

The motivation behind CSS10 stems from the lack of accessible benchmark datasets in TTS research, especially for languages other than English. Most TTS models have relied on proprietary datasets, making it difficult for researchers outside large institutions to reproduce results or compare models fairly. Furthermore, TTS research has focused predominantly on English, leaving a substantial gap in multilingual coverage.

The contributions of this paper are twofold:

  1. The creation of single-speaker speech datasets for ten languages—Chinese, Dutch, French, Finnish, German, Greek, Hungarian, Japanese, Russian, and Spanish.
  2. The evaluation of these datasets using two established neural TTS models, Tacotron and DCTTS, via Mean Opinion Score (MOS) tests.

These resources and evaluations are made publicly available, inviting the community to leverage them for further research and development.

Methodology and Dataset Construction

The datasets were constructed from LibriVox, a platform of public-domain audiobooks in many languages. Languages with fewer than four hours of single-speaker recordings were excluded to ensure enough material for training TTS models. Audio processing involved segmenting long audiobook recordings into short clips whose lengths are suited to efficient TTS training.
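
As a rough illustration of this segmentation step (not the authors' exact pipeline), the sketch below splits a long recording at silences and keeps only clips within a length range suitable for training. The use of pydub, the file name, and the thresholds are assumptions chosen for clarity.

```python
# Hypothetical sketch: split a long LibriVox chapter into short clips at silences,
# keeping only clips in a length range suitable for TTS training.
# The thresholds below are illustrative, not the values used in the paper.
from pydub import AudioSegment
from pydub.silence import split_on_silence

def segment_chapter(path, min_len_ms=2000, max_len_ms=12000):
    audio = AudioSegment.from_file(path)
    chunks = split_on_silence(
        audio,
        min_silence_len=500,                 # pause length treated as a clip boundary (ms)
        silence_thresh=audio.dBFS - 16,      # silence threshold relative to average loudness
        keep_silence=200,                    # keep a little silence at each clip edge
    )
    # Discard clips that are too short or too long for efficient TTS training.
    return [c for c in chunks if min_len_ms <= len(c) <= max_len_ms]

clips = segment_chapter("librivox_chapter_01.mp3")  # hypothetical input file
for i, clip in enumerate(clips):
    clip.export(f"clip_{i:04d}.wav", format="wav")
```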

Text processing was equally meticulous. Each audio clip was manually aligned with its text segment, and comprehensive text normalization was applied, including phonetic transcription for languages whose scripts are not phonetic, such as Chinese and Japanese: Chinese text was converted to pinyin, and Japanese text was romanized using MeCab and romkan.
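
The sketch below illustrates the kind of phonetic normalization described. The paper names MeCab and romkan for Japanese; pypinyin is an assumed stand-in for the Chinese-to-pinyin step, and the authors' actual procedure may differ in detail.

```python
# Illustrative sketch of phonetic normalization for non-phonetic scripts.
# pypinyin is an assumption (the paper only says the text was "converted to pinyin");
# MeCab and romkan are the tools the paper names for Japanese.
from pypinyin import lazy_pinyin, Style   # Chinese characters -> pinyin syllables
import MeCab                              # Japanese morphological analysis / readings
import romkan                             # kana -> romaji

def chinese_to_pinyin(text):
    # Tone numbers appended to each syllable, e.g. "ni3 hao3".
    return " ".join(lazy_pinyin(text, style=Style.TONE3))

def japanese_to_romaji(text):
    # "-Oyomi" asks MeCab (with an ipadic-style dictionary) for the katakana reading.
    reading = MeCab.Tagger("-Oyomi").parse(text).strip()
    return romkan.to_roma(reading)

print(chinese_to_pinyin("你好"))        # e.g. "ni3 hao3"
print(japanese_to_romaji("こんにちは"))  # e.g. "konnichiha"
```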

Evaluation and Results

The datasets were evaluated by training Tacotron and DCTTS models on each language set, followed by a MOS evaluation to assess the naturalness and pronunciation accuracy of the synthesized speech. Notably, DCTTS generally produced higher MOS scores for naturalness in languages like German, French, and Spanish, while pronunciation scores were relatively similar between the two models. The Greek dataset presented challenges due to its smaller size, impacting Tacotron's performance.
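
For readers unfamiliar with MOS, the snippet below shows how such scores are typically aggregated: each synthesized sample is rated on a 1 to 5 scale by several listeners, and the mean with a confidence interval is reported per model and language. The ratings shown are hypothetical, not taken from the paper.

```python
# Minimal sketch of MOS aggregation from listener ratings (illustrative data only).
import numpy as np
from scipy import stats

def mean_opinion_score(ratings, confidence=0.95):
    ratings = np.asarray(ratings, dtype=float)
    mean = ratings.mean()
    # Half-width of the confidence interval using the t-distribution.
    half_width = stats.sem(ratings) * stats.t.ppf((1 + confidence) / 2, len(ratings) - 1)
    return mean, half_width

naturalness_ratings = [4, 3, 5, 4, 4, 3, 4, 5, 3, 4]  # hypothetical listener scores
mos, ci = mean_opinion_score(naturalness_ratings)
print(f"MOS = {mos:.2f} +/- {ci:.2f}")
```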

Implications and Future Directions

The paper’s implications are significant for the advancement of multilingual TTS research. By providing quality datasets in multiple languages, CSS10 enables standardized comparisons and verifications of TTS models across languages, thus facilitating more robust and generalized model development.

Future research possibilities include refining the automatic phonetic transcriptions to enhance model performance and extending the dataset to include additional languages, such as Korean. The authors also highlight the versatility of CSS10, suggesting its applicability in tasks beyond TTS, such as multilingual speech recognition.

In essence, CSS10 stands as a crucial asset for the research community, bridging a gap in multilingual TTS resources and setting a foundation for future advancements in the field.
