WenetSpeech4TTS: A 12,800-hour Mandarin TTS Corpus for Large Speech Generation Model Benchmark (2406.05763v3)

Published 9 Jun 2024 in eess.AS

Abstract: With the development of large text-to-speech (TTS) models and the scale-up of training data, state-of-the-art TTS systems have achieved impressive performance. In this paper, we present WenetSpeech4TTS, a multi-domain Mandarin corpus derived from the open-sourced WenetSpeech dataset. Tailored for text-to-speech tasks, we refined WenetSpeech by adjusting segment boundaries, enhancing the audio quality, and eliminating speaker mixing within each segment. Following a more accurate transcription process and a quality-based data filtering process, the resulting WenetSpeech4TTS corpus contains $12,800$ hours of paired audio-text data. Furthermore, we have created subsets of varying sizes, categorized by segment quality scores, to allow for TTS model training and fine-tuning. VALL-E and NaturalSpeech 2 systems are trained and fine-tuned on these subsets to validate the usability of WenetSpeech4TTS, establishing baselines on this benchmark for fair comparison of TTS systems. The corpus and corresponding benchmarks are publicly available on Hugging Face.
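The abstract describes partitioning segments into quality-tiered training subsets based on per-segment quality scores. A minimal sketch of that idea follows; the subset names, score thresholds, and segment metadata format here are illustrative assumptions, not the paper's exact values.

```python
# Hypothetical sketch of quality-based subset assignment.
# Each segment carries a quality score (e.g., an estimated MOS);
# thresholds and tier names below are illustrative only.

TIERS = ((4.0, "Premium"), (3.6, "Standard"), (3.2, "Basic"))

def assign_subset(score, tiers=TIERS):
    """Return the name of the highest tier whose cutoff the score meets,
    or None if the segment falls below every cutoff (filtered out)."""
    for cutoff, name in tiers:
        if score >= cutoff:
            return name
    return None

# Toy segment metadata (ids and scores are made up for illustration).
segments = [
    {"id": "seg001", "score": 4.2},
    {"id": "seg002", "score": 3.7},
    {"id": "seg003", "score": 2.9},
]

subsets = {}
for seg in segments:
    tier = assign_subset(seg["score"])
    if tier is not None:
        subsets.setdefault(tier, []).append(seg["id"])
```

In practice such tiers can be made nested (a higher-quality subset contained in the larger ones) so that models can be trained on the large set and fine-tuned on the cleanest portion, as the abstract's training/fine-tuning split suggests.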

Authors (10)
  1. Linhan Ma
  2. Dake Guo
  3. Kun Song
  4. Yuepeng Jiang
  5. Shuai Wang
  6. Liumeng Xue
  7. Weiming Xu
  8. Huan Zhao
  9. Binbin Zhang
  10. Lei Xie
Citations (15)
