WenetSpeech: A 10000+ Hours Multi-domain Mandarin Corpus for Speech Recognition

Published 7 Oct 2021 in cs.SD and cs.CL | (2110.03370v5)

Abstract: In this paper, we present WenetSpeech, a multi-domain Mandarin corpus consisting of 10000+ hours high-quality labeled speech, 2400+ hours weakly labeled speech, and about 10000 hours unlabeled speech, with 22400+ hours in total. We collect the data from YouTube and Podcast, which covers a variety of speaking styles, scenarios, domains, topics, and noisy conditions. An optical character recognition (OCR) based method is introduced to generate the audio/text segmentation candidates for the YouTube data on its corresponding video captions, while a high-quality ASR transcription system is used to generate audio/text pair candidates for the Podcast data. Then we propose a novel end-to-end label error detection approach to further validate and filter the candidates. We also provide three manually labelled high-quality test sets along with WenetSpeech for evaluation -- Dev for cross-validation purpose in training, Test_Net, collected from Internet for matched test, and Test_Meeting, recorded from real meetings for more challenging mismatched test. Baseline systems trained with WenetSpeech are provided for three popular speech recognition toolkits, namely Kaldi, ESPnet, and WeNet, and recognition results on the three test sets are also provided as benchmarks. To the best of our knowledge, WenetSpeech is the current largest open-sourced Mandarin speech corpus with transcriptions, which benefits research on production-level speech recognition.


Authors (12)

Citations (184)


Summary

The paper introduces WenetSpeech, a Mandarin ASR corpus with 10,000+ hours of high-quality labeled speech, assembled using OCR on video captions, ASR transcription, and end-to-end label error detection.
It divides the data into Strong Label, Weak Label, and Others subsets by transcription confidence, covering diverse domains and acoustic conditions.
Baseline systems built with Kaldi, ESPnet, and WeNet provide benchmark results on three test sets that cover both matched and mismatched conditions.

An Analysis of WenetSpeech: A Comprehensive Mandarin Speech Corpus for ASR Systems

The paper "WenetSpeech: A 10000+ Hours Multi-domain Mandarin Corpus for Speech Recognition" presents a large, multi-domain Mandarin corpus designed to advance Automatic Speech Recognition (ASR) systems. WenetSpeech aims to narrow the gap between the data available to large-scale industrial ASR systems and the open-source Mandarin corpora available to everyone else. The corpus contains more than 22,400 hours of Mandarin speech: over 10,000 hours of high-quality labeled data, over 2,400 hours of weakly labeled data, and about 10,000 hours of unlabeled data. The authors describe it as, to the best of their knowledge, the largest open-source Mandarin speech corpus with transcriptions at the time of publication.
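The hour counts quoted in the abstract can be checked with a few lines of arithmetic. The sketch below treats the abstract's figures ("10000+", "2400+", "about 10000") as round lower bounds, so the shares it computes are approximate; the names `SUBSET_HOURS` and `summarize_hours` are made up for this illustration.

```python
# Approximate subset sizes in hours, taken from the abstract.
# The real figures are "10000+", "2400+" and "about 10000", so these are rounded.
SUBSET_HOURS = {
    "high-quality labeled": 10_000,
    "weakly labeled": 2_400,
    "unlabeled": 10_000,
}


def summarize_hours(subset_hours):
    """Return the total hours and each subset's share of the total."""
    total = sum(subset_hours.values())
    shares = {name: hours / total for name, hours in subset_hours.items()}
    return total, shares


total, shares = summarize_hours(SUBSET_HOURS)
print(f"total: {total} hours")
for name, share in shares.items():
    print(f"{name}: {share:.1%}")
```

On these rounded figures, a little under half of the corpus is high-quality labeled speech.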

Methodology and Corpus Composition

WenetSpeech is drawn from online sources, namely YouTube and podcasts, and covers a variety of speaking styles, scenarios, domains, topics, and noise conditions. Candidate audio/text pairs are produced in two ways. For YouTube data, Optical Character Recognition (OCR) reads the captions embedded in the video and uses them to generate audio/text segmentation candidates. For podcast data, a high-quality ASR transcription system generates the candidates. A CTC-based end-to-end forced-alignment approach then detects label errors, so that the candidates can be validated and filtered before they enter the corpus.
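To make the forced-alignment step more concrete, here is a minimal sketch of plain CTC Viterbi alignment in pure Python. It is not the authors' label error detection method, which this summary does not describe in detail; it only shows the basic operation of aligning a known token sequence to per-frame log-probabilities. The function name `ctc_forced_align`, the blank index 0, and the toy probabilities are assumptions for illustration.

```python
import math


def ctc_forced_align(log_probs, tokens, blank=0):
    """Viterbi-align `tokens` to frames under CTC.

    log_probs: list of frames, each a list of per-label log-probabilities.
    Returns (best path log-score, list of per-frame label ids).
    """
    if not log_probs:
        raise ValueError("need at least one frame")
    # CTC extended sequence: blank, tok1, blank, tok2, ..., blank
    ext = [blank]
    for tok in tokens:
        ext += [tok, blank]
    T, S = len(log_probs), len(ext)
    NEG = -math.inf
    score = [[NEG] * S for _ in range(T)]
    back = [[0] * S for _ in range(T)]
    # A path may start in the leading blank or in the first token.
    score[0][0] = log_probs[0][ext[0]]
    if S > 1:
        score[0][1] = log_probs[0][ext[1]]
    for t in range(1, T):
        for s in range(S):
            # Predecessors: stay in place, advance by one state,
            # or skip the blank between two different tokens.
            cands = [(score[t - 1][s], s)]
            if s >= 1:
                cands.append((score[t - 1][s - 1], s - 1))
            if s >= 2 and ext[s] != blank and ext[s] != ext[s - 2]:
                cands.append((score[t - 1][s - 2], s - 2))
            best, prev = max(cands)
            if best > NEG:
                score[t][s] = best + log_probs[t][ext[s]]
                back[t][s] = prev
    # A valid path ends in the final blank or the final token.
    ends = [S - 1] if S == 1 else [S - 1, S - 2]
    s = max(ends, key=lambda i: score[T - 1][i])
    best_score = score[T - 1][s]
    if best_score == NEG:
        raise ValueError("too few frames to align the token sequence")
    # Trace back the best state sequence and map states to labels.
    path = [blank] * T
    for t in range(T - 1, -1, -1):
        path[t] = ext[s]
        s = back[t][s]
    return best_score, path


# Toy example: 4 frames, labels {0: blank, 1, 2}, transcript [1, 2].
toy = [[0.1, 0.8, 0.1], [0.8, 0.1, 0.1], [0.1, 0.1, 0.8], [0.1, 0.1, 0.8]]
toy_log_probs = [[math.log(p) for p in frame] for frame in toy]
score, path = ctc_forced_align(toy_log_probs, [1, 2])
print(path)  # [1, 0, 2, 2]
```

A transcript that matches the audio poorly yields a low alignment score, which is the intuition behind using alignment to flag suspect labels.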

The corpus is divided into Strong Label, Weak Label, and Others sets based on transcription confidence, with over 10,000 hours classified as Strong Label data. It spans ten domain categories, including audiobooks, commentary, documentaries, and drama. Drama accounts for a sizeable portion, which adds diverse acoustic scenarios and broadens the corpus’s usefulness across ASR tasks.
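The split by transcription confidence can be sketched as a simple thresholding step. This summary does not state the cut-off values, so the defaults below (`strong_min=0.95`, `weak_min=0.60`) are placeholders chosen for illustration, and the function names and segment IDs are hypothetical.

```python
def assign_subset(confidence, strong_min=0.95, weak_min=0.60):
    """Map a confidence in [0, 1] to a subset name.

    The default thresholds are illustrative placeholders, not values
    taken from this summary.
    """
    if not 0.0 <= confidence <= 1.0:
        raise ValueError("confidence must lie in [0, 1]")
    if confidence >= strong_min:
        return "strong"
    if confidence >= weak_min:
        return "weak"
    return "others"


def partition(segments, strong_min=0.95, weak_min=0.60):
    """Group (segment_id, confidence) pairs into the three subsets."""
    buckets = {"strong": [], "weak": [], "others": []}
    for segment_id, confidence in segments:
        subset = assign_subset(confidence, strong_min, weak_min)
        buckets[subset].append(segment_id)
    return buckets


# Made-up segment IDs and confidences, for illustration only.
example = [("seg_a", 0.99), ("seg_b", 0.80), ("seg_c", 0.30)]
print(partition(example))
```

Keeping the thresholds as parameters makes it easy to trade corpus size against label quality when re-partitioning.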

Evaluation and Benchmarks

The authors provide baseline systems trained on WenetSpeech for three ASR toolkits (Kaldi, ESPnet, and WeNet) and report benchmark results on three manually labeled test sets: Dev, Test_Net, and Test_Meeting. Dev is intended for cross-validation during training. Test_Net is collected from the Internet and serves as a matched test, while Test_Meeting is recorded from real meetings and serves as a harder, mismatched test, so together the sets probe how well a system generalizes. The Kaldi baseline uses lattice-free MMI and represents the conventional hybrid approach, while ESPnet's Conformer architecture and WeNet's U2 model represent recent end-to-end methods.
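Benchmark results of this kind are usually reported as an error rate per test set. This summary does not name the metric, so the sketch below assumes character error rate (CER), a common choice for Mandarin, computed from the Levenshtein distance over characters. The function names and the toy sentences are illustrative and are not data or results from the paper.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (here: character strings)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        cur = [i]
        for j, h in enumerate(hyp, 1):
            cur.append(min(
                prev[j] + 1,             # deletion
                cur[j - 1] + 1,          # insertion
                prev[j - 1] + (r != h),  # substitution or match
            ))
        prev = cur
    return prev[-1]


def cer(refs, hyps):
    """Character error rate: total edit errors over total reference characters."""
    if len(refs) != len(hyps):
        raise ValueError("refs and hyps must have the same length")
    total_chars = sum(len(r) for r in refs)
    if total_chars == 0:
        raise ValueError("references contain no characters")
    errors = sum(edit_distance(r, h) for r, h in zip(refs, hyps))
    return errors / total_chars


# Toy reference/hypothesis pair: one character is missing from the hypothesis.
refs = ["今天天气很好"]
hyps = ["今天天气好"]
print(f"CER: {cer(refs, hyps):.3f}")
```

Running the same function separately on Dev, Test_Net, and Test_Meeting hypotheses would give one figure per test set, which makes the matched versus mismatched comparison straightforward.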

Implications and Future Directions

WenetSpeech’s scale and multi-domain coverage support the development of more general and robust ASR systems, and they underline how much data diversity and quality matter in model development. The resource is expected to help academia and smaller research groups by giving them access to data closer in scale to industrial datasets, fostering progress in Mandarin ASR.

WenetSpeech also leaves room for future expansion and refinement. Possible developments include adding further data sources and improving transcription validation by drawing on advances in self-supervised and unsupervised learning.

In summary, WenetSpeech gives the research community a resource for exploring production-level ASR models and addresses the limited scale of earlier open-source Mandarin corpora. It is likely to encourage work on reducing error rates in real-world scenarios and on making Mandarin speech technology more accessible and effective.
