
AISHELL-1: An Open-Source Mandarin Speech Corpus and A Speech Recognition Baseline (1709.05522v1)

Published 16 Sep 2017 in cs.CL

Abstract: An open-source Mandarin speech corpus called AISHELL-1 is released. It is by far the largest corpus suitable for conducting speech recognition research and building speech recognition systems for Mandarin. The recording procedure, including the audio capture devices and recording environments, is presented in detail. The preparation of related resources, including transcriptions and the lexicon, is described. The corpus is released with a Kaldi recipe. Experimental results indicate that the quality of the audio recordings and transcriptions is promising.

Citations (759)

Summary

  • The paper introduces a large-scale 170-hour Mandarin speech corpus featuring diverse acoustic conditions and rich demographic metadata.
  • It presents a robust ASR baseline using GMM-HMM, TDNN-HMM, and LFMMI models, achieving character error rates as low as 6.44%.
  • The study validates the corpus's effectiveness across various devices and domains, fostering further research in Mandarin ASR.

Overview of AISHELL-1: An Open-Source Mandarin Speech Corpus and A Speech Recognition Baseline

The paper "AISHELL-1: An Open-Source Mandarin Speech Corpus and A Speech Recognition Baseline" introduces AISHELL-1, a substantial open-source Mandarin speech corpus. The resource aims to bridge the gap between industrial and academic communities by providing a publicly accessible dataset suitable for developing and benchmarking automatic speech recognition (ASR) systems for Mandarin.

Dataset Composition and Structure

AISHELL-1 is derived from the larger AISHELL-ASR0009 corpus, which encompasses 500 hours of multi-channel Mandarin speech data. AISHELL-1 itself comprises 170 hours of speech from 400 speakers, recorded using three categories of devices: high-fidelity microphones, Android phones, and iPhones. This variety ensures the dataset captures a broad range of acoustic conditions corresponding to typical usage scenarios.

The metadata associated with each speaker includes gender, age, accent, and birthplace, providing a comprehensive set of demographic information. The corpus is partitioned into training, development, and test sets with non-overlapping speaker sets, a property that is crucial for unbiased model training and evaluation.
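
To make the split structure concrete, the sketch below checks speaker disjointness from Kaldi-style `utt2spk` files. The `data/{train,dev,test}` layout is an assumption about how the released recipe organizes its data directories, not a detail stated in the paper:

```python
from itertools import combinations
from pathlib import Path

def speakers(utt2spk_path: str) -> set[str]:
    """Collect speaker IDs from a Kaldi-style utt2spk file
    (one '<utterance-id> <speaker-id>' pair per line)."""
    return {line.split()[1]
            for line in Path(utt2spk_path).read_text().splitlines()
            if line.strip()}

# Assumed recipe layout; adjust to wherever the data dirs actually live.
splits = {name: speakers(f"data/{name}/utt2spk")
          for name in ("train", "dev", "test")}

for a, b in combinations(splits, 2):
    overlap = splits[a] & splits[b]
    assert not overlap, f"{a}/{b} share speakers: {sorted(overlap)[:5]}"
print({name: len(spks) for name, spks in splits.items()})
```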

Transcription and Lexicon Creation

The transcription process for AISHELL-1 involved manual effort to remove content touching on sensitive subjects and to ensure high-quality text representations. The corpus covers five major domains: "Finance," "Science and Technology," "Sports," "Entertainment," and "News," which makes the dataset applicable to a range of real-world applications. Additionally, a detailed Mandarin lexicon accompanies AISHELL-1, derived from open-source resources and enriched to cover common Chinese words and characters, with pronunciations written as initial-final syllables carrying tone markers.
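
To illustrate the pronunciation format, the sketch below parses entries written in the common Kaldi Mandarin convention, where a word maps to initial and final phones and the tone digit is attached to the final. The sample entries illustrate the format only; they are not lines copied from the released lexicon:

```python
import re

# Format examples only (word followed by initial/final phones, tone on the final);
# not actual entries from the AISHELL-1 lexicon.
LEXICON = """\
你好 n i3 h ao3
中国 zh ong1 g uo2
"""

def parse_entry(line: str):
    """Split a lexicon line into the word and its (phone, tone) pairs."""
    word, *phones = line.split()
    syllables = []
    for ph in phones:
        m = re.fullmatch(r"([a-z]+)([1-5]?)", ph)
        syllables.append((m.group(1), m.group(2) or None))
    return word, syllables

for entry in LEXICON.splitlines():
    print(parse_entry(entry))
```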

Baseline ASR System

To validate the utility of AISHELL-1, the authors provide a baseline ASR system implemented in Kaldi. The baseline system employs a traditional GMM-HMM framework, followed by more advanced models including a TDNN-based DNN-HMM system and a lattice-free MMI (LFMMI) trained system. The experimental setup leverages both standard MFCC features augmented with pitch information and i-Vectors for speaker adaptation.
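
As a rough illustration of such a front end, the sketch below computes 13-dimensional MFCCs plus a log-pitch track with librosa and stacks them frame by frame. The actual recipe uses Kaldi's MFCC and pitch extractors (which append pitch, probability-of-voicing, and delta-pitch features), so treat this as an approximation; the file path is a placeholder:

```python
import numpy as np
import librosa

# Placeholder path to a 16 kHz mono utterance.
y, sr = librosa.load("utt.wav", sr=16000)

hop = 160  # 10 ms frame shift, a typical ASR front-end setting
mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13, hop_length=hop)  # (13, T)
f0 = librosa.yin(y, fmin=50, fmax=400, sr=sr,
                 frame_length=1024, hop_length=hop)                 # (T',)

# Align frame counts and append log-pitch as a 14th feature dimension.
T = min(mfcc.shape[1], len(f0))
feats = np.vstack([mfcc[:, :T], np.log(f0[:T])[None, :]])           # (14, T)
print(feats.shape)
```

Reported results for the three systems: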

  • GMM-HMM System: Uses MLLT and SAT techniques, achieving CERs of 10.43% on the development set and 12.23% on the test set.
  • TDNN-HMM System: Includes acoustic data augmentation and achieves CERs of 7.23% on the development set and 8.42% on the test set.
  • LFMMI System: The best performing model, reaching CERs of 6.44% on the development set and 7.62% on the test set.
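
All of the numbers above are character error rates (CER), the standard metric for Mandarin ASR given the ambiguity of word segmentation. A minimal sketch of the computation via Levenshtein alignment over characters (my own illustration, not the recipe's scoring script):

```python
def cer(ref: str, hyp: str) -> float:
    """Character error rate: edit distance between the two character
    sequences, normalized by the reference length."""
    r, h = list(ref), list(hyp)
    # dp[i][j] = edit distance between r[:i] and h[:j]
    dp = [[0] * (len(h) + 1) for _ in range(len(r) + 1)]
    for i in range(len(r) + 1):
        dp[i][0] = i
    for j in range(len(h) + 1):
        dp[0][j] = j
    for i in range(1, len(r) + 1):
        for j in range(1, len(h) + 1):
            dp[i][j] = min(dp[i-1][j-1] + (r[i-1] != h[j-1]),  # substitution/match
                           dp[i-1][j] + 1,                     # deletion
                           dp[i][j-1] + 1)                     # insertion
    return dp[len(r)][len(h)] / max(len(r), 1)

print(cer("北京欢迎你", "北京换迎你"))  # one substitution over five chars -> 0.2
```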

Evaluation on Diverse Acoustic Conditions

To assess the generalization of the trained models, the paper presents evaluations on mobile-device recordings and on the THCHS30 dataset. Both setups show that the models hold up outside the high-fidelity training condition. Notably, the LFMMI model provides the largest improvements across recording conditions, with CERs reduced to 10.90% on iPhone recordings and 10.09% on Android recordings. When evaluated on THCHS30, it achieves a CER of 25.00%, indicating that the models transfer across domains, albeit with a marked accuracy drop under the domain mismatch.

Implications and Future Directions

AISHELL-1 represents a substantial resource for the Mandarin ASR community and addresses the crucial need for publicly accessible data in academia. Its comprehensive coverage, high quality, and the inclusion of a detailed lexicon contribute to its value in developing robust and versatile ASR systems. Future work may explore extending the dataset to include more diverse speakers and noisy environments, enhancing the corpus's applicability to real-world scenarios.

Moreover, the availability of AISHELL-1 under the Apache 2.0 license encourages collaborative innovation and development, potentially leading to advancements in ASR technologies. Researchers can leverage this dataset to pursue challenges such as improved tonal recognition, better handling of regional accents, and the development of more adaptive and resilient acoustic models.

In summary, AISHELL-1 significantly contributes to the field of speech recognition by providing a large-scale, high-quality Mandarin speech corpus with a robust baseline system, fostering further research and advancement in Mandarin ASR technologies.