AISHELL-2: Transforming Mandarin ASR Research Into Industrial Scale

Published 31 Aug 2018 in cs.CL | (1808.10583v2)

Abstract: AISHELL-1 is by far the largest open-source speech corpus available for Mandarin speech recognition research. It was released with a baseline system containing solid training and testing pipelines for Mandarin ASR. In AISHELL-2, 1000 hours of clean read-speech data from iOS is published, which is free for academic usage. On top of the AISHELL-2 corpus, an improved recipe is developed and released, containing key components for industrial applications, such as Chinese word segmentation, flexible vocabulary expansion, and phone set transformation. Pipelines support various state-of-the-art techniques, such as time-delayed neural networks and the Lattice-Free MMI objective function. In addition, we also release dev and test data from other channels (Android and Mic). For the research community, we hope that the AISHELL-2 corpus can be a solid resource for topics like transfer learning and robust ASR. For industry, we hope the AISHELL-2 recipe can be a helpful reference for building meaningful industrial systems and products.

Summary

The paper introduces a 1000-hour Mandarin speech dataset to advance robust industrial-scale ASR research.
It details an advanced Kaldi pipeline that combines GMM-HMM and TDNN with lattice-free MMI for enhanced acoustic modeling.
The research underscores the value of open-source datasets to bridge academic innovations with practical, cross-channel ASR solutions.

An Overview of AISHELL-2: Advancing Mandarin ASR Research

The paper "AISHELL-2: Transforming Mandarin ASR Research Into Industrial Scale" presents the AISHELL-2 corpus, a significant contribution to automatic speech recognition (ASR) research for Mandarin. It responds to the need for large-scale, high-quality, open-source Mandarin speech datasets that can play the role ImageNet and COCO play in computer vision. AISHELL-2 aims to bridge the gap between academic research and industrial applications by providing a robust dataset and accompanying tools tailored to the intricacies of Mandarin ASR.

Composition and Characteristics of AISHELL-2

AISHELL-2 comprises 1000 hours of clean read-speech data recorded by 1991 speakers. The released training data comes from the iOS channel, while the development and test sets also include recordings from the Android and microphone (Mic) channels. The speaker pool covers a range of demographics and accents, and the content spans eight primary topics, which broadens the corpus's applicability across ASR applications. The development and test sets feature balanced gender representation and diversity in speaking environments.

Technical Foundation and Methodologies

The paper introduces a detailed ASR pipeline developed as part of the AISHELL-2 framework, emphasizing state-of-the-art techniques integrated into the Kaldi toolkit. The pipeline includes:

Lexicon and Word Segmentation: The recipe bases Mandarin word segmentation and pronunciation on DaCiDian, an open-source Chinese dictionary that maps words to PinYin syllables. Because the word layer is decoupled from the syllable layer, researchers can customize and extend the lexicon with ease, which facilitates experimentation with new vocabulary.
Acoustic Modeling: The paper outlines a two-phase approach in which GMM-HMM models are trained first and a neural network phase follows. A TDNN trained with the lattice-free MMI objective function is the cornerstone of the neural network phase.
Language Modeling: A trigram language model is trained on a substantial corpus of transcriptions, underscoring the importance of comprehensive language models in Mandarin ASR.
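
The decoupled lexicon idea can be illustrated with a minimal Python sketch. It assumes a two-layer layout (words to PinYin syllables, syllables to phones); the dictionary names, the entries, and the initial/final phone split are hypothetical illustrations, not content copied from DaCiDian or the AISHELL-2 recipe.

```python
# Hypothetical word layer: each word maps to its PinYin syllables (with tone digits).
# Entries are illustrative only and are not taken from DaCiDian.
WORD_TO_PINYIN = {
    "你好": ["ni3", "hao3"],
    "北京": ["bei3", "jing1"],
}

# Hypothetical syllable layer: each syllable maps to phones (initial + tonal final).
# A different phone set could be swapped in here without touching the word layer.
PINYIN_TO_PHONES = {
    "ni3": ["n", "i3"],
    "hao3": ["h", "ao3"],
    "bei3": ["b", "ei3"],
    "jing1": ["j", "ing1"],
}


def expand_word(word, word_to_pinyin, pinyin_to_phones):
    """Return the phone sequence for a word by chaining the two layers."""
    phones = []
    for syllable in word_to_pinyin[word]:
        phones.extend(pinyin_to_phones[syllable])
    return phones


def build_lexicon(word_to_pinyin, pinyin_to_phones):
    """Build a flat word-to-phones lexicon from the two layers."""
    return {
        word: expand_word(word, word_to_pinyin, pinyin_to_phones)
        for word in word_to_pinyin
    }


def add_word(word, syllables, word_to_pinyin):
    """Extend the vocabulary; only the word layer needs a new entry."""
    word_to_pinyin[word] = list(syllables)


if __name__ == "__main__":
    add_word("京北", ["jing1", "bei3"], WORD_TO_PINYIN)
    for word, phones in build_lexicon(WORD_TO_PINYIN, PINYIN_TO_PHONES).items():
        print(word, " ".join(phones))
```

In this sketch, adding a word touches only the word layer, and changing the phone set touches only the syllable layer, which is one way to read the vocabulary extension and phone set transformation mentioned in the abstract.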

Empirical Evaluation

The system is evaluated by character error rate (CER) on test data from each channel. Results are best on the iOS test data, because the training data was also recorded on iOS and the channel conditions therefore match. The GMM-HMM models achieve respectable CERs, and the TDNN models improve on them notably.
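
For readers unfamiliar with the metric, CER is conventionally the character-level edit distance between the reference and the hypothesis (substitutions, deletions, and insertions) divided by the number of reference characters. The following Python sketch shows that standard definition; it is not the scoring script from the AISHELL-2 recipe, and the function names and the choice to ignore whitespace are assumptions made for illustration.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences using a rolling DP row."""
    prev = list(range(len(hyp) + 1))  # distance from empty ref prefix
    for i, r in enumerate(ref, start=1):
        curr = [i]  # distance from ref[:i] to empty hyp prefix
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,         # deletion of a reference character
                curr[j - 1] + 1,     # insertion of a hypothesis character
                prev[j - 1] + cost,  # match or substitution
            ))
        prev = curr
    return prev[-1]


def cer(reference, hypothesis):
    """Character error rate: edit distance over reference length.

    Whitespace is dropped first (an assumption of this sketch), so word
    segmentation in either string does not affect the score.
    """
    ref_chars = [c for c in reference if not c.isspace()]
    hyp_chars = [c for c in hypothesis if not c.isspace()]
    if not ref_chars:
        raise ValueError("reference must contain at least one character")
    return edit_distance(ref_chars, hyp_chars) / len(ref_chars)


if __name__ == "__main__":
    # One substitution and one deletion against six reference characters.
    print(f"{cer('今天 天气 很好', '今天 天汽 好'):.3f}")
```

Scoring at the character level avoids making the error rate depend on how the Mandarin text happens to be segmented into words.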

Implications and Prospective Directions

By releasing AISHELL-2 and its corresponding Kaldi recipes, the authors provide a foundational resource for both academic and industrial stakeholders. The release supports work on robust ASR techniques and on scaling neural network-based methods for Mandarin. It also lays the groundwork for further research into transfer learning, enhanced language modeling, and cross-channel robustness.

This paper showcases the vital role of open-source resources in enabling advancements within the ASR domain, particularly for languages with complex linguistic structures such as Mandarin. The availability of datasets like AISHELL-2 potentially accelerates development of more resilient ASR systems, bridging gaps between theoretical advancements and real-world applications. Future research might explore leveraging AISHELL-2 for developing cross-linguistic models or incorporating it into multilingual ASR systems to broaden its applicability in global communication technologies.
