AISHELL-3: A Multi-speaker Mandarin TTS Corpus and the Baselines

Published 22 Oct 2020 in cs.SD and eess.AS | (2010.11567v2)

Abstract: In this paper, we present AISHELL-3, a large-scale and high-fidelity multi-speaker Mandarin speech corpus which could be used to train multi-speaker Text-to-Speech (TTS) systems. The corpus contains roughly 85 hours of emotion-neutral recordings spoken by 218 native Chinese Mandarin speakers. Their auxiliary attributes such as gender, age group and native accents are explicitly marked and provided in the corpus. Accordingly, transcripts in Chinese character-level and pinyin-level are provided along with the recordings. We present a baseline system that uses AISHELL-3 for multi-speaker Mandarin speech synthesis. The multi-speaker speech synthesis system is an extension on Tacotron-2 where a speaker verification model and a corresponding loss regarding voice similarity are incorporated as the feedback constraint. We aim to use the presented corpus to build a robust synthesis model that is able to achieve zero-shot voice cloning. The system trained on this dataset also generalizes well on speakers that are never seen in the training process. Objective evaluation results from our experiments show that the proposed multi-speaker synthesis system achieves high voice similarity concerning both speaker embedding similarity and equal error rate measurement. The dataset, baseline system code and generated samples are available online.

Summary

The paper introduces the AISHELL-3 dataset, an 85-hour multi-speaker Mandarin TTS corpus that enhances training for robust speech synthesis.
It integrates a Tacotron-2 based system with speaker embedding feedback, achieving a 4.56% EER for seen speakers and 9.46% for unseen voices.
Innovative preprocessing and prosody prediction techniques improve alignment and naturalness, supporting versatile applications in Mandarin TTS.

An Overview of AISHELL-3: A Multi-speaker Mandarin TTS Corpus and its Baselines

The paper introduces the AISHELL-3 dataset, a significant contribution to multi-speaker speech synthesis for Mandarin Chinese. The dataset comprises approximately 85 hours of high-quality recordings from 218 native Mandarin speakers. The recordings, made in a controlled acoustic environment, cover text from various domains, including smart home commands, news, and geographic information. This wide topical range makes the corpus applicable to diverse TTS systems. An essential feature of the dataset is its manually transcribed Chinese character-level and pinyin-level texts, paired with metadata such as gender, age group, and regional accent.

AISHELL-3 bridges a gap in resources available for TTS systems tailored to non-English languages. Given the tonal complexities and specific phonetic variations in Mandarin, the corpus provides an indispensable resource for developing TTS systems capable of mimicking diverse speaker characteristics. The structured provision of speaker attributes facilitates robust ML model training.

The authors develop a baseline multi-speaker TTS system that builds on the Tacotron-2 framework and integrates a speaker verification model. Using a speaker embedding feedback constraint, the system aims at zero-shot voice cloning, that is, synthesizing speech in the voice of a speaker never seen during training. The baseline comprises three components: a speaker-agnostic frontend, a Tacotron-2 based acoustic model, and a neural vocoder. Prosody prediction and preprocessing help the system handle variations in speech rhythm and intonation, which matter particularly in Mandarin.
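To make the idea of a feedback constraint concrete, the following is a minimal sketch, not the authors' implementation. It assumes the constraint takes the form of a cosine-distance penalty between the speaker embedding of the reference audio and that of the synthesized audio, added to the acoustic model's reconstruction loss. The function names and the `weight` parameter are hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def feedback_constraint_loss(recon_loss, ref_emb, syn_emb, weight=1.0):
    """Hypothetical combined loss: reconstruction term plus a speaker-similarity penalty.

    recon_loss: scalar reconstruction loss from the acoustic model (assumed given).
    ref_emb:    speaker embedding extracted from the reference recording.
    syn_emb:    speaker embedding extracted from the synthesized output.
    weight:     assumed trade-off factor between the two terms.
    """
    similarity_penalty = 1.0 - cosine_similarity(ref_emb, syn_emb)
    return recon_loss + weight * similarity_penalty
```

Under this assumed form, the penalty vanishes when the synthesized voice maps to the same embedding direction as the reference and grows as the two embeddings diverge.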

Objective evaluations, using speaker verification equal error rate (SV-EER) and cosine similarity between speaker embeddings, show high voice similarity for speakers both seen and unseen during training. The system achieves an EER of 4.56% for validation set speakers (seen during training), rising to 9.46% for test set speakers (unseen), which indicates a performance drop on new speaker identities while overall similarity remains reasonable.
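For readers unfamiliar with these two metrics, here is a small illustrative sketch in plain Python. It is not the paper's evaluation code. The threshold sweep is a simple approximation of the EER, and the score lists in the usage note are made-up values for illustration only.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length speaker embeddings."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def equal_error_rate(target_scores, nontarget_scores):
    """Approximate EER by sweeping thresholds over the observed scores.

    target_scores:    similarity scores for same-speaker trials.
    nontarget_scores: similarity scores for different-speaker trials.
    Returns the mean of the false acceptance and false rejection rates
    at the threshold where the two rates are closest.
    """
    thresholds = sorted(set(target_scores) | set(nontarget_scores))
    best_gap, best_eer = float("inf"), None
    for t in thresholds:
        # False rejection: a same-speaker trial scored below the threshold.
        frr = sum(s < t for s in target_scores) / len(target_scores)
        # False acceptance: a different-speaker trial scored at or above it.
        far = sum(s >= t for s in nontarget_scores) / len(nontarget_scores)
        gap = abs(far - frr)
        if gap < best_gap:
            best_gap, best_eer = gap, (far + frr) / 2
    return best_eer
```

With the toy scores `equal_error_rate([0.9, 0.8, 0.7, 0.4], [0.5, 0.3, 0.2, 0.1])`, one same-speaker trial and one different-speaker trial are misclassified at the best threshold, giving an EER of 0.25. A lower EER means synthesized speech is more reliably attributed to the intended speaker.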

The paper also highlights several innovative dataset preparation techniques aimed at improving model training efficacy and generalization abilities. These include silence trimming, long-form sentence augmentation, and prosodic label prediction, which mitigate traditional challenges like alignment instability in longer utterances and monotonous prosody. Such enhancements support a more nuanced and natural synthetic speech output.
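As an illustration of the first of these steps, the sketch below trims leading and trailing silence with a simple frame-energy rule. This is an assumed, generic approach rather than the procedure used in the paper. The `frame_len` and `threshold` values are hypothetical and would need tuning for real audio.

```python
def trim_silence(samples, frame_len=4, threshold=0.01):
    """Drop leading and trailing frames whose mean absolute amplitude is below threshold.

    samples:   list of float audio samples in [-1.0, 1.0].
    frame_len: number of samples per analysis frame (illustrative value).
    threshold: mean absolute amplitude below which a frame counts as silent.
    """
    frames = [samples[i:i + frame_len] for i in range(0, len(samples), frame_len)]
    voiced = [
        i for i, frame in enumerate(frames)
        if sum(abs(x) for x in frame) / len(frame) >= threshold
    ]
    if not voiced:
        return []  # the whole signal is below the threshold
    start = voiced[0] * frame_len
    end = min((voiced[-1] + 1) * frame_len, len(samples))
    # Silence between voiced frames is kept; only the edges are trimmed.
    return samples[start:end]
```

Keeping internal pauses while removing edge silence preserves the rhythm of the utterance and shortens the portion of the signal the attention mechanism has to align.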

From a theoretical perspective, the research underscores the value of language-specific corpora for TTS improvements, especially in tonal languages such as Mandarin. Practically, AISHELL-3 supports a broad range of applications in Mandarin-speaking contexts, from commercial voice assistants to automated narration systems.

Future research may explore narrowing the gap in speaker similarity between synthetic and real speech for unseen speakers, possibly through augmented feedback constraints or more sophisticated speaker representation techniques. The authors' work invites further investigation into prosodic and phonetic modeling to improve the naturalness and adaptability of TTS systems trained on AISHELL-3.

In conclusion, this paper provides a dataset and baseline system that equip researchers and practitioners to advance multi-speaker TTS research, particularly for the intricacies of the Mandarin language. The dataset is positioned as a potentially pivotal resource for subsequent developments in speech synthesis.