
Vakyansh: ASR Toolkit for Low Resource Indic languages (2203.16512v2)

Published 30 Mar 2022 in cs.CL and eess.AS

Abstract: We present Vakyansh, an end to end toolkit for Speech Recognition in Indic languages. India is home to almost 121 languages and around 125 crore speakers. Yet most of the languages are low resource in terms of data and pretrained models. Through Vakyansh, we introduce automatic data pipelines for data creation, model training, model evaluation and deployment. We create 14,000 hours of speech data in 23 Indic languages and train wav2vec 2.0 based pretrained models. These pretrained models are then finetuned to create state of the art speech recognition models for 18 Indic languages which are followed by language models and punctuation restoration models. We open source all these resources with a mission that this will inspire the speech community to develop speech first applications using our ASR models in Indic languages.

Citations (11)

Summary

  • The paper introduces Vakyansh, an ASR toolkit that automates data collection, preprocessing, and model training for low-resource Indic languages.
  • It employs a wav2vec 2.0-based architecture enhanced with language and punctuation models, achieving state-of-the-art ASR performance on 18 Indic languages.
  • The open-source toolkit fosters reproducibility, enabling researchers to build robust speech applications and extend the technology to diverse local dialects.

Vakyansh: An ASR Toolkit for Low Resource Indic Languages

The paper "Vakyansh: ASR Toolkit for Low Resource Indic languages" (2203.16512) introduces Vakyansh, an end-to-end toolkit for ASR in Indic languages. Its primary focus is the scarcity of data and pre-trained models for the many low-resource languages spoken in India. The toolkit encompasses automated pipelines for data creation, model training, evaluation, and deployment. The authors created 14,000 hours of speech data across 23 Indic languages and trained wav2vec 2.0-based pre-trained models. These models were then fine-tuned to develop state-of-the-art ASR models for 18 Indic languages, supplemented by language models and punctuation restoration models. All resources are open-sourced to encourage the development of speech-first applications in Indic languages.

Data Collection and Processing

The paper identifies the lack of open-domain data as the main challenge in creating ASR models for low-resource Indic languages. To overcome this, the authors employed automated and semi-automated methods to create datasets, targeting 10,000 hours of labeled data for Hindi. The data collection methods include transcript generation for languages with existing ASR models, forced alignment of text and audio, and expert annotation of audio.

The authors define a set of rules for a high-quality ASR dataset:

  • Audio data from diverse sources/domains.
  • Audio chunk durations between 1 and 15 seconds.
  • Speaker contribution limited to 90 minutes to prevent overfitting.
  • Minimal background noise and music.
  • Balanced gender representation.
  • Language consistency between audio and desired ASR language.
  • Low-noise transcription.

An audio processing pipeline was developed to satisfy these criteria, prioritizing open-domain audio. The pipeline incorporates:

  • Data discovery using web scraping for open audios, especially from YouTube.
  • Voice Activity Detection (VAD) using WebRTC-VAD to split longer audios into smaller chunks at points of silence, with an aggressiveness level of 2, a frame duration of 30 ms, and a padding duration of 300 ms (a code sketch follows this list).
  • Signal-to-Noise Ratio (SNR) calculation using WADA-SNR, keeping chunks with an SNR between 20 and 60 and filtering out the rest; 35% of the data was rejected at this stage.
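
Below is a minimal sketch of the VAD chunking step, assuming the py-webrtcvad Python package and 16 kHz, 16-bit mono PCM input; the paper does not publish its splitting code, and the chunk-assembly logic here is simplified relative to the 300 ms padded collector its settings imply.

```python
import webrtcvad

vad = webrtcvad.Vad(2)       # aggressiveness level 2, as reported
SAMPLE_RATE = 16000          # webrtcvad accepts 8/16/32/48 kHz audio
FRAME_MS = 30                # 30 ms frames, as reported
FRAME_BYTES = SAMPLE_RATE * FRAME_MS // 1000 * 2  # bytes per 16-bit frame

def split_on_silence(pcm: bytes) -> list:
    """Group consecutive speech frames into chunks, splitting on silence."""
    chunks, current = [], bytearray()
    for start in range(0, len(pcm) - FRAME_BYTES + 1, FRAME_BYTES):
        frame = pcm[start:start + FRAME_BYTES]
        if vad.is_speech(frame, SAMPLE_RATE):
            current.extend(frame)
        elif current:                      # silence closes the current chunk
            chunks.append(bytes(current))  # 1-15 s filtering happens downstream
            current = bytearray()
    if current:
        chunks.append(bytes(current))
    return chunks
```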

Speaker clustering was performed using Resemblyzer to derive 256-dimensional embeddings, followed by HDBSCAN clustering. This approach achieved approximately 96% accuracy in speaker identification on a balanced 20-hour Hindi dataset with 80 speakers. Gender identification was performed using an SVM with an RBF kernel, achieving 97% accuracy on an 8-hour test set drawn from several languages. Data was labeled using commercial STT engines and, for very low-resource languages, by language experts. Additionally, a forced-alignment pipeline was used for data with large audio files and corresponding unaligned transcripts, using the espeak TTS engine and DTW to align MFCC representations.
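
The speaker-clustering step can be approximated with off-the-shelf tools. Here is a minimal sketch assuming the resemblyzer and hdbscan packages; the directory name and min_cluster_size value are illustrative, not taken from the paper.

```python
from pathlib import Path
import numpy as np
import hdbscan
from resemblyzer import VoiceEncoder, preprocess_wav

encoder = VoiceEncoder()  # yields 256-dimensional speaker embeddings

wav_paths = sorted(Path("chunks").glob("*.wav"))  # hypothetical chunk folder
embeddings = np.stack(
    [encoder.embed_utterance(preprocess_wav(p)) for p in wav_paths]
)

# Density-based clustering: the number of speakers need not be known upfront.
clusterer = hdbscan.HDBSCAN(min_cluster_size=5, metric="euclidean")
speaker_labels = clusterer.fit_predict(embeddings)  # -1 marks unclustered noise
```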

Text Processing and Normalization

The text post-processing pipeline includes:

  • Removal of punctuation and special characters without pronunciation.
  • Vocabulary definition with language experts, removing utterances with foreign characters.
  • Removal of numerals, so that the model learns to output numbers as words.
  • Unicode normalization from NFC to NFD using the Indic NLP Library, reducing vocabulary size and improving recognition of rare symbols (a sketch of these steps follows the list).
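
A minimal sketch of these cleaning steps, assuming Python's standard-library unicodedata for the NFC-to-NFD conversion; the punctuation set, digit filter, and vocabulary check below are illustrative stand-ins for the paper's exact rules.

```python
import re
import unicodedata

PUNCT = re.compile(r"[\"'.,!?;:()\[\]{}\-।]")  # symbols with no pronunciation

def clean_transcript(text: str, vocabulary: set):
    text = PUNCT.sub(" ", text)
    if re.search(r"[0-9०-९]", text):    # drop utterances containing numerals
        return None                      # so the model learns numbers as words
    if any(ch not in vocabulary and not ch.isspace() for ch in text):
        return None                      # reject foreign characters
    text = unicodedata.normalize("NFD", text)  # NFC -> NFD decomposition
    return re.sub(r"\s+", " ", text).strip()
```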

During train/validation/test splitting, speaker overlap was avoided to ensure a fair estimate of model quality.
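
A speaker-disjoint split of this kind can be expressed, for example, with scikit-learn's GroupShuffleSplit; the utterance list and speaker IDs below are illustrative.

```python
from sklearn.model_selection import GroupShuffleSplit

utterances = ["u1.wav", "u2.wav", "u3.wav", "u4.wav"]  # hypothetical chunks
speaker_ids = ["spk0", "spk0", "spk1", "spk2"]         # e.g. cluster labels

splitter = GroupShuffleSplit(n_splits=1, test_size=0.25, random_state=0)
train_idx, test_idx = next(splitter.split(utterances, groups=speaker_ids))
# Every speaker lands in exactly one of the two splits, so no speaker overlap.
```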

Model Architecture and Training

The experimentation platform is built on fairseq and uses wandb for experiment tracking. The models are based on wav2vec 2.0, and training proceeds in two stages, pre-training and fine-tuning, with WER and CER as evaluation metrics. For pre-training, the base wav2vec 2.0 architecture (12 transformer blocks, model dimension 768, 8 attention heads) was used, initialized from a checkpoint trained on 960 hours of LibriSpeech data. The model was trained for 300,000 steps with a learning rate of 0.0005, using the Adam optimizer with 32,000 warmup steps and a diversity loss weight of α = 0.1.
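
The stated schedule (peak learning rate 0.0005, 32,000 warmup steps, 300,000 total steps) can be sketched in PyTorch as follows; the placeholder model and the linear decay after warmup are assumptions, since the paper does not describe the decay shape.

```python
import torch

TOTAL_STEPS, WARMUP_STEPS, PEAK_LR = 300_000, 32_000, 5e-4

model = torch.nn.Linear(768, 768)  # placeholder for the wav2vec 2.0 encoder
optimizer = torch.optim.Adam(model.parameters(), lr=PEAK_LR)

def lr_scale(step: int) -> float:
    if step < WARMUP_STEPS:          # linear warmup to the peak learning rate
        return step / WARMUP_STEPS
    # Linear decay to zero after warmup (assumed, not stated in the paper).
    return max(0.0, (TOTAL_STEPS - step) / (TOTAL_STEPS - WARMUP_STEPS))

scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_scale)
# Call scheduler.step() once per training step, after optimizer.step().
```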

Fine-tuning was performed by adding a fully connected layer on top of the context network, with an output size equal to the language's vocabulary, optimized using CTC loss. The language model is a statistical KenLM model trained on the IndicCorp corpus, cleaned of characters and words outside the ASR vocabulary and restricted to the 500,000 most frequent words. Decoding uses a beam width of 128, a word insertion penalty of -1, and an LM weight of 2.
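
A hedged sketch of the fine-tuning head and LM-fused decoding follows, assuming PyTorch for the CTC head and the pyctcdecode package for beam search (the paper does not name its decoder); the vocabulary, dimensions, and LM file name are illustrative.

```python
import numpy as np
import torch.nn as nn
from pyctcdecode import build_ctcdecoder

VOCAB = list("abcdefghijklmnopqrstuvwxyz ")  # stand-in for an Indic vocabulary
CONTEXT_DIM = 768                            # wav2vec 2.0 base model dimension

# Fully connected layer over the context network, sized to the vocabulary,
# with one extra output for the CTC blank symbol.
head = nn.Linear(CONTEXT_DIM, len(VOCAB) + 1)
ctc_loss = nn.CTCLoss(blank=len(VOCAB), zero_infinity=True)

# KenLM-fused beam search with the reported settings: beam width 128,
# LM weight 2, word insertion penalty -1. The empty string marks the
# blank column (pyctcdecode's convention).
decoder = build_ctcdecoder(
    VOCAB + [""],
    kenlm_model_path="indic_lm.arpa",  # hypothetical KenLM file
    alpha=2.0,                         # LM weight
    beta=-1.0,                         # word insertion penalty
)
logits = np.random.randn(200, len(VOCAB) + 1)  # (time, vocab) from the head
text = decoder.decode(logits, beam_width=128)
```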

Results and Post-ASR Processing

The ASR output is post-processed with punctuation restoration and inverse text normalization (ITN). Punctuation restoration is posed as a token classification task using IndicBERT, while ITN is handled by rule-based weighted finite-state transducers (WFSTs). The results show that adding a language model improves both WER and CER across most of the 18 Indic languages tested. However, in some cases a language model can increase WER when the test-set text differs significantly from the LM training text.
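
Punctuation restoration as token classification can be sketched with the Hugging Face transformers library and the ai4bharat/indic-bert checkpoint; the label set is an assumption, and the classification head below is randomly initialized, so it would still need fine-tuning on punctuation-annotated text.

```python
from transformers import AutoModelForTokenClassification, AutoTokenizer

LABELS = ["O", "COMMA", "PERIOD", "QUESTION"]  # assumed punctuation classes

tokenizer = AutoTokenizer.from_pretrained("ai4bharat/indic-bert")
model = AutoModelForTokenClassification.from_pretrained(
    "ai4bharat/indic-bert", num_labels=len(LABELS)
)

inputs = tokenizer("यह एक उदाहरण है", return_tensors="pt")
logits = model(**inputs).logits          # (1, sequence_length, num_labels)
predictions = logits.argmax(dim=-1)      # one punctuation label per token
```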

Conclusions

The Vakyansh toolkit provides a strong foundation for researchers in the Indic speech community to develop speech applications in local languages. The open-source contribution aims to advance the state of the art in Indic speech recognition and create resources for low-resource languages.