Unsupervised Speech Recognition (2105.11084v3)

Published 24 May 2021 in cs.CL, cs.SD, and eess.AS

Abstract: Despite rapid progress in the recent past, current speech recognition systems still require labeled training data which limits this technology to a small fraction of the languages spoken around the globe. This paper describes wav2vec-U, short for wav2vec Unsupervised, a method to train speech recognition models without any labeled data. We leverage self-supervised speech representations to segment unlabeled audio and learn a mapping from these representations to phonemes via adversarial training. The right representations are key to the success of our method. Compared to the best previous unsupervised work, wav2vec-U reduces the phoneme error rate on the TIMIT benchmark from 26.1 to 11.3. On the larger English Librispeech benchmark, wav2vec-U achieves a word error rate of 5.9 on test-other, rivaling some of the best published systems trained on 960 hours of labeled data from only two years ago. We also experiment on nine other languages, including low-resource languages such as Kyrgyz, Swahili and Tatar.

Citations (255)

Summary

  • The paper presents a novel unsupervised approach that maps unlabeled audio to phonemes using self-supervised wav2vec 2.0 and adversarial training.
  • It reduces the phoneme error rate on TIMIT from 26.1 to 11.3 and achieves a word error rate of 5.9 on Librispeech test-other.
  • The research advances low-resource speech recognition by eliminating the need for labeled data, paving the way for broader language applications.

Overview of "Unsupervised Speech Recognition"

This paper introduces wav2vec-U, a novel approach to training speech recognition models without any labeled data. The method leverages self-supervised representations to segment unlabeled audio and to map the resulting segments to phonemes via adversarial training, substantially improving on previous unsupervised automatic speech recognition (ASR) methods.

Core Contributions

  1. Self-supervised Representations: The method builds on wav2vec 2.0 self-supervised speech representations, which prove critical both for segmenting the audio and for mapping segments to phonemes.
  2. Adversarial Training: The model learns phoneme mappings adversarially, a novel application of GANs to unsupervised ASR (see the sketch after this list).
  3. Unsupervised Metric for Model Validation: A cross-validation metric based on language-model fluency and vocabulary usage allows model development and checkpoint selection without any labeled data (a toy version follows the adversarial sketch below).
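
The following is a minimal PyTorch sketch of the adversarial setup in contribution 2. The layer shapes, feature dimension, phoneme vocabulary size, and logistic GAN loss are illustrative assumptions, not the authors' fairseq implementation (which adds further regularizers):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

D, V = 512, 40  # assumed: PCA-reduced feature dim, phoneme vocabulary size

# Generator: maps frozen segment representations to phoneme distributions.
generator = nn.Conv1d(D, V, kernel_size=4, padding=2)

# Discriminator: scores phoneme sequences (one-hot real text vs. generated softmax).
discriminator = nn.Sequential(
    nn.Conv1d(V, 256, kernel_size=6, padding=3),
    nn.LeakyReLU(0.2),
    nn.Conv1d(256, 1, kernel_size=6, padding=3),
)

opt_g = torch.optim.Adam(generator.parameters(), lr=1e-4)
opt_d = torch.optim.Adam(discriminator.parameters(), lr=1e-4)

def gan_step(segments, real_text):
    """segments:  (B, D, T)  pooled wav2vec 2.0 segment features
       real_text: (B, V, T') one-hot phonemized unpaired text"""
    fake = F.softmax(generator(segments), dim=1)

    # Discriminator update: push real phoneme text up, generated output down.
    d_loss = (F.softplus(-discriminator(real_text)).mean()
              + F.softplus(discriminator(fake.detach())).mean())
    opt_d.zero_grad(); d_loss.backward(); opt_d.step()

    # Generator update: make the output indistinguishable from real phoneme text.
    g_loss = F.softplus(-discriminator(fake)).mean()
    opt_g.zero_grad(); g_loss.backward(); opt_g.step()
    return d_loss.item(), g_loss.item()
```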
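
Contribution 3 can likewise be sketched in a few lines. Everything here is an assumption about the general shape of such a metric rather than the paper's exact formula: `lm_logprob` stands in for a language model trained on the unpaired text, and the way the two terms are combined is arbitrary.

```python
def unsupervised_score(transcripts, lm_logprob, vocab):
    """Hypothetical model-selection score; lower is better.
    transcripts: phoneme-token lists produced by a candidate model on held-out audio.
    lm_logprob:  callable giving the log-probability of a sequence under a
                 language model trained on the unpaired text (assumed given).
    vocab:       phoneme inventory of that text."""
    total_tokens = sum(len(t) for t in transcripts)
    # Fluency: average negative log-likelihood under the language model.
    nll = -sum(lm_logprob(t) for t in transcripts) / max(total_tokens, 1)
    # Vocabulary usage: penalize degenerate models that emit few distinct phonemes.
    used = {p for t in transcripts for p in t}
    usage = len(used & set(vocab)) / len(vocab)
    return nll / max(usage, 1e-6)
```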

Numerical Achievements

The paper reports a reduction in phoneme error rate (PER) on the TIMIT benchmark from 26.1 to 11.3 relative to the best previous unsupervised work. On the Librispeech benchmark, wav2vec-U achieves a word error rate (WER) of 5.9 on test-other, rivaling some of the best supervised systems trained on 960 hours of labeled data that were published only two years earlier.

Methodological Details

  • Speech Segmentation: k-means clustering on wav2vec 2.0 representations identifies segment boundaries; segments are then mean-pooled and PCA-reduced into robust representations (see the first sketch after this list).
  • Text Pre-processing: Unpaired text is phonemized and silence tokens are inserted so the text distribution better matches the pauses present in real audio (see the second sketch after this list).
  • Model Architecture: The generator is a simple convolutional neural network that reads frozen wav2vec 2.0 segment representations and outputs phoneme distributions, as in the adversarial sketch above; its small parameter count keeps training lightweight.
  • Performance Across Languages: The method proves effective across multiple languages, from those in the MLS dataset to low-resource languages such as Kyrgyz, Swahili, and Tatar, demonstrating its versatility in low-resource settings.
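
A rough scikit-learn sketch of the segmentation step referenced above; the cluster count, PCA dimension, and per-utterance fitting are simplifying assumptions (in practice k-means and PCA would be fit on corpus-level statistics).

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA

def segment_representations(feats, n_clusters=128, pca_dim=512):
    """feats: (T, D) frame-level wav2vec 2.0 features for one utterance.
    Returns one pooled, PCA-reduced vector per predicted segment."""
    # 1. Cluster frames; a change of cluster ID marks a segment boundary.
    ids = KMeans(n_clusters=min(n_clusters, len(feats)), n_init=10).fit_predict(feats)
    boundaries = np.flatnonzero(np.diff(ids)) + 1
    segments = np.split(feats, boundaries)

    # 2. Mean-pool the frames inside each segment.
    pooled = np.stack([seg.mean(axis=0) for seg in segments])

    # 3. Reduce dimensionality with PCA for a more robust representation
    #    (fit per utterance here only to keep the sketch self-contained).
    k = min(pca_dim, pooled.shape[0], pooled.shape[1])
    return PCA(n_components=k).fit_transform(pooled)
```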
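
And a toy version of the text pre-processing step; the `g2p` callable and the insertion probability are illustrative assumptions rather than the paper's exact recipe.

```python
import random

SIL = "<SIL>"

def phonemize(words, g2p, sil_prob=0.25):
    """words: list of word strings from the unpaired text corpus.
    g2p: callable mapping a word to its phoneme list (e.g. an off-the-shelf
    phonemizer). Inserts silence tokens so the phonemized text better
    matches the pauses present in real speech."""
    out = [SIL]                        # utterances typically begin with silence
    for word in words:
        out.extend(g2p(word))
        if random.random() < sil_prob:
            out.append(SIL)            # random silence between words
    if out[-1] != SIL:
        out.append(SIL)                # ...and end with silence
    return out
```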

Implications and Future Directions

The findings suggest that speech recognition can be extended to the many world languages currently underserved because of the technology's reliance on labeled datasets. Future research could explore:

  • Cross-lingual Phonemization Strategies: Addressing the dependence on language-specific phonemizers by developing universal phonemization techniques.
  • Segmentation Optimization: Refining segmentation techniques could further improve phoneme mapping precision, benefiting from variable-sized representation learning.
  • Enhanced Self-training: Further iterations and refinements in self-training strategies could yield improvements, particularly in low-resource settings.

This research represents a significant stride towards democratizing speech recognition technology, emphasizing the role of unsupervised methods in advancing AI models for global linguistic diversity.
