Self-Training for End-to-End Speech Recognition

Published 19 Sep 2019 in cs.CL, cs.LG, and eess.AS | (1909.09116v2)

Abstract: We revisit self-training in the context of end-to-end speech recognition. We demonstrate that training with pseudo-labels can substantially improve the accuracy of a baseline model. Key to our approach are a strong baseline acoustic and language model used to generate the pseudo-labels, filtering mechanisms tailored to common errors from sequence-to-sequence models, and a novel ensemble approach to increase pseudo-label diversity. Experiments on the LibriSpeech corpus show that with an ensemble of four models and label filtering, self-training yields a 33.9% relative improvement in WER compared with a baseline trained on 100 hours of labelled data in the noisy speech setting. In the clean speech setting, self-training recovers 59.3% of the gap between the baseline and an oracle model, which is at least 93.8% relatively higher than what previous approaches can achieve.

Summary

The paper introduces a self-training framework using pseudo-labels to enhance end-to-end ASR models with limited labelled data.
It employs heuristic and confidence-based filtering along with an ensemble approach to generate high-quality pseudo-labels and mitigate sequence errors.
Experiments on the LibriSpeech corpus show up to a 33.9% relative improvement in WER, underscoring the practical benefits of this method.

Self-Training for End-to-End Speech Recognition

The paper under review explores self-training for end-to-end automatic speech recognition (ASR). The authors, Jacob Kahn, Ann Lee, and Awni Hannun, investigate how training on pseudo-labels can improve the performance of sequence-to-sequence ASR models. The focus is on using self-training to narrow the gap between models trained with limited labelled data and those trained with larger labelled datasets.

Key Contributions

Baseline and Pseudo-Label Generation: The researchers use a strong baseline, combining robust acoustic and language models, to generate pseudo-labels. The strength of this baseline determines the quality of the pseudo-labels, which in turn influences model performance during self-training.
Label Filtering Mechanism: Two filtering strategies—heuristic and confidence-based methods—are applied to mitigate common sequence-to-sequence model errors such as erroneous looping and premature stopping. The filtering mechanism plays a critical role in removing noisy labels, thereby improving the quality of the pseudo-labels.
Ensemble Approach: The introduction of an ensemble method to diversify pseudo-labels is noteworthy. This method leverages multiple models to generate pseudo-labels, which enhances label diversity and prevents overconfidence in erroneous labels.
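The filtering and ensemble ideas above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the `Hypothesis` type, the `EOS` token, the n-gram repeat rule used to detect looping, the length-normalized log-likelihood used as a confidence score, and all threshold values are hypothetical choices made for the example.

```python
import random
from collections import Counter
from dataclasses import dataclass
from typing import Dict, List

EOS = "</s>"  # hypothetical end-of-sentence token


@dataclass
class Hypothesis:
    tokens: List[str]  # decoded tokens, ending with EOS if the decoder emitted it
    log_prob: float    # total log-likelihood assigned by the decoder


def has_looping(tokens: List[str], n: int = 4, max_repeats: int = 2) -> bool:
    """Heuristic filter: True if any n-gram occurs more than max_repeats times."""
    counts = Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    return any(c > max_repeats for c in counts.values())


def stopped_early(tokens: List[str]) -> bool:
    """Heuristic filter: True if the hypothesis does not end with EOS."""
    return not tokens or tokens[-1] != EOS


def confidence(hyp: Hypothesis) -> float:
    """Confidence score (assumed here): length-normalized log-likelihood."""
    return hyp.log_prob / max(len(hyp.tokens), 1)


def keep(hyp: Hypothesis, min_confidence: float = -1.0) -> bool:
    """Apply both heuristic filters and the confidence threshold."""
    return (not has_looping(hyp.tokens)
            and not stopped_early(hyp.tokens)
            and confidence(hyp) >= min_confidence)


def ensemble_targets(per_model: List[Dict[str, Hypothesis]],
                     rng: random.Random) -> Dict[str, List[str]]:
    """For each utterance, pick one surviving pseudo-label at random
    from the models in the ensemble; drop utterances with none."""
    targets: Dict[str, List[str]] = {}
    utt_ids = set().union(*(m.keys() for m in per_model))
    for utt in sorted(utt_ids):
        candidates = [m[utt] for m in per_model if utt in m and keep(m[utt])]
        if candidates:
            targets[utt] = rng.choice(candidates).tokens
    return targets


# Toy usage with two hypothetical models.
model_a = {"u1": Hypothesis(["hello", "world", EOS], -1.5),
           "u2": Hypothesis(["hi"], -0.1)}                      # no EOS
model_b = {"u1": Hypothesis("a b c d a b c d a b c d".split() + [EOS], -2.0),  # loops
           "u2": Hypothesis(["so", "so", "so"], -9.0)}          # no EOS
targets = ensemble_targets([model_a, model_b], random.Random(0))
# u1 keeps model A's label; u2 is dropped because no hypothesis survives.
```

Sampling among the surviving hypotheses, rather than always taking the single best one, is one simple way to expose the model to more varied targets; re-running the sampling at each epoch would vary them further.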

Experimental Setup

The experiments use the LibriSpeech corpus, with separate clean and noisy speech settings. In both, the baseline is trained on 100 hours of labelled data. In the clean setting, 360 hours of additional clean unlabelled data are pseudo-labelled, and self-training recovers 59.3% of the WER (Word Error Rate) gap between the baseline and an oracle model, a recovery that the abstract reports as at least 93.8% relatively higher than what previous approaches achieve. In the more challenging noisy setting, an ensemble of four models with label filtering yields a 33.9% relative improvement in WER over the baseline, which demonstrates the effectiveness of the filtering strategies in managing the noise inherent in pseudo-labelled data.
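The two kinds of figure quoted in this section, relative WER improvement and the share of the baseline-to-oracle gap recovered, are easy to confuse. The sketch below shows how each is conventionally computed. The function names and the WER values are hypothetical and chosen only to make the arithmetic easy to follow; they are not results from the paper.

```python
def relative_improvement(baseline_wer: float, new_wer: float) -> float:
    """Percentage reduction in WER relative to the baseline."""
    return 100.0 * (baseline_wer - new_wer) / baseline_wer


def gap_recovered(baseline_wer: float, new_wer: float, oracle_wer: float) -> float:
    """Percentage of the baseline-to-oracle WER gap closed by the new model."""
    return 100.0 * (baseline_wer - new_wer) / (baseline_wer - oracle_wer)


# Hypothetical WERs (in percent), for illustration only.
baseline, self_trained, oracle = 10.0, 7.0, 5.0
print(relative_improvement(baseline, self_trained))  # 30.0
print(gap_recovered(baseline, self_trained, oracle))  # 60.0
```

With these made-up numbers, the same model shows a 30% relative improvement but recovers 60% of the gap, because the second metric is normalized by how far the oracle is from the baseline rather than by the baseline WER itself.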

Implications and Limitations

Practical Implications: The study underscores the potential of self-training to exploit large volumes of unlabelled audio—thereby circumventing the high costs associated with labelling. This is particularly beneficial in resource-constrained environments where labelled data are scarce.
Theoretical Insights: From a theoretical standpoint, the paper advances the understanding of end-to-end model training dynamics, particularly in scenarios with limited labelled data. The insights into filtering and ensemble methods can guide future research on the integration of semi-supervised strategies in ASR systems.

Future Directions

The findings suggest several promising areas for future research. Enhancing the robustness of self-training by integrating advances in domain adaptation could further improve performance in diverse acoustic environments. Moreover, extending the framework to multilingual ASR systems or incorporating more sophisticated confidence estimation algorithms could uncover additional gains in model accuracy and robustness.

In summary, the paper provides valuable insights into augmenting end-to-end speech recognition models using self-training. The methodological innovations presented form a concrete benchmark for future semi-supervised learning approaches in automatic speech recognition, offering both practical benefits and theoretical developments in the field.
