End-to-end ASR: from Supervised to Semi-Supervised Learning with Modern Architectures

Published 19 Nov 2019 in cs.CL, cs.SD, and eess.AS | (1911.08460v3)

Abstract: We study pseudo-labeling for the semi-supervised training of ResNet, Time-Depth Separable ConvNets, and Transformers for speech recognition, with either CTC or Seq2Seq loss functions. We perform experiments on the standard LibriSpeech dataset, and leverage additional unlabeled data from LibriVox through pseudo-labeling. We show that while Transformer-based acoustic models have superior performance with the supervised dataset alone, semi-supervision improves all models across architectures and loss functions and bridges much of the performance gaps between them. In doing so, we reach a new state-of-the-art for end-to-end acoustic models decoded with an external language model in the standard supervised learning setting, and a new absolute state-of-the-art with semi-supervised training. Finally, we study the effect of leveraging different amounts of unlabeled audio, propose several ways of evaluating the characteristics of unlabeled audio which improve acoustic modeling, and show that acoustic models trained with more audio rely less on external language models.

Authors (9)

Citations (241)

Summary

The paper studies pseudo-labeling for the semi-supervised training of ASR models and shows that it improves performance across ResNet, TDS, and Transformer architectures with both CTC and Seq2Seq losses.
Experiments demonstrate that while Transformers excel in supervised training, semi-supervised methods effectively bridge performance gaps among ResNet, TDS, and Transformer models.
Increasing unlabeled audio data reduces dependence on external language models, suggesting a trend toward more robust, integrated acoustic representations.

End-to-End ASR: From Supervised to Semi-Supervised Learning with Modern Architectures

The study by Synnaeve et al. examines pseudo-labeling for the semi-supervised training of modern end-to-end speech recognition architectures: ResNet, Time-Depth Separable (TDS) ConvNets, and Transformers, each trained with either Connectionist Temporal Classification (CTC) or sequence-to-sequence (Seq2Seq) loss functions. The experiments use the LibriSpeech dataset as labeled data and draw additional unlabeled audio from LibriVox, which is pseudo-labeled for the semi-supervised training of automatic speech recognition (ASR) models.

Key Findings

The authors report that Transformer-based models perform best when trained on the supervised data alone. Semi-supervised training nevertheless improves every architecture and loss function studied, and it closes much of the performance gap between them. With these results the authors report a new state of the art in Word Error Rate (WER) for end-to-end acoustic models decoded with an external language model in the standard supervised setting, and a new absolute state of the art with semi-supervised training.
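Since the results above are stated in terms of WER, a small worked example may help. The sketch below computes WER as the word-level edit distance divided by the number of reference words. The function name `word_error_rate` and the whitespace tokenization are assumptions made for illustration; they are not taken from the paper's evaluation code.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / reference word count."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] is the edit distance between the reference words processed
    # so far (minus the current one) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,         # deletion of a reference word
                curr[j - 1] + 1,     # insertion of a hypothesis word
                prev[j - 1] + cost,  # match or substitution
            )
        prev = curr
    return prev[-1] / len(ref)


# One substitution ("sat" -> "sit") and one deletion ("the"):
# 2 errors over 6 reference words.
wer = word_error_rate("the cat sat on the mat", "the cat sit on mat")
```

Because insertions count as errors, WER can exceed 1.0 when the hypothesis is much longer than the reference.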

The study also examines how the amount of unlabeled audio affects model performance. The authors show that acoustic models trained with more audio rely less on external language models, indicating more robust acoustic representations.
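To make "dependence on an external language model" concrete, the sketch below shows one common way an external language model enters decoding: each candidate transcript's acoustic-model log-probability is combined with a weighted language-model log-probability and a per-word bonus. This summary does not state the paper's exact scoring rule, so the formulation, the function `rescore`, and the parameters `lm_weight` and `word_bonus` are assumptions for illustration only.

```python
from typing import Callable, List, Tuple


def rescore(
    candidates: List[Tuple[str, float]],
    lm_log_prob: Callable[[str], float],
    lm_weight: float,
    word_bonus: float = 0.0,
) -> str:
    """Return the candidate with the best combined score.

    Each candidate is (text, acoustic-model log-probability).
    Combined score (an assumed, commonly used form):
        am_log_prob + lm_weight * lm_log_prob(text) + word_bonus * word_count
    """
    def score(item: Tuple[str, float]) -> float:
        text, am_log_prob = item
        return (
            am_log_prob
            + lm_weight * lm_log_prob(text)
            + word_bonus * len(text.split())
        )

    return max(candidates, key=score)[0]


# Toy inputs with made-up scores, purely for illustration.
toy_candidates = [("recognize speech", -4.0), ("wreck a nice beach", -3.5)]
toy_lm = {"recognize speech": -2.0, "wreck a nice beach": -6.0}

without_lm = rescore(toy_candidates, toy_lm.__getitem__, lm_weight=0.0)
with_lm = rescore(toy_candidates, toy_lm.__getitem__, lm_weight=1.0)
```

With `lm_weight=0.0` the acoustic score alone decides. A model that relies less on the language model is one whose accuracy changes little between the two settings.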

Experimental Approach

The models are first trained on LibriSpeech under the supervised paradigm. A model trained this way then generates pseudo-labels for unlabeled audio drawn from the LibriVox repository, and the acoustic models are trained on the resulting pseudo-labeled corpus (self-training). This approach lets end-to-end acoustic models reach performance comparable to that of traditional, more complex training pipelines that involve forced alignment.
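The self-training procedure described above can be sketched as a short data-flow example. Everything here is a toy stand-in: `toy_train` is a hypothetical trainer that memorises a token-to-word table, "audio" is represented as a string of made-up frame tokens, and training the second model on the union of labeled and pseudo-labeled data is an assumption of this sketch rather than a detail confirmed by the summary.

```python
from typing import Callable, List, Tuple

Utterance = str                  # stand-in for an audio clip
Example = Tuple[Utterance, str]  # (audio, transcript)
Model = Callable[[Utterance], str]


def toy_train(examples: List[Example]) -> Model:
    """Hypothetical trainer: memorises a frame-token -> word table."""
    table = {}
    for audio, text in examples:
        for token, word in zip(audio.split(), text.split()):
            table[token] = word

    def transcribe(audio: Utterance) -> str:
        # Unknown tokens are skipped; a real model would still emit output.
        return " ".join(table[t] for t in audio.split() if t in table)

    return transcribe


def pseudo_label(model: Model, unlabeled: List[Utterance]) -> List[Example]:
    """Pair each unlabeled utterance with the model's own transcription."""
    return [(audio, model(audio)) for audio in unlabeled]


def self_train(
    train: Callable[[List[Example]], Model],
    labeled: List[Example],
    unlabeled: List[Utterance],
) -> Tuple[Model, List[Example]]:
    teacher = train(labeled)                   # 1. supervised training
    pseudo = pseudo_label(teacher, unlabeled)  # 2. label the unlabeled audio
    student = train(labeled + pseudo)          # 3. retrain with pseudo-labels
    return student, pseudo


labeled = [("a1 b2", "hello world")]
unlabeled = ["b2 a1", "a1"]
student, pseudo = self_train(toy_train, labeled, unlabeled)
```

In a real system the `train` callable would wrap the full acoustic-model training run, and the transcription step would typically use beam-search decoding.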

The comparison of architectures and loss functions uses a shared set of 10k word pieces generated with the SentencePiece toolkit, so every model predicts the same output vocabulary. ResNet models, originally developed for computer vision, are adapted to speech with 1-D convolutions and dropout regularization; their residual connections ease the vanishing-gradient problem in deep networks. TDS convolution models gain capacity by increasing the channel count, while Transformer models employ a hierarchical structure with attention mechanisms to learn representations over long time spans.
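To illustrate how a CTC-trained model's frame-level predictions over word pieces become text, the sketch below applies the standard CTC collapsing rule (merge consecutive repeats, then drop blanks) and joins SentencePiece-style pieces, which mark the start of a word with the character U+2581. The blank symbol name `<blank>` and the example pieces are hypothetical; the paper's actual decoding uses more than this greedy rule.

```python
from itertools import groupby
from typing import List

BLANK = "<blank>"       # hypothetical name for the CTC blank symbol
WORD_START = "\u2581"   # SentencePiece word-boundary marker


def ctc_collapse(frame_tokens: List[str]) -> List[str]:
    """Merge consecutive repeated tokens, then remove blanks."""
    merged = [token for token, _ in groupby(frame_tokens)]
    return [token for token in merged if token != BLANK]


def pieces_to_text(pieces: List[str]) -> str:
    """Join word pieces and turn word-boundary markers into spaces."""
    return "".join(pieces).replace(WORD_START, " ").strip()


# Per-frame best tokens from a hypothetical CTC model.
frames = [
    WORD_START + "he", WORD_START + "he", BLANK,
    "llo", "llo", BLANK, BLANK,
    WORD_START + "wor", "ld", "ld",
]
pieces = ctc_collapse(frames)
text = pieces_to_text(pieces)
```

The blank symbol is what lets CTC emit the same piece twice in a row: `["a", BLANK, "a"]` collapses to two pieces, while `["a", "a"]` collapses to one.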

Implications and Future Directions

The research demonstrates the feasibility and benefits of semi-supervised learning for end-to-end ASR systems. The finding that acoustic models trained on more audio rely less on external language models suggests a possible evolution in ASR architectures, where acoustic modeling and language modeling could converge more closely, potentially through joint training methods.

Future research could refine pseudo-label generation through improved beam-search strategies and by distinguishing between the contributions of the language model (LM) and the acoustic model (AM). New model architectures, or hybrid approaches that combine the strengths of several architectures in a semi-supervised setting, could also push ASR performance further.

Taken together, the study improves the performance of current ASR systems and offers a practical approach to using unlabeled data in a field known for data-intensive training.
