Improved training of end-to-end attention models for speech recognition

Published 8 May 2018 in cs.CL, cs.LG, and stat.ML | (1805.03294v1)

Abstract: Sequence-to-sequence attention-based models on subword units allow simple open-vocabulary end-to-end speech recognition. In this work, we show that such models can achieve competitive results on the Switchboard 300h and LibriSpeech 1000h tasks. In particular, we report the state-of-the-art word error rates (WER) of 3.54% on the dev-clean and 3.82% on the test-clean evaluation subsets of LibriSpeech. We introduce a new pretraining scheme by starting with a high time reduction factor and lowering it during training, which is crucial both for convergence and final performance. In some experiments, we also use an auxiliary CTC loss function to help the convergence. In addition, we train long short-term memory (LSTM) language models on subword units. By shallow fusion, we report up to 27% relative improvements in WER over the attention baseline without a language model.


Authors (4)

Citations (266)


Summary

The paper presents a novel pretraining scheme that progressively decreases the time reduction factor to ensure robust model convergence.
It leverages subword units via BPE and incorporates an auxiliary CTC loss to enhance open-vocabulary recognition and training stability.
The approach attains competitive WERs on Switchboard and LibriSpeech, with up to 27% relative WER improvement over the attention baseline when an external LSTM language model is added via shallow fusion.

Improved Training of End-to-End Attention Models for Speech Recognition

The paper "Improved training of end-to-end attention models for speech recognition" applies sequence-to-sequence attention models on subword units to automatic speech recognition (ASR). It shows that such models reach competitive performance on two benchmark tasks, Switchboard 300h and LibriSpeech 1000h, with the help of a new pretraining scheme and external language models.

Overview of the Approach

The authors focus on attention-based encoder-decoder models, which facilitate end-to-end training and remove the dependency on complex components like pronunciation lexicons, prevalent in hybrid HMM/NN systems. Through the use of subword units generated via byte-pair encoding (BPE), these models achieve open-vocabulary recognition, enabling the identification of words outside the training corpus.
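To make the idea of BPE subword units concrete, here is a minimal sketch of how merge operations can be learned and then applied to an unseen word. It illustrates the general BPE procedure and is not the authors' tooling; the function names (`learn_bpe`, `merge_pair`, `segment`), the `</w>` end-of-word marker, the tie-breaking rule, and the toy word counts are all assumptions made for this example.

```python
from collections import Counter


def merge_pair(symbols, pair):
    """Replace every adjacent occurrence of `pair` in `symbols` by one merged symbol."""
    merged, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            merged.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            merged.append(symbols[i])
            i += 1
    return tuple(merged)


def learn_bpe(word_counts, num_merges):
    """Learn up to `num_merges` merge operations from a {word: count} dictionary."""
    # Start from single characters plus a hypothetical end-of-word marker.
    vocab = {tuple(word) + ("</w>",): count for word, count in word_counts.items()}
    merges = []
    for _ in range(num_merges):
        # Count how often each adjacent symbol pair occurs, weighted by word count.
        pair_counts = Counter()
        for symbols, count in vocab.items():
            for pair in zip(symbols, symbols[1:]):
                pair_counts[pair] += count
        if not pair_counts:
            break
        # Most frequent pair wins; ties are broken deterministically (an assumption).
        best = max(pair_counts, key=lambda p: (pair_counts[p], p))
        merges.append(best)
        vocab = {merge_pair(symbols, best): count for symbols, count in vocab.items()}
    return merges


def segment(word, merges):
    """Split a word into subword units by replaying the merges in learned order."""
    symbols = tuple(word) + ("</w>",)
    for pair in merges:
        symbols = merge_pair(symbols, pair)
    return list(symbols)


# Toy corpus statistics, invented for illustration only.
toy_counts = {"low": 5, "lower": 2, "newest": 6, "widest": 3}
toy_merges = learn_bpe(toy_counts, num_merges=10)
# "lowest" never occurs in the toy corpus, yet it can still be segmented.
unseen_units = segment("lowest", toy_merges)
print(unseen_units)
```

Because every word decomposes into units from a fixed inventory (in the worst case, single characters), a model that emits these units can in principle spell out words it never saw during training, which is what open-vocabulary recognition means here.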

The paper's main methodological contribution is a pretraining scheme that starts with a high time reduction factor (the factor by which the encoder downsamples its input along the time axis) and lowers it during training. The authors report that this is crucial both for convergence and for final performance. In some experiments they also add an auxiliary CTC loss to help convergence.
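The idea of lowering the time reduction factor during training can be sketched as a simple schedule plus a pooling step. This is a minimal illustration under stated assumptions, not the authors' implementation: the stage factors `(32, 16, 8)`, the value of `epochs_per_stage`, and the function names are hypothetical, and the pooling is applied directly to a list of feature frames rather than between encoder layers.

```python
def time_reduction_schedule(epoch, factors=(32, 16, 8), epochs_per_stage=5):
    """Return the time reduction factor for a 0-based epoch.

    The factor starts high and is lowered stage by stage; after the last
    stage it stays at the final value. All default values are assumptions.
    """
    stage = min(epoch // epochs_per_stage, len(factors) - 1)
    return factors[stage]


def max_pool_time(frames, factor):
    """Downsample a sequence of feature vectors by max-pooling over `factor` frames."""
    pooled = []
    for start in range(0, len(frames), factor):
        window = frames[start:start + factor]
        # Element-wise maximum over all frames in the window.
        pooled.append([max(values) for values in zip(*window)])
    return pooled


# 64 dummy frames with 2 features each, invented for illustration.
dummy_frames = [[float(t), -float(t)] for t in range(64)]
for epoch in (0, 5, 10):
    factor = time_reduction_schedule(epoch)
    reduced = max_pool_time(dummy_frames, factor)
    print(f"epoch {epoch}: factor {factor}, {len(dummy_frames)} frames -> {len(reduced)}")
```

With a high factor the attention mechanism sees only a few encoder states per utterance, which plausibly makes the early alignment problem easier; lowering the factor later restores temporal resolution.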

Key Results

The paper reports state-of-the-art word error rates (WERs) of 3.54% on the dev-clean subset and 3.82% on the test-clean subset of LibriSpeech. On the Switchboard 300h task, the attention models are competitive, particularly on the easier Switchboard portion of the evaluation set. Although they do not surpass conventional systems on every metric, shallow fusion with an external LSTM language model yields up to 27% relative WER improvement over the attention baseline without a language model.
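Shallow fusion combines the scores of the attention model and an external language model during decoding, typically as a weighted sum of log-probabilities. The sketch below shows one decoding step and how a relative WER improvement is computed. The function names, the weight `lm_weight=0.3`, and all probabilities and WER values are invented for illustration and are not taken from the paper.

```python
import math


def shallow_fusion_scores(att_log_probs, lm_log_probs, lm_weight=0.3):
    """Fuse per-token scores as log p_att + lm_weight * log p_lm."""
    return {
        token: att_log_probs[token] + lm_weight * lm_log_probs[token]
        for token in att_log_probs
    }


def relative_improvement(baseline_wer, new_wer):
    """Relative WER reduction in percent, measured against the baseline."""
    return 100.0 * (baseline_wer - new_wer) / baseline_wer


# Hypothetical next-token distributions for two acoustically similar candidates.
att_scores = {"their": math.log(0.45), "there": math.log(0.55)}
lm_scores = {"their": math.log(0.90), "there": math.log(0.10)}

fused_scores = shallow_fusion_scores(att_scores, lm_scores)
best_without_lm = max(att_scores, key=att_scores.get)
best_with_lm = max(fused_scores, key=fused_scores.get)
print(best_without_lm, best_with_lm)

# Made-up WERs: going from 20.0% to 15.0% is a 25% relative improvement.
example_gain = relative_improvement(20.0, 15.0)
print(example_gain)
```

In a full decoder this fused score would be accumulated per hypothesis inside beam search; the sketch only shows the per-step combination. Note that a relative improvement is measured against the baseline WER, so it is larger than the absolute difference in percentage points.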

Implications and Future Directions

Practically, these findings suggest that with the proposed training techniques and the integration of external language models, attention-based models can achieve results comparable to more complex, multi-component systems. Theoretically, the work underscores the potential of subword units in ASR and points to further gains from better pretraining and language model integration strategies.

For future research, it would be worthwhile to reduce the computational cost of attention-based decoding, especially for long input sequences such as speech. Mechanisms that enforce monotonic alignments without degrading recognition accuracy could also be a significant step forward.

Overall, this paper contributes a novel approach to improving end-to-end speech recognition, providing a solid foundation for both practical application and further theoretical exploration in the field of ASR.
