A Comprehensive Study of Deep Bidirectional LSTM RNNs for Acoustic Modeling in Speech Recognition

Published 22 Jun 2016 in cs.NE, cs.CL, cs.LG, and cs.SD | (1606.06871v2)

Abstract: We present a comprehensive study of deep bidirectional long short-term memory (LSTM) recurrent neural network (RNN) based acoustic models for automatic speech recognition (ASR). We study the effect of size and depth and train models of up to 8 layers. We investigate the training aspect and study different variants of optimization methods, batching, truncated backpropagation, different regularization techniques such as dropout and $L_2$ regularization, and different gradient clipping variants. The major part of the experimental analysis was performed on the Quaero corpus. Additional experiments also were performed on the Switchboard corpus. Our best LSTM model has a relative improvement in word error rate of over 14\% compared to our best feed-forward neural network (FFNN) baseline on the Quaero task. On this task, we get our best result with an 8 layer bidirectional LSTM and we show that a pretraining scheme with layer-wise construction helps for deep LSTMs. Finally we compare the training calculation time of many of the presented experiments in relation with recognition performance. All the experiments were done with RETURNN, the RWTH extensible training framework for universal recurrent neural networks in combination with RASR, the RWTH ASR toolkit.

Authors (5)

Summary

The paper presents a comprehensive study of deep bidirectional LSTM RNNs for acoustic modeling in speech recognition, showing a relative WER reduction of over 14% compared to the best feed-forward neural network (FFNN) baseline on the Quaero task.
The study investigates various training strategies, including optimization algorithms, layer depth (finding optima at 4-6 layers or deeper with pretraining), and regularization techniques for deep BLSTMs.
A layer-wise pretraining scheme enables effective training of deeper BLSTM networks (more than 6 layers). On the Switchboard corpus, the best reported configuration, an associative LSTM variant, reached 16.3% WER.

Overview of Deep Bidirectional LSTM RNNs for Acoustic Modeling in Speech Recognition

This paper investigates how to train and use deep bidirectional long short-term memory (BLSTM) recurrent neural networks for acoustic modeling in automatic speech recognition (ASR). Based on extensive experiments on the Quaero and Switchboard corpora, the authors report which architecture sizes, optimization methods, and regularization settings work well for deep BLSTMs.

Key Findings and Methodology

The study shows that BLSTM networks achieve lower word error rates (WER) than feed-forward neural networks (FFNNs). With deep networks of up to 8 layers, the authors achieved a relative WER improvement of over 14% compared to their best FFNN baseline on the Quaero task. The paper compares several optimization methods, including Adam, MNSGD, and RMSprop, and examines the effects of truncated backpropagation, different batching configurations, and regularization techniques such as dropout and L2 regularization.
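One common way to realize truncated backpropagation together with batching is to cut each utterance into fixed-size, possibly overlapping chunks and to backpropagate only within a chunk. The following is a minimal sketch of that idea, not the paper's implementation: the function names `chunk_sequence` and `make_batches` and the values used in the usage line (chunk size 50, step 25) are assumptions chosen for illustration.

```python
from typing import List, Tuple


def chunk_sequence(num_frames: int, chunk_size: int, chunk_step: int) -> List[Tuple[int, int]]:
    """Return (start, end) frame ranges that cover a sequence of num_frames frames.

    Consecutive chunks overlap whenever chunk_step < chunk_size.
    """
    if chunk_size <= 0 or chunk_step <= 0:
        raise ValueError("chunk_size and chunk_step must be positive")
    chunks = []
    start = 0
    while start < num_frames:
        end = min(start + chunk_size, num_frames)
        chunks.append((start, end))
        if end == num_frames:
            break  # the sequence is fully covered
        start += chunk_step
    return chunks


def make_batches(seq_lengths: List[int], chunk_size: int, chunk_step: int,
                 max_chunks_per_batch: int) -> List[List[Tuple[int, int, int]]]:
    """Group (sequence index, start, end) chunks into batches of bounded size."""
    all_chunks = [
        (seq_idx, start, end)
        for seq_idx, n in enumerate(seq_lengths)
        for start, end in chunk_sequence(n, chunk_size, chunk_step)
    ]
    return [
        all_chunks[i:i + max_chunks_per_batch]
        for i in range(0, len(all_chunks), max_chunks_per_batch)
    ]


# Hypothetical usage: a 120-frame utterance, chunks of 50 frames, step of 25.
print(chunk_sequence(120, 50, 25))
```

Smaller chunks make batches more uniform and training faster, at the cost of less temporal context per gradient step; this trade-off is the kind of setting the batching and truncation experiments explore.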

Detailed comparisons between unidirectional and bidirectional LSTMs highlight the latter’s enhanced performance. The paper introduces a novel pretraining scheme for layer-wise construction, yielding substantial improvements for networks with greater depth. Experiments corroborate the efficacy of pretraining, especially for networks exceeding 6 layers, enabling deeper network architectures that were previously unmanageable due to increased training complexity.
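The layer-wise construction idea can be sketched as a schedule that starts with a shallow network and grows it during the first epochs, reusing the layers trained so far. This summary does not give the exact schedule used in the paper, so the sketch below assumes one new layer per pretraining stage and appends new layers at the end of the stack; `layerwise_schedule`, `grow_network`, and `init_layer` are hypothetical names.

```python
from typing import Callable, List


def layerwise_schedule(target_layers: int, start_layers: int = 1,
                       epochs_per_stage: int = 1) -> List[int]:
    """Return the network depth to use in each pretraining epoch."""
    if not 1 <= start_layers <= target_layers:
        raise ValueError("need 1 <= start_layers <= target_layers")
    if epochs_per_stage < 1:
        raise ValueError("epochs_per_stage must be at least 1")
    schedule = []
    for depth in range(start_layers, target_layers + 1):
        schedule.extend([depth] * epochs_per_stage)
    return schedule


def grow_network(layers: List[object], new_depth: int,
                 init_layer: Callable[[int], object]) -> List[object]:
    """Keep the already trained layers and append freshly initialized ones.

    init_layer(i) is a placeholder for whatever creates layer i (assumption).
    """
    grown = list(layers)
    while len(grown) < new_depth:
        grown.append(init_layer(len(grown)))
    return grown


# Hypothetical usage: grow towards an 8-layer network, one layer per epoch.
print(layerwise_schedule(8))
```

After the last stage, training continues with the full-depth network as usual; only the first few epochs differ from training the deep network directly.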

Numerical Results and Experiments

The experiments point to the following configuration choices:

Number of Layers: Without pretraining, the optimum for the BLSTM networks lay between 4 and 6 layers on the given corpus. With the layer-wise pretraining scheme, deeper networks trained successfully, and the best Quaero result was obtained with an 8-layer BLSTM.
Layer Size: A hidden layer size of 500 provided a balanced trade-off between model complexity and performance, though larger sizes up to 700 improved WER marginally.
Optimization: Adam optimization emerged as a consistently reliable choice, benefiting from learning rate scheduling such as Newbob.
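Newbob-style scheduling lowers the learning rate once the cross-validation error stops improving by a meaningful margin. The sketch below is a simplified variant written for illustration, not the scheduler used in the paper: the threshold (1% relative improvement), the decay factor (0.5), the comparison against the previous epoch only, and the function names are all assumptions.

```python
from typing import List


def newbob_step(learning_rate: float, prev_error: float, curr_error: float,
                threshold: float = 0.01, decay: float = 0.5) -> float:
    """Return the next learning rate given two consecutive validation errors.

    The rate is multiplied by `decay` when the relative improvement
    falls below `threshold`; otherwise it is left unchanged.
    """
    if prev_error <= 0:
        raise ValueError("prev_error must be positive")
    relative_improvement = (prev_error - curr_error) / prev_error
    if relative_improvement < threshold:
        return learning_rate * decay
    return learning_rate


def run_schedule(initial_lr: float, cv_errors: List[float],
                 threshold: float = 0.01, decay: float = 0.5) -> List[float]:
    """Return the learning rate chosen after each cross-validation measurement."""
    lr = initial_lr
    history = []
    prev = None
    for err in cv_errors:
        if prev is not None:
            lr = newbob_step(lr, prev, err, threshold, decay)
        history.append(lr)
        prev = err
    return history


# Hypothetical validation errors: the rate halves whenever progress stalls.
print(run_schedule(0.001, [2.0, 1.5, 1.495, 1.2, 1.199]))
```

Because the rule reacts to the measured validation error rather than to a fixed epoch count, it combines naturally with adaptive optimizers such as Adam, which set per-parameter step sizes but still depend on a global learning rate.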

The paper also reports results on the Switchboard corpus: a BLSTM model achieved 16.7% total WER, and an associative LSTM variant improved this further to 16.3% WER.
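Because the results mix absolute WER values with relative improvements, it helps to make the conversion explicit. The helper below is a small illustrative sketch (the function name is an assumption) that computes the relative improvement implied by the two Switchboard numbers above.

```python
def relative_wer_improvement(baseline_wer: float, new_wer: float) -> float:
    """Return the relative WER reduction as a fraction of the baseline WER."""
    if baseline_wer <= 0:
        raise ValueError("baseline_wer must be positive")
    return (baseline_wer - new_wer) / baseline_wer


# Switchboard numbers from the text: 16.7% (BLSTM) vs. 16.3% (associative LSTM).
rel = relative_wer_improvement(16.7, 16.3)
print(f"{100 * rel:.1f}% relative")  # prints "2.4% relative"
```

The 0.4 points absolute gain on Switchboard therefore corresponds to roughly 2.4% relative, which is much smaller than the over 14% relative gain of the best LSTM over the FFNN baseline on Quaero.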

Practical and Theoretical Implications

The findings are directly relevant to building robust ASR systems. The systematic exploration of BLSTM configurations can guide both future research and practical applications in speech technology. The pretraining scheme, shown to help deeper architectures, may be refined further and adapted to other sequence processing tasks, including those in NLP and related fields.

Future Directions

Future research could explore the integration of associative memory components within LSTMs, as preliminary findings suggested promising improvements. Furthermore, expanding investigations into various regularization techniques and optimization algorithms could unlock additional performance gains. The public accessibility of the training configurations offers a valuable resource for continued exploration and replication by the research community.

In summary, this paper makes significant contributions to understanding and optimizing deep BLSTM networks for acoustic modeling, providing a detailed account of the interdependencies and effects of various training strategies and configurations that elevate performance in real-world ASR tasks.
