Speech Recognition with Deep Recurrent Neural Networks

Published 22 Mar 2013 in cs.NE and cs.CL | (1303.5778v1)

Abstract: Recurrent neural networks (RNNs) are a powerful model for sequential data. End-to-end training methods such as Connectionist Temporal Classification make it possible to train RNNs for sequence labelling problems where the input-output alignment is unknown. The combination of these methods with the Long Short-term Memory RNN architecture has proved particularly fruitful, delivering state-of-the-art results in cursive handwriting recognition. However, RNN performance in speech recognition has so far been disappointing, with better results returned by deep feedforward networks. This paper investigates deep recurrent neural networks, which combine the multiple levels of representation that have proved so effective in deep networks with the flexible use of long range context that empowers RNNs. When trained end-to-end with suitable regularisation, we find that deep Long Short-term Memory RNNs achieve a test set error of 17.7% on the TIMIT phoneme recognition benchmark, which to our knowledge is the best recorded score.



Citations (8,330)


Summary

The paper introduces deep bidirectional LSTM RNN architectures that significantly lower phoneme error rates on the TIMIT benchmark.
It demonstrates that a three-layer bidirectional LSTM RNN Transducer, initialised from a network pretrained with Connectionist Temporal Classification, achieves a record-low 17.7% error rate.
The study underscores the importance of deep, end-to-end trained networks in capturing long-range temporal dependencies for robust speech recognition.


The paper "Speech Recognition with Deep Recurrent Neural Networks" by Alex Graves, Abdel-rahman Mohamed, and Geoffrey Hinton applies deep recurrent neural networks (RNNs) to speech recognition. It asks whether deep RNNs, specifically Long Short-term Memory (LSTM) RNNs, can outperform other models on sequence labelling tasks involving speech data. The paper's main contribution is its analysis of how end-to-end training methods combine with deep network architectures to improve recognition accuracy.

Introduction

The authors note that neural networks have long been used together with hidden Markov models (HMMs) for speech recognition, and that deep feedforward networks have recently produced substantial advances in acoustic modelling. Because speech is an inherently dynamic process, RNNs look like a natural alternative: they handle sequential data directly and are deep in time. However, earlier attempts to use RNNs, especially HMM-RNN hybrids, have not consistently outperformed deep feedforward networks.

Methodology

The research tests whether deep bidirectional LSTM RNNs are suitable for speech recognition, using the TIMIT phoneme recognition benchmark for evaluation. The networks stack multiple recurrent hidden layers, so that they are deep in space (across layers) as well as in time, and can therefore combine long-range temporal context with multiple levels of representation. The architectures examined include both unidirectional and bidirectional LSTMs, trained with two end-to-end methods: Connectionist Temporal Classification (CTC) and the RNN Transducer. End-to-end training lets the networks learn directly from acoustic sequences and their phoneme transcriptions, without relying on a predefined alignment between the two.
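To make "without relying on a predefined alignment" concrete, the following is a minimal sketch of the collapsing rule that CTC uses to map a frame-level path to a label sequence: merge repeated symbols, then drop blanks. The function name `ctc_collapse` and the `"-"` blank symbol are illustrative assumptions, not taken from the paper.

```python
from itertools import groupby

def ctc_collapse(path, blank="-"):
    """Map a frame-level path to a label sequence (illustrative helper).

    Step 1: merge runs of identical consecutive symbols.
    Step 2: remove the blank symbol.
    """
    merged = [symbol for symbol, _ in groupby(path)]
    return [symbol for symbol in merged if symbol != blank]

# Several different frame-level paths collapse to the same labelling,
# which is why no single alignment has to be fixed before training.
paths = ["aa-ab-", "a-a--b", "-a-abb"]
labellings = [ctc_collapse(p) for p in paths]
```

All three paths above collapse to the labelling `a, a, b`; the blank between the two `a` symbols is what keeps the repeated label from being merged.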

Network Architectures and Training

The authors describe the LSTM cell, bidirectional RNNs, and the deep stacked LSTM architecture in detail. The equations show how the input vectors, the forward and backward hidden sequences, and the output sequence depend on one another. A bidirectional RNN processes the input in both directions with two separate hidden layers, so every output can draw on both past and future context from the entire acoustic sequence.
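The stacking idea can be sketched in a few lines of plain Python. This is a toy illustration under stated assumptions, not the authors' implementation: each direction has a single LSTM unit, the cell omits the peephole connections of the paper's LSTM variant, and `make_params` produces arbitrary fixed weights purely so the example runs. All function names are hypothetical.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def make_params(input_dim, scale):
    """Hypothetical fixed weights for one single-unit LSTM direction."""
    params = {}
    for k, gate in enumerate("ifog"):
        params["W" + gate] = [scale * 0.1 * (k + 1)] * input_dim  # input weights
        params["U" + gate] = scale * 0.05                          # recurrent weight
        params["b" + gate] = 0.0                                   # bias
    return params

def lstm_step(x, h, c, p):
    """One step of a single-unit LSTM cell (assumption: no peepholes)."""
    def pre(gate):
        weighted = sum(w * xi for w, xi in zip(p["W" + gate], x))
        return weighted + p["U" + gate] * h + p["b" + gate]
    i = sigmoid(pre("i"))      # input gate
    f = sigmoid(pre("f"))      # forget gate
    o = sigmoid(pre("o"))      # output gate
    g = math.tanh(pre("g"))    # candidate cell input
    c = f * c + i * g
    h = o * math.tanh(c)
    return h, c

def run_direction(seq, p, reverse=False):
    """Run the cell over the sequence, left-to-right or right-to-left."""
    h, c = 0.0, 0.0
    out = [0.0] * len(seq)
    order = reversed(range(len(seq))) if reverse else range(len(seq))
    for t in order:
        h, c = lstm_step(seq[t], h, c, p)
        out[t] = h
    return out

def bidirectional_layer(seq, p_fwd, p_bwd):
    """Pair the forward and backward hidden states at every time step."""
    fwd = run_direction(seq, p_fwd)
    bwd = run_direction(seq, p_bwd, reverse=True)
    return [[f, b] for f, b in zip(fwd, bwd)]

def deep_blstm(seq, num_layers):
    """Stack layers: each layer reads both directions of the layer below."""
    for _ in range(num_layers):
        dim = len(seq[0])  # assumes a non-empty sequence
        seq = bidirectional_layer(seq, make_params(dim, 1.0), make_params(dim, -1.0))
    return seq

frames = [[0.1, -0.2, 0.3], [0.0, 0.5, -0.1], [0.4, 0.4, 0.4], [-0.3, 0.2, 0.1]]
top = deep_blstm(frames, num_layers=3)  # one [forward, backward] pair per frame
```

The design point to notice is in `deep_blstm`: the forward and backward units of each layer both receive the concatenated forward and backward outputs of the layer below, which is how the stack combines depth across layers with context from both directions.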

They also examine two ways of defining the distribution over output sequences: CTC and the RNN Transducer. Both remove the need for a pre-segmented alignment between the acoustic input and the phonetic output by summing over all possible alignments. The study reports improvements when these methods are combined with deep LSTM architectures.
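As an illustration of how CTC turns per-frame network outputs into a probability for a whole label sequence, here is a minimal sketch of the standard CTC forward recursion, which sums the probabilities of all frame-level paths that collapse to the target labelling. The function name and argument layout are assumptions for this example; it works with plain probabilities for readability, whereas practical implementations typically use log space to avoid underflow.

```python
def ctc_label_probability(probs, labels, blank=0):
    """Probability of `labels` under CTC, summed over all alignments.

    probs[t][k] is the network's probability of symbol k at frame t
    (assumes at least one frame); labels is a list of non-blank symbol ids.
    """
    # Extended label sequence: blanks before, between, and after labels.
    ext = [blank]
    for label in labels:
        ext += [label, blank]
    size = len(ext)

    # A path may start with the leading blank or with the first label.
    alpha = [0.0] * size
    alpha[0] = probs[0][blank]
    if size > 1:
        alpha[1] = probs[0][ext[1]]

    for t in range(1, len(probs)):
        new_alpha = [0.0] * size
        for s in range(size):
            total = alpha[s]                 # stay on the same symbol
            if s >= 1:
                total += alpha[s - 1]        # advance by one position
            # Skip over a blank only between two different labels.
            if s >= 2 and ext[s] != blank and ext[s] != ext[s - 2]:
                total += alpha[s - 2]
            new_alpha[s] = total * probs[t][ext[s]]
        alpha = new_alpha

    # A path may end on the final label or on the trailing blank.
    return alpha[-1] + (alpha[-2] if size > 1 else 0.0)

# Three frames, three symbols (index 0 is the blank).
frame_probs = [[0.6, 0.3, 0.1], [0.2, 0.5, 0.3], [0.1, 0.2, 0.7]]
p_12 = ctc_label_probability(frame_probs, [1, 2])
```

Training with CTC maximises this quantity for the reference transcription, so the network is never told which frame belongs to which phoneme.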

Results

Results on the TIMIT dataset show clear differences in performance across network configurations. Key findings:

Increasing the number of hidden layers from one to five steadily lowers the phoneme error rate (PER); the best CTC result, 18.4%, comes from a five-layer bidirectional LSTM.
LSTM cells are considerably more effective than $\tanh$ units, and bidirectional networks perform slightly better than unidirectional ones.
Pretraining with CTC before applying the transducer method reduces the error further, giving the best reported error rate of 17.7%.
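For readers unfamiliar with the metric, phoneme error rate is conventionally computed as the edit distance (substitutions, deletions, and insertions) between recognised and reference phoneme sequences, divided by the total reference length. The sketch below shows that general definition; the function names and the example phoneme strings are made up for illustration, and any label folding applied in the paper's scoring is not reproduced here.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two symbol sequences."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        cur = [i]
        for j, h in enumerate(hyp, start=1):
            cur.append(min(
                prev[j] + 1,             # deletion of a reference symbol
                cur[j - 1] + 1,          # insertion of a hypothesis symbol
                prev[j - 1] + (r != h),  # match or substitution
            ))
        prev = cur
    return prev[-1]

def phoneme_error_rate(references, hypotheses):
    """Total edit errors divided by total reference phonemes."""
    errors = sum(edit_distance(r, h) for r, h in zip(references, hypotheses))
    total = sum(len(r) for r in references)
    return errors / total

# Hypothetical utterance: one substitution in seven reference phonemes.
reference = "sil dh ax k ae t sil".split()
hypothesis = "sil dh ax k eh t sil".split()
per = phoneme_error_rate([reference], [hypothesis])  # 1/7, about 14.3%
```

Because errors are pooled over the whole test set before dividing, long utterances contribute proportionally more to the final figure than short ones.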

Discussion

The empirical results show the benefit of adding depth to LSTM RNN architectures, consistent with the view that deeper networks build progressively higher-level representations. On TIMIT, the deep bidirectional LSTM networks outperform both conventional RNNs and the previous state-of-the-art deep networks.

Conclusions and Future Work

The paper concludes that deep, bidirectional LSTM RNNs trained end-to-end deliver state-of-the-art performance for phoneme recognition. The findings motivate extending these methods to large-vocabulary speech recognition. Another proposed direction is to combine frequency-domain convolutional neural networks with deep LSTM.

Overall, the paper shows that depth and end-to-end training substantially improve LSTM performance in speech recognition, and it provides a foundation for later work on neural network-based acoustic modelling.
