End-to-end Continuous Speech Recognition using Attention-based Recurrent NN: First Results

Published 4 Dec 2014 in cs.NE, cs.LG, and stat.ML | (1412.1602v1)

Abstract: We replace the Hidden Markov Model (HMM) which is traditionally used in in continuous speech recognition with a bi-directional recurrent neural network encoder coupled to a recurrent neural network decoder that directly emits a stream of phonemes. The alignment between the input and output sequences is established using an attention mechanism: the decoder emits each symbol based on a context created with a subset of input symbols elected by the attention mechanism. We report initial results demonstrating that this new approach achieves phoneme error rates that are comparable to the state-of-the-art HMM-based decoders, on the TIMIT dataset.

Abstract PDF Upgrade to Chat

Citations (462)

View on Semantic Scholar

Summary

The paper presents an end-to-end approach that replaces HMMs with a bidirectional RNN integrated with an attention mechanism.
The paper achieves an 18.57% phoneme error rate on TIMIT, demonstrating the model's competitive performance using a streamlined architecture.
The paper implies a simplified deployment with reduced complexity, paving the way for scalable and real-time speech recognition systems.

End-to-End Continuous Speech Recognition Using Attention-Based Recurrent Neural Networks

This paper presents an investigation into replacing the traditional Hidden Markov Model (HMM) architecture in continuous speech recognition systems with a novel bidirectional recurrent neural network (RNN) framework. This framework employs an encoder-decoder architecture integrated with an attention mechanism, enabling end-to-end training without explicit input-output sequence alignment. The research underscores the capability of attention mechanisms in sequence processing tasks.

Methodology

The proposed model comprises three core components:

Encoder: Leveraging a bidirectional RNN to process input acoustic frames to extract features.
Attention Mechanism: Responsible for aligning and mapping parts of the input sequence to elements of the output sequence, reminiscent of neural machine translation techniques.
Decoder: A recurrent neural network that generates the sequence of output phonemes iteratively, using context vectors derived from the attention mechanism.

The attention mechanism plays a pivotal role by dynamically focusing on relevant portions of the input sequence, guided by a trainable scoring function. This is enhanced through gating functions and penalties to encourage monotonic alignments—a crucial feature for speech data where phonetic sequences exhibit temporal order.

Results

The model achieves a phoneme error rate of 18.57% on the TIMIT dataset, which is competitive with existing state-of-the-art HMM-DNN systems. Importantly, the model shows resilience to decoding inaccuracies even when employing narrow beam or greedy search techniques—a notable departure from the intricacies and dependencies of traditional methods that require broader search paradigms.

Implications and Future Directions

The implications of this research are multifold:

Theoretical Advancement: Demonstrates the viability of RNNs with attention mechanisms to bypass traditional per-frame prediction hassles, directly predicting phoneme sequences.
Practical Deployment: Simplifies application architecture by eliminating intricate components such as explicit HMM alignments, reducing the complexity and potentially lowering computational costs in real-time speech systems.
Future Prospects: Suggests potential extensions of this method to large vocabulary continuous speech recognition, leveraging a hierarchical RNN model that seamlessly transitions from phoneme to word-level recognition. This may simplify the modeling of word sequences while maintaining state-of-the-art accuracy, potentially reducing latency issues in speech applications.

This paper serves as a foundation for further exploration into RNN-based speech recognition, inspiring avenues for more seamless integration of deep learning techniques into traditionally statistical domains. As computational and data resources continue to grow, the fidelity and applicability of such models are expected to increase significantly.

Markdown Report Issue