First-Pass Large Vocabulary Continuous Speech Recognition using Bi-Directional Recurrent DNNs (1408.2873v2)

Published 12 Aug 2014 in cs.CL, cs.LG, and cs.NE

Abstract: We present a method to perform first-pass large vocabulary continuous speech recognition using only a neural network and language model. Deep neural network acoustic models are now commonplace in HMM-based speech recognition systems, but building such systems is a complex, domain-specific task. Recent work demonstrated the feasibility of discarding the HMM sequence modeling framework by directly predicting transcript text from audio. This paper extends this approach in two ways. First, we demonstrate that a straightforward recurrent neural network architecture can achieve a high level of accuracy. Second, we propose and evaluate a modified prefix-search decoding algorithm. This approach to decoding enables first-pass speech recognition with a language model, completely unaided by the cumbersome infrastructure of HMM-based systems. Experiments on the Wall Street Journal corpus demonstrate fairly competitive word error rates, and the importance of bi-directional network recurrence.

Summary

The paper presents a novel first-pass LVCSR system that integrates a language model directly through BRDNNs and a modified prefix-search decoding algorithm.
The study demonstrates that bi-directional recurrence significantly improves word and character error rates compared to traditional DNN and single-direction RNN models.
The approach simplifies the speech recognition pipeline by removing HMM dependencies, paving the way for more efficient and accessible LVCSR systems.

First-Pass Large Vocabulary Continuous Speech Recognition Using Bi-Directional Recurrent DNNs

This paper presents an approach to large vocabulary continuous speech recognition (LVCSR) based on bi-directional recurrent deep neural networks (BRDNNs) that removes the dependency on hidden Markov models (HMMs). Deep neural network (DNN) acoustic models are widely used in conjunction with HMMs, but building such systems is a complex, domain-specific task. The authors instead pair a straightforward recurrent neural network architecture with a modified prefix-search decoding algorithm. The result is first-pass speech recognition that integrates a language model directly during decoding, without the infrastructure of an HMM-based system.

Recent work using connectionist temporal classification (CTC) has shown that a network can map audio input directly to transcript characters, without the HMM sequence modeling framework. This paper extends that line of work by showing that a straightforward recurrent neural network can reach a high level of accuracy on this task. With bi-directional recurrence, the network carries state both forwards and backwards in time, so the output at each frame can draw on the entire input sequence rather than only on past frames.
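To make the idea of bi-directional recurrence concrete, here is a minimal pure-Python sketch of one such layer. It is an illustration under stated assumptions, not the authors' implementation: the function names, the clipped rectifier with a cap of 20, the sharing of the input weights and bias between the two directions, and the choice to sum the forward and backward states are all assumptions of this sketch that the summary above does not confirm.

```python
from typing import List

Vector = List[float]
Matrix = List[List[float]]


def clipped_relu(x: float, cap: float = 20.0) -> float:
    """Rectifier clipped at `cap` (the cap value is an assumption of this sketch)."""
    return min(max(x, 0.0), cap)


def matvec(W: Matrix, v: Vector) -> Vector:
    """Multiply matrix W (list of rows) by vector v."""
    return [sum(w * x for w, x in zip(row, v)) for row in W]


def bidirectional_layer(inputs: List[Vector], W_in: Matrix, W_fwd: Matrix,
                        W_bwd: Matrix, bias: Vector) -> List[Vector]:
    """Run a forward and a backward recurrence over `inputs` and sum them per frame."""
    steps = len(inputs)
    size = len(bias)

    def run(order, W_rec: Matrix) -> List[Vector]:
        state = [0.0] * size          # recurrent state starts at zero
        states = [None] * steps
        for t in order:
            a = matvec(W_in, inputs[t])
            r = matvec(W_rec, state)  # contribution of the neighbouring frame
            state = [clipped_relu(a[i] + r[i] + bias[i]) for i in range(size)]
            states[t] = state
        return states

    fwd = run(range(steps), W_fwd)            # left to right: sees the past
    bwd = run(reversed(range(steps)), W_bwd)  # right to left: sees the future
    return [[f + b for f, b in zip(fwd[t], bwd[t])] for t in range(steps)]


# A one-unit layer where the only non-zero input arrives at the last frame.
one = [[1.0]]
late_event = [[0.0], [0.0], [1.0]]
out = bidirectional_layer(late_event, one, one, one, [0.0])
print(out)  # [[1.0], [1.0], [2.0]]
```

In this toy input the first two frames are zero, so a forward-only recurrence would output zero for them. The backward pass is what lets the output at frame 0 reflect an event at frame 2.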

The decoding algorithm is a central component of the approach. It combines the output of a CTC-trained neural network with a language model, searching for the best transcription directly rather than rescoring word lattices pre-generated by an HMM system. Experiments on the Wall Street Journal corpus show competitive word error rates (WER) and underscore the importance of bi-directional recurrence.
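The general shape of a CTC prefix beam search with a language model can be sketched as follows. This is a simplified illustration, not the paper's exact algorithm, whose specific modifications are not described in this summary. Several things here are assumptions of the sketch: the `lm(history, word)` callable, the `alpha` (language model weight) and `beta` (word count bonus) parameters, the rule of scoring a word only when a space is emitted, the rule of never emitting a space that would create an empty word, and the use of plain probabilities instead of log probabilities (which would underflow on real utterances).

```python
from collections import defaultdict
from typing import Callable, Optional, Sequence, Tuple

# Hypothetical interface: lm(previous_words, word) -> probability of `word`.
LanguageModel = Callable[[Tuple[str, ...], str], float]


def prefix_beam_search(probs: Sequence[Sequence[float]], alphabet: Sequence[str],
                       lm: Optional[LanguageModel] = None, beam_width: int = 8,
                       alpha: float = 1.0, beta: float = 0.0) -> str:
    """probs[t][0] is the blank probability; probs[t][k] belongs to alphabet[k - 1]."""

    def score(item) -> float:
        prefix, (p_blank, p_non_blank) = item
        return (p_blank + p_non_blank) * (len(prefix.split()) + 1) ** beta

    # Each prefix tracks the probability of ending in blank and in non-blank.
    beams = {"": (1.0, 0.0)}
    for frame in probs:
        candidates = defaultdict(lambda: [0.0, 0.0])
        for prefix, (p_blank, p_non_blank) in beams.items():
            total = p_blank + p_non_blank
            candidates[prefix][0] += frame[0] * total  # blank keeps the prefix
            for k, ch in enumerate(alphabet, start=1):
                p = frame[k]
                if prefix and ch == prefix[-1]:
                    # A repeat without a blank in between collapses into the prefix.
                    candidates[prefix][1] += p * p_non_blank
                    extend = p * p_blank  # a true repeat needs a blank first
                else:
                    extend = p * total
                if ch == " ":
                    if not prefix or prefix.endswith(" "):
                        continue  # sketch simplification: no empty words
                    if lm is not None:
                        *history, word = prefix.split()
                        extend *= lm(tuple(history), word) ** alpha
                candidates[prefix + ch][1] += extend
        ranked = sorted(candidates.items(), key=score, reverse=True)
        beams = {prefix: (pb, pnb) for prefix, (pb, pnb) in ranked[:beam_width]}
    return max(beams.items(), key=score)[0]


# Two frames over the alphabet {a}: greedy decoding picks blank twice (""),
# but summing over alignments favours "a" (0.64 versus 0.36).
print(repr(prefix_beam_search([[0.6, 0.4], [0.6, 0.4]], ["a"])))  # 'a'
```

Because every alignment that collapses to the same prefix is summed into one entry, the search ranks transcriptions rather than individual frame-level paths. A word that is still open when the audio ends is not scored by the language model in this sketch.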

Key Results and Claims

Model Performance: First-pass decoding with a language model significantly reduced WER and character error rate (CER) compared to decoding without language model constraints. In particular, adding a bigram language model improved WER substantially.
Bi-Directional Recurrence: The BRDNN architecture showed improved CERs on both training and test sets over traditional DNN and single-direction RDNN models, underscoring the value of modeling temporal dependencies in both directions.
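For reference, WER and CER are both normalized edit distances: the number of substitutions, insertions, and deletions needed to turn the hypothesis into the reference, divided by the reference length in words or characters. The sketch below uses the standard Levenshtein recurrence. The function names are illustrative, and the example strings are made up rather than taken from the paper's experiments.

```python
from typing import Sequence


def edit_distance(ref: Sequence, hyp: Sequence) -> int:
    """Levenshtein distance between two sequences, using one rolling row."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        cur = [i]
        for j, h in enumerate(hyp, start=1):
            cur.append(min(prev[j] + 1,              # deletion
                           cur[j - 1] + 1,           # insertion
                           prev[j - 1] + (r != h)))  # substitution or match
        prev = cur
    return prev[-1]


def wer(reference: str, hypothesis: str) -> float:
    """Word error rate; assumes a non-empty reference."""
    ref_words = reference.split()
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


def cer(reference: str, hypothesis: str) -> float:
    """Character error rate (spaces count as characters); assumes a non-empty reference."""
    return edit_distance(reference, hypothesis) / len(reference)


print(wer("the cat sat", "the cat sit"))  # one wrong word out of three
print(cer("the cat sat", "the cat sit"))  # one wrong character out of eleven
```

A single wrong character makes a whole word count as an error in WER, which is one reason a language model constraint can move WER much more than CER.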

Implications and Future Directions

The proposed method presents practical and theoretical implications for speech recognition systems. Practically, it removes dependencies on HMM-based systems, simplifying implementation and potentially broadening access to advanced speech recognition technologies. Theoretically, it prompts further exploration into optimizing neural networks for sequence transduction tasks, particularly enhancing temporal modeling capabilities.

Future research could refine the decoding algorithm for efficiency and accuracy, explore alternative network architectures, and examine the trade-off between latency and performance in BRDNNs, particularly for online applications, where the backward recurrence requires future input. This work lays a foundation for LVCSR systems that integrate neural networks with language models directly.
