Advances in Joint CTC-Attention based End-to-End Speech Recognition with a Deep CNN Encoder and RNN-LM

Published 8 Jun 2017 in cs.CL | (1706.02737v1)

Abstract: We present a state-of-the-art end-to-end Automatic Speech Recognition (ASR) model. We learn to listen and write characters with a joint Connectionist Temporal Classification (CTC) and attention-based encoder-decoder network. The encoder is a deep Convolutional Neural Network (CNN) based on the VGG network. The CTC network sits on top of the encoder and is jointly trained with the attention-based decoder. During the beam search process, we combine the CTC predictions, the attention-based decoder predictions and a separately trained LSTM language model. We achieve a 5-10% error reduction compared to prior systems on spontaneous Japanese and Chinese speech, and our end-to-end model beats out traditional hybrid ASR systems.



Summary

The paper presents a joint CTC-attention decoding strategy that integrates alignment-free attention with sequential CTC for enhanced ASR performance.
It employs a deep CNN encoder to improve feature extraction from audio, resulting in more accurate input representations.
The paper adds an RNN-LM whose scores are combined with the CTC and attention predictions during beam search, reducing error rates on spontaneous Japanese and Mandarin Chinese speech.

Advances in Joint CTC-Attention Based End-to-End Speech Recognition

The paper, "Advances in Joint CTC-Attention based End-to-End Speech Recognition with a Deep CNN Encoder and RNN-LM," presents a refined approach to automatic speech recognition (ASR) that combines Connectionist Temporal Classification (CTC) with an attention-based encoder-decoder model in a single joint framework. The work continues the effort to simplify ASR architectures with end-to-end neural networks, replacing traditional multi-module ASR systems that rely heavily on linguistic expertise and complex probabilistic models.

Traditional ASR systems are often intricate assemblies of separate components, such as acoustic, lexicon, and language models, integrated through Weighted Finite-State Transducers (WFSTs). They typically include models like Hidden Markov Models (HMM), Gaussian Mixture Models (GMM), and Deep Neural Networks (DNN), all of which make system development labor-intensive, especially for less widely researched languages. In contrast, end-to-end ASR models, like those examined in this paper, aim to replace these modules with a single neural network, reducing reliance on linguistic preprocessing.

The paper reports experiments on an end-to-end ASR framework that combines CTC with an attention mechanism. The combination pairs the flexible, alignment-free modeling of attention with the monotonic, left-to-right alignment that CTC enforces, which helps avoid the misalignments that attention alone can produce. The authors extend prior models with three key enhancements:

Joint Decoding: The paper details methods combining CTC and attention scores during decoding, presenting two techniques for integrating CTC probabilities with attention: a rescoring method and a one-pass decoding process. These methods assist in finding accurate alignments during inference by incorporating additional alignment information from CTC.
Deep CNN Encoder: A VGG-like Convolutional Neural Network (CNN) is placed at the front of the encoder to strengthen feature extraction from the audio input. This gives the BLSTM layers that follow a more accurate representation to model as a sequence.
RNN-LM Integration: The model further benefits from an additional Recurrent Neural Network Language Model (RNN-LM), which is trained either separately or jointly with the decoder, enhancing sequence-level prediction accuracy.
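
The rescoring variant of joint decoding can be illustrated with a minimal sketch. It assumes a log-linear combination of the three scores, which is a common formulation but is not spelled out in this summary. The function names, the dictionary layout, the weights `ctc_weight` and `lm_weight`, and the log-probability values are all hypothetical and chosen only for illustration.

```python
def joint_score(att_logp, ctc_logp, lm_logp=0.0, ctc_weight=0.3, lm_weight=0.0):
    """Combine per-hypothesis log-probabilities into one score.

    Assumed form (illustrative, not taken from the paper's text):
        (1 - ctc_weight) * att + ctc_weight * ctc + lm_weight * lm
    """
    return (1.0 - ctc_weight) * att_logp + ctc_weight * ctc_logp + lm_weight * lm_logp


def rescore(hypotheses, ctc_weight=0.3, lm_weight=0.0):
    """Return the hypotheses sorted best-first by joint score.

    Each hypothesis is a dict with keys 'text', 'att', 'ctc' and, optionally, 'lm'.
    """
    return sorted(
        hypotheses,
        key=lambda h: joint_score(
            h["att"], h["ctc"], h.get("lm", 0.0), ctc_weight, lm_weight
        ),
        reverse=True,
    )


# Made-up n-best list: the second hypothesis repeats a word, which the
# attention score alone prefers but the CTC and LM scores penalize.
nbest = [
    {"text": "hello world", "att": -2.0, "ctc": -3.0, "lm": -4.0},
    {"text": "hello word word", "att": -1.5, "ctc": -9.0, "lm": -8.0},
]

attention_only = rescore(nbest, ctc_weight=0.0, lm_weight=0.0)[0]["text"]
joint = rescore(nbest, ctc_weight=0.3, lm_weight=0.3)[0]["text"]
print(attention_only)  # hello word word
print(joint)           # hello world
```

With attention scores alone the repeated-word hypothesis ranks first; once the CTC and language model scores are added, the ranking flips. One-pass decoding applies the same kind of combination to partial hypotheses inside the beam search rather than to a finished n-best list, which this sketch does not attempt to show.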

Evaluations on the Corpus of Spontaneous Japanese (CSJ) and the HKUST Mandarin Chinese task indicate that the joint CTC-attention approach improves substantially on attention-only models and also outperforms state-of-the-art hybrid systems. Specifically, it achieves notable reductions in Character Error Rate (CER) on both datasets. Multi-task learning and data augmentation techniques such as speed perturbation contribute to these outcomes and point to the model's robustness.
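
For readers unfamiliar with the metric, CER is conventionally the character-level edit distance between the reference and the hypothesis, divided by the reference length. The sketch below follows that standard definition; the function names and the example strings are illustrative, not taken from the paper or its evaluation scripts.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance: substitutions, insertions and deletions each cost 1."""
    prev = list(range(len(hyp) + 1))  # distances from the empty reference prefix
    for i, r in enumerate(ref, start=1):
        curr = [i]  # turning i reference characters into an empty hypothesis
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(
                min(
                    prev[j] + 1,         # deletion of a reference character
                    curr[j - 1] + 1,     # insertion of a hypothesis character
                    prev[j - 1] + cost,  # match or substitution
                )
            )
        prev = curr
    return prev[-1]


def cer(ref, hyp):
    """Character Error Rate: edit distance divided by reference length."""
    if len(ref) == 0:
        raise ValueError("reference must not be empty")
    return edit_distance(ref, hyp) / len(ref)


# One substitution and one deletion against a 5-character reference.
print(cer("今日は晴れ", "今日は雨"))  # 0.4
```

Character-level scoring suits Japanese and Mandarin because text in these languages is not segmented into words by whitespace, so a word error rate would depend on an external tokenizer.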

Overall, the paper demonstrates the potential of a streamlined ASR model architecture that is not only simpler to train and deploy but also capable of outperforming previous hybrid systems. By eliminating the need for explicit linguistic modeling components, this research paves the way for more accessible and adaptable ASR technologies. Future directions could focus on further optimization through leveraging additional unsupervised data for model pretraining and fine-tuning, enhancing generalization to more diverse linguistic scenarios.

The work exemplifies a significant step in ASR development, with practical implications for how speech technologies are deployed in varied linguistic and application settings, and presents pathways for future research that extend beyond current linguistic constraints and conventional model architectures.
