Two-Pass End-to-End Speech Recognition

Published 29 Aug 2019 in cs.CL, cs.SD, and eess.AS | (1908.10992v1)

Abstract: The requirements for many applications of state-of-the-art speech recognition systems include not only low word error rate (WER) but also low latency. Specifically, for many use-cases, the system must be able to decode utterances in a streaming fashion and faster than real-time. Recently, a streaming recurrent neural network transducer (RNN-T) end-to-end (E2E) model has shown to be a good candidate for on-device speech recognition, with improved WER and latency metrics compared to conventional on-device models [1]. However, this model still lags behind a large state-of-the-art conventional model in quality [2]. On the other hand, a non-streaming E2E Listen, Attend and Spell (LAS) model has shown comparable quality to large conventional models [3]. This work aims to bring the quality of an E2E streaming model closer to that of a conventional system by incorporating a LAS network as a second-pass component, while still abiding by latency constraints. Our proposed two-pass model achieves a 17%-22% relative reduction in WER compared to RNN-T alone and increases latency by a small fraction over RNN-T.


Authors (12)

Citations (144)


Summary

The paper demonstrates a two-pass architecture that uses a LAS decoder to refine RNN-T predictions, achieving a 17%-22% relative reduction in word error rate compared to RNN-T alone.
It employs a shared multi-layer LSTM encoder and explores both independent beam search and rescoring modes to optimize accuracy and computational costs.
Experimental results on real voice traffic highlight the model’s potential for on-device, real-time speech recognition with minimal latency increases.

A Technical Examination of Two-Pass End-to-End Speech Recognition

The paper "Two-Pass End-to-End Speech Recognition" by Tara N. Sainath and colleagues from Google, Inc., addresses speech recognition systems that must deliver both low word error rate (WER) and low latency, a combination that real-time applications depend on. The research introduces a two-pass architecture that adds a Listen, Attend and Spell (LAS) model as a second-pass component to an existing Recurrent Neural Network Transducer (RNN-T) setup. The goal is to narrow the quality gap between streaming E2E models and large conventional systems, which are more computationally demanding.

Model Architecture and Implementation

The proposed framework uses a single encoder, a multi-layer Long Short-Term Memory (LSTM) network, that is shared by both passes. In the first pass, a streaming RNN-T decoder processes acoustic frames to produce initial transcription predictions. In the second pass, a LAS decoder refines these preliminary predictions.

Two inference modes for the LAS decoder were investigated: 2nd beam search and rescoring. In 2nd beam search mode, the LAS decoder runs its own beam search over the encoder output and ignores the RNN-T decoder's output. In rescoring mode, the LAS decoder scores the top-K hypotheses produced by the RNN-T, attending to the encoder output, and the best-scoring hypothesis is selected. Both modes trade WER improvement against added computation, and in particular against the acceptable increase in latency.
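The rescoring mode can be illustrated with a minimal sketch. This is not the authors' implementation: `rescore`, `second_pass_logprob`, `top_k`, and `weight` are hypothetical names, and the second-pass scorer is just a callable standing in for a LAS decoder that assigns a log-probability to a fixed token sequence. The optional interpolation with the first-pass score is an assumption added for illustration; with the default `weight=1.0` the ranking uses the second-pass score alone.

```python
from typing import Callable, List, Sequence, Tuple

# (tokens, first-pass log-probability); a hypothetical representation.
Hypothesis = Tuple[List[str], float]


def rescore(
    hypotheses: Sequence[Hypothesis],
    second_pass_logprob: Callable[[List[str]], float],
    top_k: int = 4,
    weight: float = 1.0,
) -> List[str]:
    """Return the tokens of the best hypothesis after second-pass rescoring.

    `second_pass_logprob` stands in for a second-pass decoder that scores a
    fixed token sequence. `weight` interpolates between the first-pass score
    (0.0) and the second-pass score (1.0); this interpolation is an
    illustrative assumption.
    """
    # Keep only the K best hypotheses according to the first pass.
    shortlist = sorted(hypotheses, key=lambda h: h[1], reverse=True)[:top_k]

    def combined(h: Hypothesis) -> float:
        tokens, first_pass = h
        return (1.0 - weight) * first_pass + weight * second_pass_logprob(tokens)

    # Pick the hypothesis with the highest combined score.
    return max(shortlist, key=combined)[0]


# Toy example with made-up scores.
nbest = [
    (["play", "sum", "music"], -1.0),
    (["play", "some", "music"], -1.2),
    (["lay", "some", "music"], -3.0),
]
toy_scores = {
    "play sum music": -2.0,
    "play some music": -0.3,
    "lay some music": -1.5,
}


def toy_scorer(tokens: List[str]) -> float:
    return toy_scores[" ".join(tokens)]


best = rescore(nbest, toy_scorer, top_k=2)
print(" ".join(best))
```

Because only K short sequences are scored rather than a full beam search being run, the cost of this mode grows with `top_k`, which is one way to see the accuracy versus latency trade-off described above.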

Experimental Findings

The authors ran experiments on large datasets representative of Google's voice search traffic, covering both short and long utterances. The two-pass system achieved a 17% to 22% relative reduction in WER compared to an RNN-T-only setup, attributable to the integration of the LAS decoder for rescoring.
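To make the metric concrete, the following sketch computes WER as word-level edit distance divided by reference length, and then a relative reduction between two WER values. The function names are hypothetical, and the numbers in the example are invented for illustration; they are not results from the paper.

```python
from typing import List


def word_errors(reference: List[str], hypothesis: List[str]) -> int:
    """Word-level Levenshtein distance: substitutions + insertions + deletions."""
    prev = list(range(len(hypothesis) + 1))
    for i, ref_word in enumerate(reference, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hypothesis, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # deletion of a reference word
                curr[j - 1] + 1,     # insertion of a hypothesis word
                prev[j - 1] + cost,  # match or substitution
            ))
        prev = curr
    return prev[-1]


def wer(reference: List[str], hypothesis: List[str]) -> float:
    """Word error rate of one hypothesis against a non-empty reference."""
    return word_errors(reference, hypothesis) / len(reference)


def relative_reduction(baseline_wer: float, new_wer: float) -> float:
    """Fraction by which `new_wer` improves on `baseline_wer`."""
    return (baseline_wer - new_wer) / baseline_wer


ref = "play some music".split()
print(wer(ref, "play sum music".split()))  # one substitution out of three words

# Made-up WERs: going from 10.0% to 8.0% is a 20% relative reduction.
print(relative_reduction(10.0, 8.0))
```

A relative reduction is measured against the baseline's own error rate, so the same percentage corresponds to a smaller absolute change when the baseline WER is already low.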

MWER (Minimum Word Error Rate) training further improved the LAS component. Rather than maximizing the likelihood of the reference transcript alone, MWER training minimizes the expected number of word errors over a set of hypotheses, which aligns the training objective with the WER metric. This yielded a notable improvement, especially on long utterances (LU), which are typically challenging for attention-based models.
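A common N-best formulation of the MWER objective can be sketched as follows. This is an illustration under stated assumptions, not the paper's training code: hypothesis probabilities are renormalized over the N-best list, the mean error count is subtracted as a baseline, and the function name `mwer_loss` is hypothetical. It works on plain floats, whereas real training would compute the same quantity inside an automatic-differentiation framework and may combine it with a likelihood term.

```python
import math
from typing import Sequence


def mwer_loss(log_probs: Sequence[float], word_errors: Sequence[int]) -> float:
    """Expected word errors over an N-best list, relative to the list's mean.

    log_probs[i]   : model log-probability of hypothesis i
    word_errors[i] : word errors of hypothesis i against the reference
    """
    # Renormalize probabilities over the N-best list (numerically stable softmax).
    peak = max(log_probs)
    weights = [math.exp(lp - peak) for lp in log_probs]
    total = sum(weights)
    posteriors = [w / total for w in weights]

    # Subtracting the mean error count acts as a baseline.
    mean_errors = sum(word_errors) / len(word_errors)
    return sum(p * (e - mean_errors) for p, e in zip(posteriors, word_errors))


# Two hypotheses with 0 and 2 word errors (made-up values).
print(mwer_loss([0.0, 0.0], [0, 2]))   # equal probability mass on both
print(mwer_loss([0.0, -2.0], [0, 2]))  # more mass on the error-free hypothesis
```

The loss decreases as probability mass moves toward hypotheses with fewer word errors, which is the behavior the training objective rewards.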

Practical and Theoretical Implications

The two-pass E2E model presents a practical solution for on-device speech recognition systems that require rapid response times without compromising transcription accuracy. It combines the streaming capability of the RNN-T with the higher accuracy of a LAS decoder, which attends over the encoded utterance in the second pass.

This research points toward more sophisticated multi-pass architectures, with possible enhancements in language model integration and adaptive beam strategies to reduce latency further. It reflects a growing trend toward fully integrated E2E speech systems that offer both robust performance and computational efficiency suited to mobile deployments.

Conclusion

Sainath et al.'s contribution is a practical architectural addition that narrows the quality gap between conventional and E2E models. The improvement in WER at the cost of a small latency increase supports its viability, while the comparison with a large conventional model highlights the trade-offs involved in real-world deployment. Continued work on context-aware speech modeling could bring such systems into a wider range of everyday applications.
