Streaming automatic speech recognition with the transformer model

Published 8 Jan 2020 in cs.SD, cs.CL, cs.LG, eess.AS, and stat.ML | (2001.02674v5)

Abstract: Encoder-decoder based sequence-to-sequence models have demonstrated state-of-the-art results in end-to-end automatic speech recognition (ASR). Recently, the transformer architecture, which uses self-attention to model temporal context information, has been shown to achieve significantly lower word error rates (WERs) compared to recurrent neural network (RNN) based system architectures. Despite its success, the practical usage is limited to offline ASR tasks, since encoder-decoder architectures typically require an entire speech utterance as input. In this work, we propose a transformer based end-to-end ASR system for streaming ASR, where an output must be generated shortly after each spoken word. To achieve this, we apply time-restricted self-attention for the encoder and triggered attention for the encoder-decoder attention mechanism. Our proposed streaming transformer architecture achieves 2.8% and 7.2% WER for the "clean" and "other" test data of LibriSpeech, which to our knowledge is the best published streaming end-to-end ASR result for this task.

Summary

The paper presents a novel approach for streaming ASR by modifying transformer self-attention to limit future context and using triggered attention in decoding.
It achieves state-of-the-art results with WERs of 2.8% on clean and 7.2% on other LibriSpeech test sets, balancing accuracy with latency.
The study highlights the effectiveness of joint CTC-triggered attention decoding, enhanced by SpecAugment and an RNN LM for improved recognition.

Streaming Automatic Speech Recognition with the Transformer Model

This paper studies how to move transformer-based automatic speech recognition (ASR) from offline to online (streaming) processing. The transformer architecture has performed well in offline ASR, where the full utterance is available before decoding starts, and the authors adapt it for real-time streaming. They do so by restricting the encoder's self-attention to a limited future context (time-restricted self-attention) and by using triggered attention (TA) for the encoder-decoder attention.

The main contribution is a practical end-to-end streaming ASR system. Encoder-decoder models typically require the entire speech utterance as input, which limits them to offline use. The authors address this with time-restricted self-attention in the encoder, which controls latency by limiting how much future context each input frame may attend to. Triggered attention complements it on the decoder side: it uses alignment information so that each output can be generated shortly after the corresponding word is spoken, rather than after the utterance ends.
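
As an illustration of the two restrictions described above, the following Python sketch builds a boolean mask for time-restricted self-attention and computes how many encoder frames a triggered-attention decoder could see for one output token. The function names, the `look_ahead` parameters, the unlimited left context, and the `trigger_frame` input (assumed here to come from an external alignment) are illustrative assumptions, not the authors' implementation.

```python
def time_restricted_mask(num_frames, look_ahead):
    """Build a self-attention mask with limited future context.

    mask[t][s] is True when query frame t may attend to frame s.
    Assumption: past frames are always visible; only the future is
    limited to `look_ahead` frames.
    """
    return [
        [s <= t + look_ahead for s in range(num_frames)]
        for t in range(num_frames)
    ]


def triggered_attention_span(trigger_frame, look_ahead, num_frames):
    """Return how many encoder frames are visible for one output token.

    Assumption: the token is "triggered" at `trigger_frame` (0-based),
    and the decoder may additionally see `look_ahead` later frames.
    """
    return min(trigger_frame + look_ahead + 1, num_frames)


mask = time_restricted_mask(num_frames=4, look_ahead=1)
visible = triggered_attention_span(trigger_frame=2, look_ahead=1, num_frames=10)
```

With `look_ahead=1`, the first frame can attend only to itself and the next frame, while the last frame can attend to the whole sequence seen so far. Larger look-ahead values expose more future context at the cost of higher latency.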

On the LibriSpeech benchmark, the streaming transformer achieves word error rates (WERs) of 2.8% and 7.2% on the "clean" and "other" test sets, respectively. According to the authors, these are the best published streaming end-to-end ASR results for this task. The results depend on the choice of encoder and decoder look-ahead settings, which trade recognition accuracy against latency.
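
For readers unfamiliar with the metric, WER is the word-level edit distance between the reference transcript and the recognizer output, divided by the number of reference words. The sketch below is a generic implementation with hypothetical example sentences; it is not the scoring tool used in the paper.

```python
def word_error_rate(reference, hypothesis):
    """Compute WER = (substitutions + deletions + insertions) / reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


# One substitution ("sat" -> "sit") and one deletion ("the"): 2 errors in 6 words.
wer = word_error_rate("the cat sat on the mat", "the cat sit on mat")
```

A WER of 2.8% therefore corresponds to roughly 28 word errors per 1,000 reference words.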

The experiments show that joint CTC-triggered attention decoding outperforms standalone CTC or attention decoding. SpecAugment during training and the integration of an RNN language model (LM) further improve the model's performance. The paper evaluates a range of configurations, which underlines how much parameter choices matter for achieving low-latency, high-accuracy recognition.
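
Joint decoding of this kind is commonly implemented as a log-linear combination of the CTC score, the attention decoder score, and an optional LM score for each hypothesis. The sketch below assumes that form. The weights, function names, and example scores are hypothetical, and the paper's exact scoring function may differ.

```python
def joint_score(log_p_ctc, log_p_att, log_p_lm=0.0, ctc_weight=0.5, lm_weight=0.0):
    """Combine per-hypothesis log-probabilities into one score.

    Assumed form: ctc_weight * CTC + (1 - ctc_weight) * attention
    + lm_weight * LM. All inputs are log-probabilities.
    """
    return (
        ctc_weight * log_p_ctc
        + (1.0 - ctc_weight) * log_p_att
        + lm_weight * log_p_lm
    )


def best_hypothesis(hyps, ctc_weight=0.5, lm_weight=0.0):
    """Pick the highest-scoring text from {text: (ctc, att, lm)} scores."""
    return max(
        hyps,
        key=lambda text: joint_score(
            *hyps[text], ctc_weight=ctc_weight, lm_weight=lm_weight
        ),
    )


# Made-up scores for two competing hypotheses.
hyps = {
    "i scream": (-1.0, -3.0, -2.0),
    "ice cream": (-2.5, -1.0, -4.0),
}
joint_choice = best_hypothesis(hyps, ctc_weight=0.5)
ctc_only_choice = best_hypothesis(hyps, ctc_weight=1.0)
with_lm_choice = best_hypothesis(hyps, ctc_weight=0.5, lm_weight=0.5)
```

Setting `ctc_weight` to 1.0 or 0.0 recovers CTC-only or attention-only ranking, which makes it easy to compare the three decoding modes on the same hypotheses.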

Implications of this work are substantial in both theoretical and practical spheres. Theoretically, the research expands the boundary of transformer architectures in sequence-to-sequence learning by effectively adapting them for streaming applications. Practically, it lays a foundation for deploying ASR systems in fields demanding real-time processing such as telecommunications, automated transcription services, and interactive voice-based applications.

Moving forward, further improvements may be sought through investigating user-perceived latency and optimization of triggered attention mechanisms for diverse datasets and linguistic contexts. The general applicability of time-restricted self-attention and TA concepts to other domains beyond ASR suggests intriguing avenues for research in AI systems prioritizing low-latency responses. This work serves both as a significant step in streaming ASR advancement and a catalyst for future inquiry into real-time machine learning applications.
