Transformer-based Acoustic Modeling for Hybrid Speech Recognition

Published 22 Oct 2019 in cs.CL and eess.AS | (1910.09799v2)

Abstract: We propose and evaluate transformer-based acoustic models (AMs) for hybrid speech recognition. Several modeling choices are discussed in this work, including various positional embedding methods and an iterated loss to enable training deep transformers. We also present a preliminary study of using limited right context in transformer models, which makes it possible for streaming applications. We demonstrate that on the widely used Librispeech benchmark, our transformer-based AM outperforms the best published hybrid result by 19% to 26% relative when the standard n-gram language model (LM) is used. Combined with neural network LM for rescoring, our proposed approach achieves state-of-the-art results on Librispeech. Our findings are also confirmed on a much larger internal dataset.

Summary

The paper adapts transformers to acoustic modeling in hybrid speech recognition, comparing several positional embedding methods and using an iterated loss to make deep transformers trainable.
On the Librispeech benchmark, it reports a 19% to 26% relative improvement over the best previously published hybrid result when the standard n-gram language model is used.
It presents a preliminary study of limited right context as a step toward streamable transformer ASR, and confirms the findings on a 13.7K-hour internal dataset.

Transformer-Based Acoustic Modeling for Hybrid Speech Recognition

The paper investigates transformer-based acoustic models (AMs) within hybrid speech recognition systems. The authors propose architectural adaptations and training techniques for applying transformers to acoustic modeling, evaluate them against established benchmarks, and explore their suitability for streaming applications.

Transformer Architecture in Acoustic Modeling

The move from recurrent neural networks (RNNs), particularly Long Short-Term Memory networks (LSTMs), to transformer architectures in acoustic modeling is driven primarily by the self-attention mechanism. RNNs must process frames one after another and can struggle to retain long temporal dependencies. Transformers instead use self-attention to connect every pair of input elements directly, which allows the frames of a sequence to be processed in parallel and long-range dependencies to be modeled in a single step.
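To make the idea of connecting input elements directly more concrete, here is a minimal sketch of single-head scaled dot-product self-attention in plain Python. It is an illustration under simplifying assumptions, not the paper's model: queries, keys, and values are taken to be the input frames themselves (no learned projections), and the function names and toy feature values are hypothetical.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(frames):
    """Single-head scaled dot-product self-attention.

    Assumption: queries, keys and values are the input frames themselves
    (identity projections); a real model learns separate projections.
    """
    dim = len(frames[0])
    scale = math.sqrt(dim)
    weights = []
    outputs = []
    for query in frames:
        # Each frame is scored against every frame in the sequence directly.
        scores = [sum(q * k for q, k in zip(query, key)) / scale for key in frames]
        row = softmax(scores)
        weights.append(row)
        # The output is the attention-weighted sum of the value vectors.
        outputs.append([sum(w * v[d] for w, v in zip(row, frames)) for d in range(dim)])
    return outputs, weights

# Toy input: 4 frames with 3 features each (made-up numbers).
toy_frames = [[1.0, 0.0, 0.5], [0.0, 1.0, 0.5], [1.0, 1.0, 0.0], [0.5, 0.5, 0.5]]
attended, attention_weights = self_attention(toy_frames)
```

Each row of `attention_weights` sums to one and is computed independently of the other rows, which is why the per-frame loop could run in parallel, unlike the step-by-step recurrence of an RNN.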

Key Contributions and Experimental Outcomes

Several critical contributions are highlighted in the paper:

Modeling and Positional Encoding Innovations: The work explores different methodologies for injecting positional information into the transformer inputs. The authors experimented with sinusoidal positional embeddings, frame stacking, and convolutional embeddings, discovering that the latter offered superior performance by implicitly encoding relative positional information through layer-wise transformations.
Iterated Loss Technique: To facilitate the training of the deep transformer networks without convergence issues, the paper employs an iterated loss technique. This introduces auxiliary losses at various layers, interpolating them with the primary cross-entropy loss, thereby stabilizing training for deeper configurations.
Competitive Performance: On the widely used Librispeech benchmark, the transformer-based acoustic model outperformed bi-directional LSTM baselines in word error rate (WER) and improved on the best previously published hybrid result by 19% to 26% relative when using the standard 4-gram language model (LM). When combined with a neural network LM for rescoring, the system established state-of-the-art performance on this dataset.
Scalability to Large Datasets: The proposed transformer architecture was evaluated on a large-scale internal dataset (13.7K hours of video data), corroborating its superior performance across curated, clean, and noisy subsets.
Streamable Transformer Models: Although preliminary, the paper makes strides toward developing streamable transformer-based ASR systems by exploring models with limited right context, essential for real-time applications.
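The iterated loss described in the list above can be sketched as the primary cross-entropy loss plus weighted auxiliary cross-entropy losses computed from intermediate layers. The snippet below is a rough per-frame illustration only: the function names, the example logits, and the interpolation weight `aux_weight=0.3` are hypothetical choices for this sketch, not settings taken from the paper.

```python
import math

def cross_entropy(logits, target):
    """Cross-entropy for one frame: -log softmax(logits)[target]."""
    peak = max(logits)
    log_norm = peak + math.log(sum(math.exp(x - peak) for x in logits))
    return log_norm - logits[target]

def iterated_loss(final_logits, intermediate_logits, target, aux_weight=0.3):
    """Primary loss plus weighted auxiliary losses from intermediate layers.

    aux_weight is a hypothetical interpolation weight, not a value
    reported in the paper.
    """
    loss = cross_entropy(final_logits, target)
    for logits in intermediate_logits:
        # Each tapped intermediate layer contributes its own classification loss.
        loss += aux_weight * cross_entropy(logits, target)
    return loss

# Made-up logits over 3 output classes for a single frame.
final_logits = [2.0, 0.5, -1.0]
intermediate_logits = [[1.0, 0.8, 0.0], [1.5, 0.2, -0.5]]
primary_only = iterated_loss(final_logits, [], target=0)
with_auxiliary = iterated_loss(final_logits, intermediate_logits, target=0)
```

Because the auxiliary terms attach a training signal directly to intermediate layers, gradients reach the lower part of a deep stack without passing through every layer above it, which is the stabilizing effect the summary attributes to this technique.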

Implications and Future Directions

The shift from RNN-based to transformer-based architectures in speech processing opens the way to more parallelizable models. The improvement of transformers over LSTMs reported in the paper underscores the potential of self-attention in the audio domain, particularly for modeling long-range dependencies.

Future research directions include reducing the computational cost of self-attention, which grows quadratically with input length. Another is building streamable transformer ASR systems that retain the accuracy of full-context models. Additionally, integrating these architectural innovations with neural transduction models may provide an end-to-end solution that overcomes limitations of conventional hybrid systems.
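As a small illustration of both points, the sketch below builds an attention mask that limits how far ahead each frame may look and counts the frame pairs that remain. It is an assumption-laden toy, not the paper's configuration: the function names, the 6-frame sequence, and the optional `left_context` window are hypothetical.

```python
def attention_mask(num_frames, right_context, left_context=None):
    """Return mask[t][s], True when frame t may attend to frame s.

    right_context limits lookahead; left_context=None means unlimited
    history. Both parameters are illustrative, not the paper's settings.
    """
    mask = []
    for t in range(num_frames):
        row = []
        for s in range(num_frames):
            within_right = s <= t + right_context
            within_left = left_context is None or s >= t - left_context
            row.append(within_right and within_left)
        mask.append(row)
    return mask

def count_attended_pairs(mask):
    """Number of (query, key) pairs the mask allows."""
    return sum(sum(row) for row in mask)

full = attention_mask(6, right_context=6)                     # every frame sees every frame
limited = attention_mask(6, right_context=1)                  # one frame of lookahead
windowed = attention_mask(6, right_context=1, left_context=2) # bounded on both sides

pairs_full = count_attended_pairs(full)          # 36, i.e. 6 * 6
pairs_limited = count_attended_pairs(limited)    # 26
pairs_windowed = count_attended_pairs(windowed)  # 20
```

Limiting only the right context bounds latency but still leaves a cost that grows quadratically with sequence length; bounding the left context as well makes the number of attended pairs grow roughly linearly.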

Overall, the advancement in transformer-based acoustic models as outlined in the paper represents a substantive progression in the field of automatic speech recognition, providing a flexible framework for future research and practical deployment in diverse audio processing tasks.
