
RWTH ASR Systems for LibriSpeech: Hybrid vs Attention -- w/o Data Augmentation (1905.03072v3)

Published 8 May 2019 in cs.CL, cs.SD, and eess.AS

Abstract: We present state-of-the-art automatic speech recognition (ASR) systems employing a standard hybrid DNN/HMM architecture compared to an attention-based encoder-decoder design for the LibriSpeech task. Detailed descriptions of the system development, including model design, pretraining schemes, training schedules, and optimization approaches, are provided for both system architectures. Both hybrid DNN/HMM and attention-based systems employ bi-directional LSTMs for acoustic modeling/encoding. For language modeling, we employ both LSTM and Transformer based architectures. All our systems are built using RWTH's open-source toolkits RASR and RETURNN. To the best knowledge of the authors, the results obtained when training on the full LibriSpeech training set are the best published currently, both for the hybrid DNN/HMM and the attention-based systems. Our single hybrid system even outperforms previous results obtained from combining eight single systems. Our comparison shows that on the LibriSpeech 960h task, the hybrid DNN/HMM system outperforms the attention-based system by 15% relative on the clean and 40% relative on the other test sets in terms of word error rate. Moreover, experiments on a reduced 100h-subset of the LibriSpeech training corpus even show a more pronounced margin between the hybrid DNN/HMM and attention-based architectures.

Authors (8)
  1. Christoph Lüscher (10 papers)
  2. Eugen Beck (9 papers)
  3. Kazuki Irie (35 papers)
  4. Markus Kitza (3 papers)
  5. Wilfried Michel (12 papers)
  6. Albert Zeyer (20 papers)
  7. Ralf Schlüter (73 papers)
  8. Hermann Ney (104 papers)
Citations (232)

Summary

  • The paper demonstrates that the hybrid DNN/HMM system, enhanced by LSTM language models and sequence-discriminative training, significantly lowers WER compared to attention-based models.
  • Empirical findings show a relative WER improvement of over 15% on the clean test set and more than 40% on the noisier test set for the hybrid system, underscoring its robustness.
  • The study details precise training schedules and optimization strategies, offering actionable insights for building both hybrid and attention-based neural ASR architectures without data augmentation.

Evaluation of RWTH ASR Systems for LibriSpeech: A Comparative Study of Hybrid and Attention Models Without Data Augmentation

The paper under consideration presents a rigorous analysis of two state-of-the-art Automatic Speech Recognition (ASR) system architectures applied to the LibriSpeech dataset, namely the hybrid Deep Neural Network/Hidden Markov Model (DNN/HMM) and the attention-based encoder-decoder approach. This paper merits attention from the ASR research community owing to its comparative insights into these two approaches without recourse to data augmentation, a prevailing technique in speech recognition.

Methodological Overview

The paper meticulously builds each system component, elucidating training schedules and optimization strategies. The hybrid DNN/HMM model relies on a conventional Gaussian Mixture Model/Hidden Markov Model (GMM/HMM) system for initial alignments. Acoustic modeling is performed with bi-directional Long Short-Term Memory (LSTM) networks, and an extensive sequence of enhancements, including speaker adaptation techniques and lattice-based sequence-discriminative training, refines performance; a tied-state inventory of 12k Classification and Regression Tree (CART) labels is identified as optimal. Language modeling is conducted with both statistical 4-gram models and more sophisticated LSTM networks.
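As a rough illustration of this acoustic-model structure, the sketch below stacks bidirectional LSTM layers that map frame-level features to per-frame posteriors over CART state labels. It is a minimal PyTorch approximation rather than the RASR/RETURNN setup used in the paper; the feature dimension, layer sizes, and label count are placeholder assumptions.

```python
import torch
import torch.nn as nn

class BLSTMAcousticModel(nn.Module):
    """Minimal sketch of a bidirectional-LSTM acoustic model emitting
    per-frame posteriors over clustered CART state labels (sizes assumed)."""

    def __init__(self, feat_dim=40, hidden=512, layers=6, num_cart_labels=12000):
        super().__init__()
        self.blstm = nn.LSTM(
            input_size=feat_dim,
            hidden_size=hidden,
            num_layers=layers,
            bidirectional=True,
            batch_first=True,
        )
        # Frame-wise output layer over the CART state inventory.
        self.output = nn.Linear(2 * hidden, num_cart_labels)

    def forward(self, features):
        # features: (batch, time, feat_dim) acoustic frames
        encoded, _ = self.blstm(features)
        return torch.log_softmax(self.output(encoded), dim=-1)

# Frame-level cross-entropy training against fixed GMM/HMM alignments,
# i.e. the starting point before sequence-discriminative training.
model = BLSTMAcousticModel()
feats = torch.randn(8, 200, 40)               # dummy batch: 8 utterances, 200 frames
targets = torch.randint(0, 12000, (8, 200))   # dummy CART-label alignment
loss = nn.NLLLoss()(model(feats).transpose(1, 2), targets)
```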

Conversely, the attention-based system uses a sub-word-level encoder-decoder configuration operating on Byte-Pair Encoding (BPE) units. This model benefits from extensive pretraining and curriculum learning strategies to strengthen representation learning. The paper also explores improving the baseline performance of the end-to-end model with external neural language models, based on LSTMs and Transformers, integrated into recognition via shallow fusion.
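Shallow fusion, as used here to bring an external neural language model into the attention decoder's search, simply interpolates the log-probabilities of the two models at each decoding step. The snippet below shows only that scoring rule in isolation; the weight value and tensor shapes are illustrative assumptions, not the paper's configuration.

```python
import torch

def shallow_fusion_scores(decoder_log_probs, lm_log_probs, lm_weight=0.3):
    """Combine per-step log-probabilities of the attention decoder and an
    external LM over the same BPE vocabulary (lm_weight is a tuned constant)."""
    # score(y_t) = log p_dec(y_t | y_<t, x) + lambda * log p_lm(y_t | y_<t)
    return decoder_log_probs + lm_weight * lm_log_probs

# During beam search, each hypothesis is extended with the tokens that
# maximize the fused score rather than the decoder score alone.
dec = torch.log_softmax(torch.randn(4, 10000), dim=-1)  # 4 beam hyps, 10k BPE units
lm = torch.log_softmax(torch.randn(4, 10000), dim=-1)
fused = shallow_fusion_scores(dec, lm)
best_next_tokens = fused.topk(k=4, dim=-1).indices
```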

Empirical Results

The hybrid DNN/HMM system demonstrates superior word error rates (WER) compared to the attention-based system, achieving relative improvements of over 15% on the clean test set and more than 40% under the noisier test-other condition. Furthermore, combining sequence-discriminative training with a strong LSTM language model, followed by lattice rescoring with a Transformer-based language model, yields further gains in recognition accuracy, establishing lower WERs than previously reported in the literature.
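To make the "relative" figures concrete: a relative WER improvement compares the gap between two systems to the weaker system's WER, not to the absolute percentage scale. The numbers below are placeholders chosen only to illustrate the arithmetic and are not results from the paper.

```python
def relative_wer_improvement(baseline_wer, improved_wer):
    """Relative improvement of `improved_wer` over `baseline_wer`, in percent."""
    return 100.0 * (baseline_wer - improved_wer) / baseline_wer

# Hypothetical example: a baseline at 3.0% WER versus a stronger system at
# 2.5% WER corresponds to roughly a 16.7% relative improvement.
print(relative_wer_improvement(3.0, 2.5))  # ≈ 16.67
```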

The attention-based model still exhibits competitive performance among end-to-end ASR systems; however, it falls short of the hybrid system under the experimental constraints of this paper. Notably, applying stronger neural language models, particularly Transformer-based ones, considerably improves the attention system and narrows the gap to the hybrid system.

Discussion and Implications

The present research reinforces the current dominance of hybrid DNN/HMM systems over attention-based end-to-end models, notwithstanding rapid advances in the latter. This finding holds particularly when data augmentation is omitted, as it is by design in this paper. The results suggest that hybrid models remain more effective because they can exploit mature frameworks and explicit structural modeling, and that this advantage persists across both clean and noisier test conditions.

Theoretically, the results suggest that despite the appeal of end-to-end solutions for their modeling simplicity and reduced reliance on domain expertise, further research is needed to close the performance gap with carefully optimized hybrid pipelines. Future work could integrate these architectural advances with data augmentation techniques to exploit their respective strengths fully.

In conclusion, the developments chronicled in this paper represent a significant contribution to the field of ASR, establishing clear benchmarks for hybrid systems relative to fully neural architectures under the stated dataset constraints. As such, it lays the groundwork for further investigation of data augmentation and alternative training paradigms in pursuit of stronger ASR systems.