- The paper's main contribution is doubling the corpus to 452 hours of transcribed speech, which substantially boosts end-to-end neural ASR performance in terms of WER.
- It details a dual distribution approach, maintaining a legacy corpus while introducing a specialized partition for speaker adaptation research.
- Experimental results reveal that while HMM-based systems plateau around 6.7% WER, end-to-end models benefit notably, reducing WER from 20.3% to 13.7%.
Overview of TED-LIUM 3: Data Expansion and Speaker Adaptation Exploration
The paper, "TED-LIUM 3: Twice as Much Data and Corpus Repartition for Experiments on Speaker Adaptation," presents the third iteration of the TED-LIUM corpus, which significantly expands its dataset for automatic speech recognition (ASR) research. This new release doubles the transcribed speech data from previous versions, targeting enhanced end-to-end ASR systems and speaker adaptation capabilities.
TED-LIUM 3 provides 452 hours of aligned audio data, compared to the 207 hours in release 2. This substantial increase in training data proves particularly beneficial to end-to-end ASR systems. While Hidden Markov Model (HMM)-based systems still outperform end-to-end models, achieving a Word Error Rate (WER) of 6.7% versus 13.7% for the neural model, the additional data considerably improves the performance of purely neural approaches.
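For reference, WER is the word-level edit distance between a hypothesis and the reference transcript, normalized by the reference length. The snippet below is a minimal sketch of that computation in Python; the function name and example strings are purely illustrative and are not taken from the paper or its tooling.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Compute WER as word-level edit distance / reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    # Dynamic-programming table for Levenshtein distance over words.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i            # deletions
    for j in range(len(hyp) + 1):
        d[0][j] = j            # insertions
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + sub)  # substitution / match
    return d[len(ref)][len(hyp)] / max(len(ref), 1)

# Example: one substitution over four reference words -> 25% WER.
print(word_error_rate("the talk was recorded", "the talk is recorded"))  # 0.25
```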
Detailed Contributions
The corpus offers two main distributions:
- Legacy Distribution: Maintains consistency with previous releases, facilitating straightforward comparison of experimental results.
- Speaker Adaptation Distribution: Tailored for exploring speaker adaptation strategies, such as i-vectors and feature-space maximum likelihood linear regression (fMLLR); a minimal i-vector adaptation sketch follows this list.
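The corpus itself does not prescribe an adaptation recipe, but a common way to use i-vectors in neural acoustic models is to append a per-speaker i-vector to every acoustic frame before it is fed to the network. The NumPy sketch below illustrates only that concatenation step; the array shapes, dimensions, and names are assumptions for illustration rather than values from the paper.

```python
import numpy as np

def append_ivector(frames: np.ndarray, ivector: np.ndarray) -> np.ndarray:
    """Append a fixed speaker i-vector to every acoustic frame.

    frames:  (num_frames, feat_dim)  e.g. 40-dim filterbank features
    ivector: (ivector_dim,)          e.g. a 100-dim speaker i-vector
    returns: (num_frames, feat_dim + ivector_dim)
    """
    tiled = np.tile(ivector, (frames.shape[0], 1))  # repeat the i-vector per frame
    return np.concatenate([frames, tiled], axis=1)

# Illustrative shapes only: 300 frames of 40-dim features, 100-dim i-vector.
feats = np.random.randn(300, 40).astype(np.float32)
ivec = np.random.randn(100).astype(np.float32)
adapted = append_ivector(feats, ivec)
print(adapted.shape)  # (300, 140)
```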
Experimental Insights
The experiments underline several crucial points:
- HMM-based Systems: Despite doubling the training data, HMM-based systems showed only marginal improvement, with WER decreasing from 6.8% (release 2) to 6.7% (release 3). This plateau suggests limited ability to leverage additional data with this architecture.
- End-to-End Neural ASR: This approach is demonstrably data-hungry. WER fell from 20.3% when training on TED-LIUM 2 data to 13.7% with TED-LIUM 3, a markedly larger benefit from the increased data availability (see the comparison after this list).
- Speaker Adaptation Techniques: Adaptation methods consistently reduced WER; for instance, a TDNN-LSTM acoustic model with i-vector adaptation yielded notable gains over its unadapted counterpart.
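To put the reported numbers in perspective, the relative WER reductions can be computed directly from the figures summarized above; the snippet below simply does that arithmetic and is not output from the paper's experiments.

```python
def relative_wer_reduction(old_wer: float, new_wer: float) -> float:
    """Relative reduction in WER, as a fraction of the old WER."""
    return (old_wer - new_wer) / old_wer

# Figures reported in the summary above (percent WER).
print(relative_wer_reduction(20.3, 13.7))  # ~0.33: end-to-end, TED-LIUM 2 -> 3
print(relative_wer_reduction(6.8, 6.7))    # ~0.01: HMM-based, TED-LIUM 2 -> 3
```

The contrast is stark: roughly a one-third relative reduction for the end-to-end system versus about one percent for the HMM-based system.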
Implications and Future Directions
Practically, TED-LIUM 3's richer dataset holds substantial promise for improving ASR model performance, particularly for neural architectures. As end-to-end systems show marked gains with more data, further dataset expansion could continue to narrow the performance gap with traditional HMM-based methods.
The shift in focus from refining HMM-based systems to investing in deep learning methodologies reflects broader trends in AI, where the capacity to scale with data volume is pivotal. The speaker adaptation component is equally important, aiming to personalize ASR systems and make them more robust to individual speaker variation.
Looking forward, this research suggests pathways for further ASR development through increased data augmentation and model refinement. The open availability of TED-LIUM 3 promises to catalyze advances not only in academic settings but also in applied and commercial ASR applications. Future studies may explore more specialized adaptation strategies, neural architectures that better exploit the expanded data volume, and additional dimensions such as multilingual capabilities and cross-domain generalization.