Updated Corpora and Benchmarks for Long-Form Speech Recognition (2309.15013v1)

Published 26 Sep 2023 in cs.CL, cs.SD, and eess.AS

Abstract: The vast majority of ASR research uses corpora in which both the training and test data have been pre-segmented into utterances. In most real-world ASR use-cases, however, test audio is not segmented, leading to a mismatch between inference-time conditions and models trained on segmented utterances. In this paper, we re-release three standard ASR corpora - TED-LIUM 3, GigaSpeech, and VoxPopuli-en - with updated transcription and alignments to enable their use for long-form ASR research. We use these reconstituted corpora to study the train-test mismatch problem for transducers and attention-based encoder-decoders (AEDs), confirming that AEDs are more susceptible to this issue. Finally, we benchmark a simple long-form training approach for these models, showing its efficacy for model robustness under this domain shift.


Summary

  • The paper shows that updated corpora enable realistic long-form speech recognition by linking segments and expanding transcription gaps.
  • It details a methodology that merges contiguous audio and supplements data to bridge training and inference mismatches.
  • Experimental results reveal that while AED models improve with long-form training, transducers remain more robust against deletion errors.

An Examination of Updated Corpora and Benchmarks for Long-form Speech Recognition

The paper "Updated Corpora and Benchmarks for Long-form Speech Recognition" addresses a critical challenge in Automatic Speech Recognition (ASR): the mismatch between training and inference conditions due to pre-segmented utterances in training corpora versus the unsegmented nature of real-world audio. This is particularly pronounced in long-form speech recognition, where transcripts originate from extended audio segments. The authors endeavor to bridge this gap by re-releasing prominent ASR corpora—TED-LIUM 3, GigaSpeech, and VoxPopuli-en—with revised transcriptions and alignments to support long-form ASR research.

Core Contributions and Methodology

The authors highlight a significant issue in ASR research: most corpora used for training and evaluation are pre-segmented into short utterances, which impedes effective long-form ASR modeling. To rectify this, the paper introduces two essential processes: linking and expansion. Linking involves merging contiguous audio segments, and expansion entails incorporating additional audio and transcriptions where gaps exist. These reconstituted corpora are intended to serve as more realistic benchmarks for current ASR models.

The paper's methodology section outlines the process of transforming the existing datasets into their long-form variants. For instance, linking in the GigaSpeech corpus was achieved by joining sequential segments, while expansion in TED-LIUM used external transcriptions to fill in missing sections. Through these techniques, the paper not only expands dataset size but also significantly extends average segment length, furnishing a more authentic testbed for long-form ASR systems.
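The linking step described above can be illustrated with a minimal sketch. This is a hypothetical reconstruction, not the paper's actual tooling: the `Segment` fields, the `max_gap` threshold, and the merging rule are illustrative assumptions about how time-adjacent segments might be joined into longer utterances.

```python
from dataclasses import dataclass

@dataclass
class Segment:
    start: float  # seconds into the recording
    end: float
    text: str

def link_segments(segments, max_gap=1.0):
    """Merge time-adjacent segments into longer utterances.

    Hypothetical sketch: two segments are linked when the silence
    between them is at most `max_gap` seconds (an assumed threshold,
    not a parameter taken from the paper).
    """
    merged = []
    for seg in sorted(segments, key=lambda s: s.start):
        if merged and seg.start - merged[-1].end <= max_gap:
            prev = merged[-1]
            merged[-1] = Segment(prev.start, seg.end, prev.text + " " + seg.text)
        else:
            merged.append(seg)
    return merged
```

Applied to three segments at 0-2s, 2.5-4s, and 10-12s, the first two would be linked (0.5s gap) while the third remains a separate utterance, yielding two longer segments in place of three short ones.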

Experimental Results and Analysis

The authors benchmark two ASR model architectures, transducers and attention-based encoder-decoders (AEDs), under both short-form (segmented utterances) and long-form (unsegmented long recordings) conditions. Baseline results demonstrate that transducers outperform AEDs in the long-form setting, with the latter showing substantial degradation driven by high deletion error rates. When models were trained on the updated long-form segments, performance improved, especially for AEDs, indicating that exposure to longer context sequences during training mitigates the mismatch-induced performance drop.
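The deletion-heavy failure mode discussed above is visible in the standard WER decomposition. The sketch below uses ordinary word-level Levenshtein alignment to separate substitution, insertion, and deletion counts; it is a generic illustration of how such error breakdowns are computed, not the paper's scoring pipeline.

```python
def wer_counts(ref, hyp):
    """Word-level Levenshtein alignment returning (subs, ins, dels, WER)."""
    r, h = ref.split(), hyp.split()
    n, m = len(r), len(h)
    # dp[i][j]: minimum edits aligning r[:i] with h[:j]
    dp = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        dp[i][0] = i
    for j in range(m + 1):
        dp[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = dp[i - 1][j - 1] + (r[i - 1] != h[j - 1])
            dp[i][j] = min(sub, dp[i - 1][j] + 1, dp[i][j - 1] + 1)
    # Backtrace through the table to attribute each edit to a type.
    i, j = n, m
    subs = ins = dels = 0
    while i > 0 or j > 0:
        if i > 0 and j > 0 and dp[i][j] == dp[i - 1][j - 1] + (r[i - 1] != h[j - 1]):
            subs += r[i - 1] != h[j - 1]
            i, j = i - 1, j - 1
        elif i > 0 and dp[i][j] == dp[i - 1][j] + 1:
            dels += 1  # reference word with no hypothesis match
            i -= 1
        else:
            ins += 1  # hypothesis word with no reference match
            j -= 1
    return subs, ins, dels, (subs + ins + dels) / max(n, 1)
```

An AED that stops decoding early on a long recording produces a hypothesis that is a prefix of the reference; every dropped reference word then scores as a deletion, which is exactly the error profile the paper reports for AEDs under long-form evaluation.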

The paper's results table demonstrates the efficacy of this long-form training strategy, with substantial reductions in word error rate (WER) for AED models. Despite these improvements, transducers remain the more robust architecture, arguably due to their frame-synchronous nature, which inherently aligns better with extended context.

Implications and Future Directions

The proposed updates to established corpora facilitate more effective long-form ASR research and serve as a standardized benchmark to evaluate future models. This work underlines the importance of dataset authenticity in ASR research, advocating for the necessity of training data that reflects real-world audio conditions. The insights drawn from the comparison between transducers and AEDs will likely motivate further advancements in model architectures, focusing on robustness to context length variability.

Moreover, the ability to train with imperfect transcriptions, as suggested by the researchers, hints at potential future developments where models could learn to handle noisy or incomplete datasets—a significant step towards more resilient ASR systems.

In conclusion, this paper contributes substantively to the understanding and advancement of long-form ASR by realigning training conditions with real-world audio characteristics. By providing a more robust set of benchmarks, it lays the groundwork for future exploration into model optimization that can handle the variegated nature of human speech.
