LRS3-TED: a large-scale dataset for visual speech recognition (1809.00496v2)

Published 3 Sep 2018 in cs.CV

Abstract: This paper introduces a new multi-modal dataset for visual and audio-visual speech recognition. It includes face tracks from over 400 hours of TED and TEDx videos, along with the corresponding subtitles and word alignment boundaries. The new dataset is substantially larger in scale compared to other public datasets that are available for general research.

Authors (3)
  1. Triantafyllos Afouras (29 papers)
  2. Joon Son Chung (106 papers)
  3. Andrew Zisserman (248 papers)
Citations (379)

Summary

  • The paper presents a 400-hour dataset from TED talks that sets a benchmark for lip reading and audio-visual speech recognition.
  • It details a robust pipeline using SSD-based face detection, precise audio-text alignment, and SyncNet for synchronization and speaker verification.
  • The dataset's scale and diverse, detailed annotations offer significant potential to enhance model accuracy and advance research in human-computer interaction.

Overview of LRS3-TED: A Large-Scale Dataset for Visual Speech Recognition

The paper presents the LRS3-TED dataset, a substantial multi-modal resource designed to advance research in visual and audio-visual speech recognition. Developed by the Visual Geometry Group at the University of Oxford, this dataset comprises over 400 hours of meticulously prepared TED and TEDx video content. It is distinctive not only for its remarkable scale but also for providing a comprehensive benchmark for comparative evaluation of lip-reading systems.

Dataset Composition

The LRS3-TED dataset is constructed from TED and TEDx talks, specifically chosen for their diverse speaker pool and continuous facial footage. With a total running time exceeding 400 hours, the dataset is organized into three primary subsets: pre-train, train-val, and test. Notably, the pre-train and train-val sets share content, whereas the test set remains entirely independent. The dataset provides face tracks in 224x224 resolution MP4 files, accompanied by synchronized 16-bit audio tracks and detailed textual transcripts that include word alignment boundaries.
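To make the data layout concrete, the sketch below shows one way to pair a face-track clip with its word-aligned transcript. The directory structure, file names, and the "WORD START END" alignment layout are illustrative assumptions, not the documented release format.

```python
from pathlib import Path

import cv2  # OpenCV: pip install opencv-python


def load_face_track(mp4_path):
    """Read a 224x224 face-track MP4 into a list of BGR frames."""
    cap = cv2.VideoCapture(str(mp4_path))
    frames = []
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        frames.append(frame)
    cap.release()
    return frames


def load_word_alignments(txt_path):
    """Parse (word, start_sec, end_sec) triples from a transcript file.

    Assumes each alignment line looks like 'WORD START END'; adjust the
    parsing to whatever layout the actual release uses.
    """
    alignments = []
    for line in Path(txt_path).read_text().splitlines():
        parts = line.split()
        if len(parts) < 3:
            continue
        try:
            start, end = float(parts[1]), float(parts[2])
        except ValueError:
            continue  # skip header / metadata lines
        alignments.append((parts[0], start, end))
    return alignments


# Hypothetical paths: one face-track clip and its transcript.
frames = load_face_track("trainval/speaker_id/00001.mp4")
words = load_word_alignments("trainval/speaker_id/00001.txt")
print(f"{len(frames)} frames, {len(words)} aligned words")
```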

Methodology

The authors employ a sophisticated pipeline to curate the dataset, incorporating several advanced techniques:

  1. Video Processing: An SSD-based CNN face detector locates faces in individual frames, and face tracks are assembled within shot boundaries detected by comparing color histograms across consecutive frames (see the histogram-comparison sketch after this list).
  2. Audio-Text Alignment: To ensure accurate speech-text correspondence, the pipeline starts from human-generated subtitles and refines them to word-level alignments with P2FA; Kaldi-based ASR models provide an additional verification layer.
  3. Synchronization and Speaker Verification: Due to potential audio-video desynchronization in source material, the authors leverage a SyncNet two-stream network to synchronize streams. This network also facilitates speaker verification by matching lip movements to audio, rejecting non-matching clips as voice-overs.
  4. Sentence Extraction: Videos are segmented into sentences or phrases based on punctuation in the transcripts, keeping clip lengths manageable for model training.
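As referenced in item 1, shot boundaries are found by analyzing color histograms across frames. The sketch below is a minimal, simplified stand-in for that step: it compares per-frame HSV histograms with the previous frame and flags a boundary when correlation drops below a threshold, which is an assumed tuning parameter rather than the authors' exact criterion.

```python
import cv2


def shot_boundaries(video_path, threshold=0.5):
    """Return frame indices where the colour histogram changes sharply.

    Simplified stand-in for the pipeline's shot detection: per-frame HSV
    histograms are compared with the previous frame, and a boundary is
    flagged when correlation falls below `threshold` (an assumed value).
    """
    cap = cv2.VideoCapture(video_path)
    boundaries = []
    prev_hist = None
    idx = 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        hsv = cv2.cvtColor(frame, cv2.COLOR_BGR2HSV)
        hist = cv2.calcHist([hsv], [0, 1], None, [50, 60], [0, 180, 0, 256])
        cv2.normalize(hist, hist, 0, 1, cv2.NORM_MINMAX)
        if prev_hist is not None:
            similarity = cv2.compareHist(prev_hist, hist, cv2.HISTCMP_CORREL)
            if similarity < threshold:
                boundaries.append(idx)
        prev_hist = hist
        idx += 1
    cap.release()
    return boundaries


# Face tracks would then be built independently within each detected shot.
print(shot_boundaries("talk.mp4"))  # hypothetical input video
```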

Comparative Analysis

The LRS3-TED dataset stands out against existing datasets such as GRID, MODALITY, LRW, and LRS2-BBC, primarily due to its unparalleled scale and comprehensive annotation. Detailed word-level alignments and a broad vocabulary make it a versatile research tool. Importantly, the choice of TED content yields a heterogeneous speaker set, in stark contrast to datasets derived from TV programs with small, recurring casts.

Implications and Future Prospects

The release of the LRS3-TED dataset has significant implications for the field of visual speech recognition. Primarily, it provides a benchmark against which new models can be rigorously evaluated. It also opens avenues for advances in speech enhancement and broader audio-visual learning tasks. As models continue to evolve, the dataset will likely spur improvements in the accuracy and robustness of lip-reading systems.

Future research may explore expanding the dataset with non-English TED content or incorporating additional modalities such as gestures or multi-camera angles, further enriching the dataset's utility. Additionally, the methods employed in preparing the LRS3-TED dataset might be refined or adapted for generating similar resources from other video repositories.

The introduction of LRS3-TED marks a pivotal contribution to visual speech recognition and related spatio-temporal learning tasks, setting a high standard for forthcoming datasets. Researchers now have an expansive, versatile resource to foster innovations in human-computer interaction and related fields.