LRS3-TED: a large-scale dataset for visual speech recognition (1809.00496v2)

Published 3 Sep 2018 in cs.CV

Abstract: This paper introduces a new multi-modal dataset for visual and audio-visual speech recognition. It includes face tracks from over 400 hours of TED and TEDx videos, along with the corresponding subtitles and word alignment boundaries. The new dataset is substantially larger in scale compared to other public datasets that are available for general research.

Citations (379)

Summary

  • The paper presents a 400-hour dataset from TED talks that sets a benchmark for lip reading and audio-visual speech recognition.
  • It details a robust pipeline using SSD-based face detection, precise audio-text alignment, and SyncNet for synchronization and speaker verification.
  • The dataset's scale and diverse, detailed annotations offer significant potential to enhance model accuracy and advance research in human-computer interaction.

Overview of LRS3-TED: A Large-Scale Dataset for Visual Speech Recognition

The paper presents the LRS3-TED dataset, a substantial multi-modal resource designed to advance research in visual and audio-visual speech recognition. Developed by the Visual Geometry Group at the University of Oxford, this dataset comprises over 400 hours of meticulously prepared TED and TEDx video content. It is distinctive not only for its remarkable scale but also for providing a comprehensive benchmark for comparative evaluation of lip-reading systems.

Dataset Composition

The LRS3-TED dataset is constructed from TED and TEDx talks, specifically chosen for their diverse speaker pool and continuous facial footage. With a total running time exceeding 400 hours, the dataset is organized into three primary subsets: pre-train, train-val, and test. Notably, the pre-train and train-val sets share content, whereas the test set remains entirely independent. The dataset provides face tracks in 224x224 resolution MP4 files, accompanied by synchronized 16-bit audio tracks and detailed textual transcripts that include word alignment boundaries.
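
To make this layout concrete, below is a minimal sketch of reading one clip, assuming one 224x224 MP4 face track per clip and a plain-text transcript whose per-word lines carry a token, a start time, and an end time in seconds. The file naming and transcript field layout here are illustrative assumptions, not details taken from the paper.

```python
import cv2  # OpenCV, assumed available (pip install opencv-python)

def load_clip(video_path, transcript_path):
    """Read one 224x224 face-track MP4 and its word-aligned transcript."""
    cap = cv2.VideoCapture(video_path)
    frames = []
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        frames.append(frame)  # each frame is a 224x224 BGR face crop
    cap.release()

    words = []
    with open(transcript_path) as f:
        for line in f:
            parts = line.split()
            # Assumed per-word layout: TOKEN START END (times in seconds).
            # Header or malformed lines are simply skipped.
            if len(parts) < 3:
                continue
            try:
                start, end = float(parts[1]), float(parts[2])
            except ValueError:
                continue
            words.append((parts[0], start, end))
    return frames, words
```

The returned frames and word timings can then feed a lip-reading or audio-visual model's data loader.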

Methodology

The authors employ a sophisticated pipeline to curate the dataset, incorporating several advanced techniques:

  1. Video Processing: An SSD-based CNN face detector locates faces in individual frames, and the resulting detections are grouped into face tracks within shot boundaries, which are determined by comparing color histograms across consecutive frames (a histogram-based sketch follows this list).
  2. Audio-Text Alignment: To ensure accurate speech-text correspondence, the pipeline starts from human-generated subtitles and refines them with P2FA to obtain word-level alignment boundaries; Kaldi-based ASR models provide an additional verification layer.
  3. Synchronization and Speaker Verification: Because the source material can suffer audio-video desynchronization, the authors use the two-stream SyncNet network to synchronize the streams (an offset-search sketch also follows this list). The same network supports speaker verification by matching lip movements to the audio, rejecting non-matching clips as voice-overs.
  4. Sentence Extraction: Videos are segmented into sentences or phrases at punctuation marks in the transcripts, keeping clip lengths manageable for model training (a simple splitting sketch is shown below).
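
As a concrete illustration of the shot-boundary step in item 1, the sketch below flags a likely cut wherever the color histograms of consecutive frames diverge. It is a minimal approximation assuming OpenCV; the histogram settings and the correlation threshold are hand-picked here and are not values specified by the paper.

```python
import cv2

def shot_boundaries(video_path, threshold=0.5):
    """Return frame indices where a new shot is likely to begin."""
    cap = cv2.VideoCapture(video_path)
    boundaries, prev_hist, idx = [], None, 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        hsv = cv2.cvtColor(frame, cv2.COLOR_BGR2HSV)
        hist = cv2.calcHist([hsv], [0, 1], None, [50, 60], [0, 180, 0, 256])
        cv2.normalize(hist, hist)
        if prev_hist is not None:
            # Correlation near 1 means similar frames; a sharp drop suggests a cut.
            similarity = cv2.compareHist(prev_hist, hist, cv2.HISTCMP_CORREL)
            if similarity < threshold:
                boundaries.append(idx)
        prev_hist, idx = hist, idx + 1
    cap.release()
    return boundaries
```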
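
Item 3's synchronization can be pictured as a search for the audio-video offset that maximizes agreement between embeddings produced by a two-stream network such as SyncNet. The sketch below only illustrates that offset search; `video_emb` and `audio_emb` are assumed to be precomputed, L2-normalised per-frame embeddings at a common frame rate, which is an assumption for illustration rather than the paper's exact procedure.

```python
import numpy as np

def best_av_offset(video_emb, audio_emb, max_shift=15):
    """video_emb, audio_emb: (T, D) L2-normalised arrays at the same frame rate."""
    best_shift, best_score = 0, -np.inf
    for shift in range(-max_shift, max_shift + 1):
        # Shift one stream relative to the other and compare overlapping frames.
        if shift >= 0:
            v, a = video_emb[shift:], audio_emb[:len(audio_emb) - shift]
        else:
            v, a = video_emb[:shift], audio_emb[-shift:]
        n = min(len(v), len(a))
        if n == 0:
            continue
        score = float(np.mean(np.sum(v[:n] * a[:n], axis=1)))  # mean cosine similarity
        if score > best_score:
            best_shift, best_score = shift, score
    return best_shift, best_score
```

A low best score under every candidate offset would indicate that the lips and the audio do not match, which is the basis for rejecting voice-over clips.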
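
Finally, item 4's sentence extraction can be approximated by splitting the aligned word list at punctuation while capping clip duration. The 6-second cap and the `(token, start, end)` tuple format are illustrative assumptions, not parameters reported in the paper.

```python
import re

def split_into_clips(words, max_duration=6.0):
    """words: list of (token, start_sec, end_sec) in spoken order."""
    clips, current = [], []
    for token, start, end in words:
        current.append((token, start, end))
        ends_sentence = bool(re.search(r"[.!?]$", token))   # sentence-final punctuation
        too_long = end - current[0][1] >= max_duration       # cap the clip length
        if ends_sentence or too_long:
            clips.append(current)
            current = []
    if current:
        clips.append(current)
    return clips
```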

Comparative Analysis

The LRS3-TED dataset stands out against existing datasets such as GRID, MODALITY, LRW, and LRS2-BBC, primarily because of its scale and comprehensive annotation. The detailed word-level alignment and broad vocabulary make it a versatile research resource. Importantly, the choice of TED content yields a heterogeneous speaker set, in stark contrast with datasets derived from TV programs featuring a recurring cast.

Implications and Future Prospects

The release of the LRS3-TED dataset has significant implications for the field of visual speech recognition. Above all, it provides a benchmark against which new models can be rigorously evaluated, and it opens avenues for advances in speech enhancement and broader audio-visual learning tasks. As models continue to evolve, the dataset is likely to spur improvements in the accuracy and robustness of lip-reading systems.

Future research may explore expanding the dataset with non-English TED content or incorporating additional modalities such as gestures or multiple camera angles, further enriching its utility. Additionally, the methods used to prepare LRS3-TED might be refined or adapted to generate similar resources from other video repositories.

The introduction of LRS3-TED marks a pivotal contribution to visual and audio-visual speech recognition, setting a high standard for forthcoming datasets. Researchers are now equipped with an expansive, versatile resource to foster innovations in lip reading, human-computer interaction, and related fields.
