Lip Reading Sentences in the Wild

Published 16 Nov 2016 in cs.CV | (1611.05358v2)

Abstract: The goal of this work is to recognise phrases and sentences being spoken by a talking face, with or without the audio. Unlike previous works that have focussed on recognising a limited number of words or phrases, we tackle lip reading as an open-world problem - unconstrained natural language sentences, and in the wild videos. Our key contributions are: (1) a 'Watch, Listen, Attend and Spell' (WLAS) network that learns to transcribe videos of mouth motion to characters; (2) a curriculum learning strategy to accelerate training and to reduce overfitting; (3) a 'Lip Reading Sentences' (LRS) dataset for visual speech recognition, consisting of over 100,000 natural sentences from British television. The WLAS model trained on the LRS dataset surpasses the performance of all previous work on standard lip reading benchmark datasets, often by a significant margin. This lip reading performance beats a professional lip reader on videos from BBC television, and we also demonstrate that visual information helps to improve speech recognition performance even when the audio is available.


Authors (4)

Citations (739)


Summary

The paper introduces the WLAS network, a dual attention model that transcribes videos of mouth motion to characters, using visual input, audio, or both.
A curriculum learning strategy speeds up training and reduces overfitting by starting with short sequences and progressively lengthening them.
The LRS dataset, with over 100,000 natural sentences from British television, supports training at scale; the resulting lip reading model outperforms a professional lip reader on videos from BBC television.

Lip Reading Sentences in the Wild: A Summary

The paper "Lip Reading Sentences in the Wild" by Joon Son Chung et al. introduces an approach to visual speech recognition that recognises phrases and sentences spoken by a talking face, with or without the audio. The authors treat lip reading as an open-world problem, covering unconstrained natural language sentences in "in the wild" videos, a significant departure from prior research that focused on recognising a limited number of words or phrases.

Key Contributions

Watch, Listen, Attend and Spell (WLAS) Network: The core innovation is the WLAS network, which transcribes mouth motion in videos to characters. This model employs a dual attention mechanism, enabling it to process both visual and auditory inputs independently or concurrently, thereby improving transcription accuracy.
Curriculum Learning Strategy: To make training on long temporal sequences tractable, the authors propose a curriculum learning strategy. Training starts with simpler tasks (short sequences) and progressively increases sequence length, which accelerates training and reduces overfitting.
Lip Reading Sentences (LRS) Dataset: The introduction of the LRS dataset is a substantial advancement for the field. It comprises over 100,000 sentences extracted from British television broadcasts, providing a rich and diverse source for training and evaluation. This dataset is freely available, fostering further research in visual speech recognition.
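
The first two contributions can be illustrated with a minimal sketch. It is not the authors' implementation: the dot-product scoring, the zero vector used for a missing modality, and the schedule parameters (`start`, `grow_every`, `cap`) are all assumptions chosen for brevity. `dual_context` shows the idea of one attention context per modality, and `crop_to_curriculum` shows training text that grows longer as training proceeds.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(query, states):
    """Dot-product attention (an assumed scoring function):
    returns the attention-weighted average of `states`."""
    scores = [sum(q * s for q, s in zip(query, state)) for state in states]
    weights = softmax(scores)
    dim = len(states[0])
    return [sum(w * state[d] for w, state in zip(weights, states))
            for d in range(dim)]


def dual_context(query, video_states=None, audio_states=None):
    """One context vector per modality, concatenated.
    A missing modality contributes zeros (an assumption of this sketch)."""
    dim = len(query)
    video_ctx = attend(query, video_states) if video_states else [0.0] * dim
    audio_ctx = attend(query, audio_states) if audio_states else [0.0] * dim
    return video_ctx + audio_ctx


def curriculum_max_words(epoch, start=1, grow_every=2, cap=20):
    """Hypothetical schedule: allow one more word every `grow_every` epochs."""
    return min(cap, start + epoch // grow_every)


def crop_to_curriculum(sentence, epoch):
    """Keep only as many leading words as the schedule allows at this epoch."""
    return " ".join(sentence.split()[:curriculum_max_words(epoch)])
```

Calling `dual_context(query, video_states=frames)` without audio mimics the lips-only setting, while passing both lists mimics the audio-visual setting.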

Numerical Results and Claims

The experimental results show that the WLAS model outperforms previous work on standard lip reading benchmarks. Specifically:

The character error rate (CER) of the WLAS model on the LRS dataset is 39.5% using visual input alone; on videos from BBC television, this lip reading performance beats a professional lip reader.
When both visual and auditory inputs are used, the WLAS model achieves a CER of 7.9% with clean audio.
The study also finds that visual cues improve speech recognition even when audio is available: the word error rate (WER) falls from 17.7% (audio only) to 13.9% (audio-visual) with clean audio, and the visual stream also helps when the audio is noisy.
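
For readers unfamiliar with the metrics, the following is a minimal sketch of how CER and WER are commonly computed: the edit distance (substitutions, deletions, insertions) between hypothesis and reference, divided by the reference length. This is the standard definition rather than the authors' evaluation script, and the function names are illustrative.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (lists of chars or words)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def cer(reference, hypothesis):
    """Character error rate: character-level edits / reference length."""
    return edit_distance(list(reference), list(hypothesis)) / len(reference)


def wer(reference, hypothesis):
    """Word error rate: word-level edits / number of reference words."""
    ref_words = reference.split()
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


def relative_reduction(before, after):
    """Fractional improvement when an error rate drops from `before` to `after`."""
    return (before - after) / before
```

As a usage example, `relative_reduction(17.7, 13.9)` is roughly 0.21, so the WER figures quoted above correspond to about a one-fifth relative reduction in word errors.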

Implications and Future Directions

This research holds significant implications for both theoretical and practical applications in AI and computer vision:

Enhancing Automated Speech Recognition (ASR): Integrating visual information improves ASR performance, particularly in noisy settings where the audio signal is degraded. One example is in-car voice interfaces, where commands must be recognised over road and engine noise.
Applications in Accessibility: Automated lip reading can substantially benefit the deaf and hard-of-hearing community by providing more accurate real-time subtitles in video communication tools and aiding in the understanding of spoken content without relying on auditory input.
Potential in Silent Film Restoration: The capability to transcribe silent videos can be utilized in restoring and dubbing archival silent films, preserving cultural heritage.

Future Developments

Future work on lip reading could explore several avenues:

Monotonic Attention Mechanisms: Speech unfolds left to right, so constraining the attention to move monotonically through the input sequence could improve the alignment between input frames and output characters.
Online Decoding Models: Adapting the WLAS architecture to process and decode sequences in real-time, rather than in batch mode, can enable on-the-fly transcription in live broadcasts and real-time communication.
Rich Multimodal Datasets: Expanding diverse and large-scale datasets that encapsulate a variety of speaking conditions, languages, and dialects could further improve the robustness and generalizability of lip reading models.
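
To make the monotonic attention idea concrete, here is a minimal sketch of one simple way to enforce it: a hard window that only lets the decoder attend at or after the previously attended frame. This is an illustrative technique, not something taken from the paper; the function name and the `window` parameter are assumptions.

```python
import math


def monotonic_attention(scores, prev_pos, window=3):
    """Softmax over `scores` restricted to frames in
    [prev_pos, prev_pos + window); all other frames get zero weight.
    Returns the weights and the new attended position (argmax)."""
    assert 0 <= prev_pos < len(scores)
    lo = prev_pos
    hi = min(len(scores), prev_pos + window)
    m = max(scores[lo:hi])  # subtract the max for numerical stability
    exps = [math.exp(s - m) if lo <= i < hi else 0.0
            for i, s in enumerate(scores)]
    total = sum(exps)
    weights = [e / total for e in exps]
    new_pos = max(range(len(weights)), key=weights.__getitem__)
    return weights, new_pos
```

Because the window never starts before `prev_pos`, feeding each step's `new_pos` into the next call guarantees that the attended position never moves backwards. The same property is what would let a decoder emit characters before the full input has been seen, which connects to the online decoding point above.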

In summary, the paper advances visual speech recognition through a dual-stream neural network with attention over both video and audio, and through a large-scale, naturalistic dataset. Together these set the stage for more robust ASR systems and for new applications that combine auditory and visual processing.
