Towards Accurate Lip-to-Speech Synthesis in-the-Wild

(2403.01087)
Published Mar 2, 2024 in cs.MM, cs.CV, cs.SD, and eess.AS

Abstract

In this paper, we introduce a novel approach to address the task of synthesizing speech from silent videos of any in-the-wild speaker solely based on lip movements. The traditional approach of directly generating speech from lip videos faces the challenge of not being able to learn a robust language model from speech alone, resulting in unsatisfactory outcomes. To overcome this issue, we propose incorporating noisy text supervision using a state-of-the-art lip-to-text network that instills language information into our model. The noisy text is generated using a pre-trained lip-to-text model, enabling our approach to work without text annotations during inference. We design a visual text-to-speech network that utilizes the visual stream to generate accurate speech, which is in-sync with the silent input video. We perform extensive experiments and ablation studies, demonstrating our approach's superiority over the current state-of-the-art methods on various benchmark datasets. Further, we demonstrate an essential practical application of our method in assistive technology by generating speech for an ALS patient who has lost their voice but can still make mouth movements. Our demo video, code, and additional details can be found at http://cvit.iiit.ac.in/research/projects/cvit-projects/ms-l2s-itw.

Overview

  • This paper presents a novel approach to lip-to-speech (L2S) generation, using text supervision from pre-trained lip-to-text (L2T) models to improve speech synthesis from silent video inputs.

  • It introduces a new visual text-to-speech (TTS) model that significantly outperforms current methods in lip-to-speech synthesis, particularly in challenging, diverse environments.

  • The proposed method has been successfully applied to generate speech for a patient with Amyotrophic Lateral Sclerosis (ALS), demonstrating its potential in assistive technologies.

  • Future research directions include extending the approach to multiple languages and further refining the visual-TTS model to enhance the accuracy and naturalness of the generated speech.

Toward More Accurate Lip-to-Speech Synthesis for In-the-Wild Scenarios

Introduction

The synthesis of speech from silent videos based solely on lip movements defines the domain of lip-to-speech (L2S) generation, which is distinct from the more widely explored area of lip-to-text (L2T) conversion. While L2T focuses on generating textual transcriptions from silent videos, L2S aims to produce intelligible, natural speech that aligns closely with the visible lip movements of speakers in diverse settings. This paper presents a novel approach to L2S that outperforms existing methods by incorporating text supervision through a pre-trained L2T model, thereby infusing the model with essential language information.

Key Contributions

The research introduces several significant contributions to the field of lip-to-speech synthesis:

  • Challenging Current Lip-to-Speech Approaches: It addresses the limitation of existing L2S models, which struggle to learn language attributes from speech supervision alone, by using noisy text predictions from a pre-trained L2T model (a sketch of this supervision step follows this list).
  • Visual Text-to-Speech Model: The proposal includes a novel visual text-to-speech (TTS) network that synthesizes speech to match silent video inputs, significantly outperforming current methods in both qualitative and quantitative evaluations.
  • Empowering ALS Patients: Demonstrating a critical practical application, the method was used to generate speech for a patient with Amyotrophic Lateral Sclerosis (ALS), showcasing its potential in assistive technologies.
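
To make the first contribution concrete, here is a minimal, hypothetical sketch of how noisy text supervision could be produced: a frozen, pre-trained L2T model transcribes each silent training clip, and the resulting pseudo-transcript is stored alongside the video so the L2S model can be trained with text supervision without ground-truth annotations. The names used here (SupervisedClip, build_noisy_text_supervision, transcribe) are illustrative assumptions, not the authors' actual API.

```python
# Hypothetical sketch of noisy text supervision for lip-to-speech training.
# A frozen, pre-trained lip-to-text (L2T) model transcribes silent lip-crop
# clips; the pseudo-transcripts then serve as (noisy) text targets.
from dataclasses import dataclass
from typing import List

import torch


@dataclass
class SupervisedClip:
    video_path: str
    noisy_text: str  # pseudo-transcript predicted by the L2T model


@torch.no_grad()
def build_noisy_text_supervision(pretrained_l2t: torch.nn.Module,
                                 lip_crops: List[torch.Tensor],
                                 video_paths: List[str]) -> List[SupervisedClip]:
    """Run a frozen L2T model over lip-crop tensors to obtain pseudo-transcripts."""
    pretrained_l2t.eval()
    supervised = []
    for path, clip in zip(video_paths, lip_crops):
        # clip: (T, C, H, W) mouth-region frames; add a batch dimension.
        # `transcribe` is an assumed convenience method on the L2T model.
        text = pretrained_l2t.transcribe(clip.unsqueeze(0))
        supervised.append(SupervisedClip(video_path=path, noisy_text=text))
    return supervised
```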

Methodological Innovations and Experimental Findings

Approach Overview

The paper's approach integrates noisy text predictions, derived from a state-of-the-art L2T model, with visual features from the silent video to generate speech that is accurately synchronized with the lip movements. This addresses the speech synthesis challenge from two angles: understanding the content to be spoken (through L2T) and determining the appropriate speaking style and timing (through a visual-TTS model conditioned on lip movements and text). A sketch of such a model appears below.
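
The following is a minimal sketch, assuming a Transformer-style architecture, of how a visual TTS might fuse the noisy transcript with per-frame lip features so that the decoded mel-spectrogram follows both the text content and the timing of the silent video. The module layout and dimensions are assumptions for illustration, not the paper's exact network.

```python
# Minimal, assumed sketch of a visual-TTS forward pass: text embeddings from
# the noisy transcript are fused with per-frame lip features via cross-attention,
# and mel-spectrogram frames are predicted along the video time axis.
import torch
import torch.nn as nn


class VisualTTS(nn.Module):
    def __init__(self, vocab_size: int = 40, d_model: int = 256, n_mels: int = 80):
        super().__init__()
        self.text_emb = nn.Embedding(vocab_size, d_model)
        enc_layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
        self.text_enc = nn.TransformerEncoder(enc_layer, num_layers=2)
        self.video_proj = nn.Linear(512, d_model)  # assumes 512-d lip features per frame
        self.cross_attn = nn.MultiheadAttention(d_model, num_heads=4, batch_first=True)
        self.mel_head = nn.Linear(d_model, n_mels)

    def forward(self, tokens: torch.Tensor, lip_feats: torch.Tensor) -> torch.Tensor:
        # tokens: (B, L) ids of the noisy transcript; lip_feats: (B, T, 512).
        text = self.text_enc(self.text_emb(tokens))   # (B, L, d_model)
        vid = self.video_proj(lip_feats)              # (B, T, d_model)
        # Lip frames query the text encoding, so the fused sequence (and the
        # predicted mel frames) stays aligned with the video time axis.
        fused, _ = self.cross_attn(query=vid, key=text, value=text)
        return self.mel_head(fused)                   # (B, T, n_mels)


# Usage sketch: a batch of 2 clips with 30 text tokens and 75 video frames each.
model = VisualTTS()
mel = model(torch.randint(0, 40, (2, 30)), torch.randn(2, 75, 512))
```

In this simplified view the video frames act as attention queries over the text encoding, so the output mel frames are indexed by video time and remain in sync with the lip movements; a complete system would additionally upsample the video features to the mel frame rate and attach a vocoder to produce the waveform.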

Superior Performance on Benchmarks

Extensive experiments across various datasets revealed that the proposed approach significantly improves upon the existing state-of-the-art methods in L2S. Especially notable is its performance in "in-the-wild" scenarios, which involve diverse speakers, lighting conditions, and backgrounds.

Theoretical and Practical Implications

The findings have broad implications, both theoretically and practically. Theoretically, this work elucidates the importance of incorporating language information via noisy text predictions for enhancing L2S systems' accuracy. Practically, it demonstrates the feasibility of providing a voice to individuals unable to speak due to medical conditions, thereby significantly impacting assistive technology fields.

Future Directions

The paper outlines future research directions, emphasizing extension to multiple languages and further refinement of the visual-TTS model for greater accuracy and naturalness of the generated speech. Reducing the reliance on text annotations, perhaps through advances in self-supervised learning, is also highlighted as a promising area for continued exploration.

Conclusion

This research sets a new benchmark for lip-to-speech synthesis, especially in unconstrained, multi-speaker scenarios. By leveraging a pre-trained lip-to-text model for language supervision alongside visual features extracted from the silent video, the proposed method generates speech that is markedly more accurate and natural than that of prior approaches. The demonstrated application for an ALS patient attests to the method's practical utility and its potential to benefit individuals with speech impairments. This work both advances the state of L2S research and opens avenues for its application in user-centric and assistive technologies.
