Vocoder-Based Speech Synthesis from Silent Videos

Published 6 Apr 2020 in eess.AS, cs.CV, and cs.LG | (2004.02541v2)

Abstract: Both acoustic and visual information influence human perception of speech. For this reason, the lack of audio in a video sequence determines an extremely low speech intelligibility for untrained lip readers. In this paper, we present a way to synthesise speech from the silent video of a talker using deep learning. The system learns a mapping function from raw video frames to acoustic features and reconstructs the speech with a vocoder synthesis algorithm. To improve speech reconstruction performance, our model is also trained to predict text information in a multi-task learning fashion and it is able to simultaneously reconstruct and recognise speech in real time. The results in terms of estimated speech quality and intelligibility show the effectiveness of our method, which exhibits an improvement over existing video-to-speech approaches.

Summary

The paper presents vid2voc, a novel framework that directly synthesizes speech from video frames without relying on intermediate text representations.
It employs a neural network architecture to estimate key acoustic features and uses the WORLD vocoder to generate real-time, high-quality speech.
Experiments on the GRID dataset show higher PESQ and ESTOI scores than existing video-to-speech approaches in both speaker-dependent and speaker-independent scenarios.

The paper "Vocoder-Based Speech Synthesis from Silent Videos" presents a novel approach to reconstructing speech from silent video recordings using a deep learning framework. The study builds on the correlation between acoustic and visual speech cues to address the challenge of generating speech automatically from video-only input. Such a capability has practical applications in noise-dominated environments and in devices like hearing aids.

Methodology and Approach

The proposed system, dubbed "vid2voc," synthesizes speech directly from video frames without relying on an intermediate text representation. This distinguishes it from the traditional two-step pipeline of Visual Speech Recognition (VSR) followed by Text-to-Speech synthesis (TTS). A trained neural network estimates the acoustic features that the vocoder needs (the spectral envelope, the fundamental frequency, and the aperiodicity parameters), and the WORLD vocoder, a high-quality synthesis system suitable for real-time applications, then converts these features into an audible waveform.
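A video stream and a vocoder feature stream run at different rates, so a single video frame has to account for several acoustic frames. The following is a minimal sketch of that bookkeeping, not the paper's implementation. It assumes 25 fps video (the GRID frame rate) and a 5 ms vocoder frame period (the WORLD default); the function names and the `VocoderFeatures` container are hypothetical and introduced only for illustration.

```python
from dataclasses import dataclass
from typing import List


def acoustic_frames_per_video_frame(video_fps: float = 25.0,
                                    frame_period_ms: float = 5.0) -> int:
    """Number of vocoder frames spanned by one video frame (assumed rates)."""
    ratio = (1000.0 / video_fps) / frame_period_ms
    if abs(ratio - round(ratio)) > 1e-9:
        raise ValueError("video frame duration is not a multiple of the frame period")
    return round(ratio)


def expected_acoustic_frames(num_video_frames: int,
                             video_fps: float = 25.0,
                             frame_period_ms: float = 5.0) -> int:
    """Total vocoder frames needed to cover a clip of num_video_frames."""
    per_frame = acoustic_frames_per_video_frame(video_fps, frame_period_ms)
    return num_video_frames * per_frame


@dataclass
class VocoderFeatures:
    """Hypothetical container for the three streams a WORLD-style vocoder consumes."""
    f0: List[float]                       # fundamental frequency in Hz, 0.0 when unvoiced
    spectral_envelope: List[List[float]]  # one spectrum per acoustic frame
    aperiodicity: List[List[float]]       # one aperiodicity vector per acoustic frame

    def num_frames(self) -> int:
        """Return the frame count, checking that all streams are aligned."""
        n = len(self.f0)
        if len(self.spectral_envelope) != n or len(self.aperiodicity) != n:
            raise ValueError("all three feature streams must have the same number of frames")
        return n
```

Under these assumed rates, each video frame corresponds to 8 acoustic frames, so a 3-second clip of 75 video frames would require 600 frames of each acoustic feature before synthesis.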

The architecture comprises a video encoder, a recursive temporal module, and multiple decoders, each dedicated to a different acoustic parameter. An additional VSR decoder predicts text from the same video representation, so the model is trained in a multi-task fashion. The aim is for the recognition task to improve the shared representation and, through it, the speech reconstruction.
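One common way to train several decoders on a shared encoder is to sum their losses, with a weight on the auxiliary task. The sketch below illustrates that idea only. The equal weighting of the reconstruction terms, the `vsr_weight` parameter, and the `DecoderLosses` container are assumptions made for this example; this summary does not state the loss the paper actually uses.

```python
from dataclasses import dataclass
from typing import Sequence


def mse(pred: Sequence[float], target: Sequence[float]) -> float:
    """Mean squared error between two equal-length sequences."""
    if len(pred) != len(target) or not pred:
        raise ValueError("sequences must be non-empty and of equal length")
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)


@dataclass
class DecoderLosses:
    """Per-decoder loss values for one training batch (hypothetical container)."""
    spectral_envelope: float
    f0: float
    aperiodicity: float
    vsr: float  # text-prediction loss from the auxiliary VSR decoder


def multitask_loss(losses: DecoderLosses, vsr_weight: float = 0.5) -> float:
    """Reconstruction losses plus a weighted auxiliary VSR loss (assumed form)."""
    reconstruction = losses.spectral_envelope + losses.f0 + losses.aperiodicity
    return reconstruction + vsr_weight * losses.vsr
```

Setting `vsr_weight` to zero recovers a reconstruction-only objective, which gives a simple way to measure how much the auxiliary task contributes.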

Experimental Setup and Results

The system was evaluated using the GRID audio-visual dataset under speaker-dependent and speaker-independent scenarios. The results were benchmarked against existing methods, such as those employing Generative Adversarial Networks (GANs) for video-driven speech reconstruction.

The evaluation used two objective metrics: Perceptual Evaluation of Speech Quality (PESQ) for estimated speech quality and Extended Short-Time Objective Intelligibility (ESTOI) for estimated intelligibility. The vid2voc approach scored higher than the baselines on both metrics in both scenarios, with the clearest margin in the speaker-dependent setting. Adding the multi-task VSR decoder further improved speech reconstruction quality.

Discussion and Implications

The results highlight the potential of mapping video directly to audio. Because no intermediate text representation is involved, information that visual cues carry beyond the words themselves (e.g., emotion and prosody) need not be discarded. This could substantially benefit real-time applications where both processing speed and the retention of such information are critical.

Going forward, research could refine the multi-task learning approach to better balance speech reconstruction against the VSR task, and develop more general models that handle speaker variability effectively. More sophisticated decoding schemes, such as beam search, could improve VSR accuracy. Expanding the training data to cover more diverse environmental conditions will be important for making the system robust and applicable in real-world scenarios.
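To show why beam search can outperform simple best-path decoding, the sketch below compares the two on a tiny example. It assumes the VSR decoder emits per-frame log-probabilities over a vocabulary with a CTC-style blank symbol; that output format is an assumption, since this summary does not describe the decoder's output layer. The function names are hypothetical.

```python
import math
from collections import defaultdict

NEG_INF = float("-inf")


def logsumexp(*xs):
    """Numerically stable log(sum(exp(x))) that tolerates -inf inputs."""
    m = max(xs)
    if m == NEG_INF:
        return NEG_INF
    return m + math.log(sum(math.exp(x - m) for x in xs))


def ctc_greedy(log_probs, blank=0):
    """Best-path decoding: argmax per frame, collapse repeats, drop blanks."""
    out, prev = [], None
    for step in log_probs:
        sym = max(range(len(step)), key=step.__getitem__)
        if sym != blank and sym != prev:
            out.append(sym)
        prev = sym
    return out


def ctc_beam_search(log_probs, beam_width=3, blank=0):
    """CTC prefix beam search without a language model."""
    # prefix -> (log prob of paths ending in blank, ending in non-blank)
    beams = {(): (0.0, NEG_INF)}
    for step in log_probs:
        nxt = defaultdict(lambda: (NEG_INF, NEG_INF))
        for prefix, (p_b, p_nb) in beams.items():
            for sym, lp in enumerate(step):
                if sym == blank:
                    # A blank keeps the prefix unchanged.
                    b, nb = nxt[prefix]
                    nxt[prefix] = (logsumexp(b, p_b + lp, p_nb + lp), nb)
                    continue
                new_prefix = prefix + (sym,)
                b, nb = nxt[new_prefix]
                if prefix and prefix[-1] == sym:
                    # A repeated symbol extends the prefix only after a blank...
                    nxt[new_prefix] = (b, logsumexp(nb, p_b + lp))
                    # ...otherwise it collapses into the existing prefix.
                    sb, snb = nxt[prefix]
                    nxt[prefix] = (sb, logsumexp(snb, p_nb + lp))
                else:
                    nxt[new_prefix] = (b, logsumexp(nb, p_b + lp, p_nb + lp))
        ranked = sorted(nxt.items(), key=lambda kv: logsumexp(*kv[1]), reverse=True)
        beams = dict(ranked[:beam_width])
    best = max(beams.items(), key=lambda kv: logsumexp(*kv[1]))
    return list(best[0])


# Two frames, vocabulary {0: blank, 1: "a"}, with P(blank)=0.6 and P(a)=0.4 per frame.
example = [[math.log(0.6), math.log(0.4)]] * 2
```

On `example`, best-path decoding returns an empty sequence, because the single most likely path is blank-blank (probability 0.36). Beam search returns `[1]`, because the three paths that collapse to "a" together have probability 0.64.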

The study emphasizes the value of cross-modal signals, here the visual cues that accompany speech, and may inform the design of more advanced multimodal communication systems.
