Attention-based Audio-Visual Fusion for Robust Automatic Speech Recognition

Published 5 Sep 2018 in eess.AS, cs.LG, cs.SD, eess.IV, and stat.ML | (1809.01728v3)

Abstract: Automatic speech recognition can potentially benefit from the lip motion patterns, complementing acoustic speech to improve the overall recognition performance, particularly in noise. In this paper we propose an audio-visual fusion strategy that goes beyond simple feature concatenation and learns to automatically align the two modalities, leading to enhanced representations which increase the recognition accuracy in both clean and noisy conditions. We test our strategy on the TCD-TIMIT and LRS2 datasets, designed for large vocabulary continuous speech recognition, applying three types of noise at different power ratios. We also exploit state of the art Sequence-to-Sequence architectures, showing that our method can be easily integrated. Results show relative improvements from 7% up to 30% on TCD-TIMIT over the acoustic modality alone, depending on the acoustic noise level. We anticipate that the fusion strategy can easily generalise to many other multimodal tasks which involve correlated modalities. Code available online on GitHub: https://github.com/georgesterpu/Sigmedia-AVSR

Summary

The paper introduces an attention-based fusion method that enhances ASR by aligning audio and visual cues.
It uses residual-connected CNNs to extract lip features and RNN encoders to align them with the acoustic input.
Empirical results show relative CER improvements of 7% to 30% over audio-only recognition on TCD-TIMIT, depending on the noise level.

The paper "Attention-based Audio-Visual Fusion for Robust Automatic Speech Recognition" by George Sterpu, Christian Saam, and Naomi Harte presents a strategy for improving automatic speech recognition (ASR) by fusing the audio and visual modalities. An attention mechanism aligns the two streams, which improves recognition in both clean and noisy conditions.

Core Contributions and Methodology

The researchers address two primary challenges in Audio-Visual Speech Recognition (AVSR): determining suitable visual features for Large Vocabulary Continuous Speech Recognition (LVCSR), and developing a fusion strategy that can synchronize modalities operating at different frame rates. The authors propose an audio-visual fusion method built on Recurrent Neural Networks (RNNs) and sequence-to-sequence (Seq2seq) architectures with attention, which enriches the audio-based representations with visual information. Instead of simply concatenating features, the model learns an alignment between the two correlated modalities at every time step.
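For contrast, the simple feature concatenation baseline mentioned above can be sketched as follows. This is a minimal illustration and not the authors' code: the frame counts, feature dimensions, and the nearest-frame repetition used to match frame rates are all assumptions chosen for the example, and `concat_fusion` is a hypothetical helper name.

```python
import numpy as np

def concat_fusion(audio_feats, video_feats):
    """Early fusion by concatenation.

    Video frames are repeated so that every audio frame is paired with the
    video frame covering the same relative position in the utterance.
    """
    n_audio = audio_feats.shape[0]
    n_video = video_feats.shape[0]
    # Map each audio frame index to a video frame index (integer arithmetic).
    idx = np.arange(n_audio) * n_video // n_audio
    return np.concatenate([audio_feats, video_feats[idx]], axis=1)

rng = np.random.default_rng(0)
audio = rng.normal(size=(100, 40))  # hypothetical: 100 audio frames, 40-dim features
video = rng.normal(size=(30, 64))   # hypothetical: 30 video frames, 64-dim features
fused = concat_fusion(audio, video)
print(fused.shape)  # (100, 104)
```

The pairing here is fixed by the frame rates alone, so the model has no way to correct for audio-visual asynchrony. That limitation is what a learned alignment is meant to address.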

Key elements of their audio-visual fusion strategy include:

Integration of Visual Modality: Use of residual-connected Convolutional Neural Networks (CNNs) to extract high-level visual features from the lip region of face images, which are then synchronized with the acoustic inputs using RNN encoders.
Attention Mechanisms: Attention is used not only in decoding but also in encoding. The acoustic encoder attends to the visual encoder's representations, so the features passed to the decoder are already fused and the decoder does not have to learn the cross-modal correlation itself.
Implementation and Testing: The approach is validated using two prominent datasets, TCD-TIMIT and LRS2, which offer varying complexity in terms of vocabulary and recording conditions. Experimental results support the hypothesis that their fusion strategy offers significant improvements, especially in the presence of noise.
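The encoder-side attention described in the list above can be illustrated with a small NumPy sketch. It is an assumption-laden simplification rather than the authors' implementation: random arrays stand in for the RNN encoder outputs, scaled dot-product scoring is swapped in for whatever scoring function the paper uses, and `attend_audio_to_video` and `w_query` are hypothetical names.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the max for numerical stability before exponentiating.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attend_audio_to_video(audio_states, video_states, w_query):
    """Each audio state queries all video states and receives a visual context."""
    queries = audio_states @ w_query                    # (T_audio, d_video)
    scale = np.sqrt(video_states.shape[1])
    scores = queries @ video_states.T / scale           # (T_audio, T_video)
    weights = softmax(scores, axis=1)                   # soft alignment per audio step
    context = weights @ video_states                    # (T_audio, d_video)
    # Enrich each audio state with its aligned visual context.
    fused = np.concatenate([audio_states, context], axis=1)
    return fused, weights

rng = np.random.default_rng(0)
audio_states = rng.normal(size=(100, 32))  # stand-in for acoustic encoder outputs
video_states = rng.normal(size=(30, 16))   # stand-in for visual encoder outputs
w_query = rng.normal(size=(32, 16))        # hypothetical learned projection
fused, weights = attend_audio_to_video(audio_states, video_states, w_query)
print(fused.shape, weights.shape)  # (100, 48) (100, 30)
```

Because the alignment weights are computed per audio step, the two sequences can have different lengths and no resampling to a common frame rate is required.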

Empirical Results and Practical Implications

The experimental results show that the proposed method is most useful in noisy conditions. The researchers report relative improvements in Character Error Rate (CER) from 7% up to 30% over the acoustic-only system on the TCD-TIMIT dataset, depending on the noise level. The size of this gain shows the value of the visual modality when noise significantly degrades the audio. The system also remains robust across the noise types tested, such as white, café, and street noise.
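To make the metric concrete, the sketch below computes CER as the character-level edit distance divided by the reference length, and the relative improvement as the fraction of baseline error that is removed. The CER values in the usage line are hypothetical and are not taken from the paper; the function names are illustrative.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two strings (insert, delete, substitute)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]

def cer(ref, hyp):
    """Character Error Rate: edit distance normalised by reference length."""
    return edit_distance(ref, hyp) / len(ref)

def relative_improvement(baseline_cer, new_cer):
    """Fraction of the baseline error removed by the new system."""
    return (baseline_cer - new_cer) / baseline_cer

# Hypothetical CERs: audio-only 20%, audio-visual 14%.
print(f"{relative_improvement(0.20, 0.14):.0%}")  # 30%
```

Note that a relative improvement of 30% is not a drop of 30 percentage points: in this hypothetical case the absolute CER falls by 6 points.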

However, the improved robustness seen on TCD-TIMIT was not observed on the LRS2 dataset. The authors attribute this to potential limitations in the visual front-end's ability to handle the more diverse and challenging video footage in LRS2.

Theoretical and Future Directions

This research expands the theoretical understanding of multimodal fusion in speech recognition tasks, emphasizing the potential of attention mechanisms in learning intricate modality alignments. It challenges the classical paradigm of simple feature concatenation by demonstrating a model that actively learns synchronization patterns between different data sources.

Future research could focus on refining visual feature extraction and on attention-based fusion strategies that dynamically assess the reliability of each modality under varying conditions. Extending the framework to other multimodal tasks would also test how well attention-driven fusion generalizes.

Building on these findings, future work could yield ASR systems that tolerate greater variability in the user's environment, extending the practical reach of audio-visual recognition. The authors anticipate that the fusion strategy can generalise to other multimodal tasks involving correlated modalities, where one modality can compensate for or enhance the other.
