Recurrent Neural Network Transducer for Audio-Visual Speech Recognition (1911.04890v1)

Published 8 Nov 2019 in eess.AS, cs.CL, cs.CV, cs.LG, and cs.SD

Abstract: This work presents a large-scale audio-visual speech recognition system based on a recurrent neural network transducer (RNN-T) architecture. To support the development of such a system, we built a large audio-visual (A/V) dataset of segmented utterances extracted from public YouTube videos, yielding 31k hours of audio-visual training content. The performance of audio-only, visual-only, and audio-visual systems is compared on two large-vocabulary test sets: a set of utterance segments from public YouTube videos called YTDEV18 and the publicly available LRS3-TED set. To highlight the contribution of the visual modality, we also evaluated the performance of our system on the YTDEV18 set artificially corrupted with background noise and overlapping speech. To the best of our knowledge, our system significantly improves the state-of-the-art on the LRS3-TED set.
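The abstract describes feeding synchronized audio and visual streams into a single RNN-T model. A common way to combine such streams before an encoder is per-frame feature concatenation; the sketch below illustrates that idea in plain Python. It is a minimal illustration under that assumption, not the paper's actual method, and the function name is hypothetical.

```python
# Hypothetical sketch of early audio-visual fusion ahead of an
# RNN-T-style encoder. Names and shapes are illustrative only,
# not taken from the paper.

def fuse_av_frames(audio_frames, visual_frames):
    """Concatenate per-frame audio and visual feature vectors.

    Assumes both streams are already synchronized to the same
    frame rate, so frame i of each stream covers the same instant.
    """
    if len(audio_frames) != len(visual_frames):
        raise ValueError("streams must be frame-synchronized")
    return [a + v for a, v in zip(audio_frames, visual_frames)]

# Two frames of 2-dim audio features and 1-dim visual features.
audio = [[0.1, 0.2], [0.3, 0.4]]
visual = [[1.0], [2.0]]
fused = fuse_av_frames(audio, visual)
print(fused)  # [[0.1, 0.2, 1.0], [0.3, 0.4, 2.0]]
```

The fused frame sequence would then be consumed by the transducer's encoder network; modality dropout or attention-based fusion are common alternatives to plain concatenation.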

Authors (7)
  1. Takaki Makino (7 papers)
  2. Hank Liao (13 papers)
  3. Yannis Assael (11 papers)
  4. Brendan Shillingford (16 papers)
  5. Basilio Garcia (1 paper)
  6. Otavio Braga (8 papers)
  7. Olivier Siohan (13 papers)
Citations (118)
