Weakly Supervised Construction of ASR Systems with Massive Video Data (2008.01300v2)

Published 4 Aug 2020 in eess.AS, cs.CL, cs.LG, and cs.SD

Abstract: Building Automatic Speech Recognition (ASR) systems from scratch is challenging, mostly because annotating a large amount of audio data with transcripts is time-consuming and expensive. Although several unsupervised pre-training models have been proposed, applying such models directly might still be sub-optimal if more labeled training data could be obtained at low cost. In this paper, we present a weakly supervised framework for constructing ASR systems with massive video data. As videos often contain human speech aligned with subtitles, we consider videos an important knowledge source and propose an effective approach, based on Optical Character Recognition (OCR), to extract high-quality audio aligned with transcripts from videos. After weakly supervised pre-training, the underlying ASR model can be fine-tuned to fit any domain-specific target training dataset. Extensive experiments show that our framework can easily produce state-of-the-art results on six public datasets for Mandarin speech recognition.
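
To make the idea of subtitle-aligned audio concrete, here is a minimal sketch of one way per-frame OCR output could be turned into timed transcript segments. It is not the paper's pipeline: the abstract does not describe the extraction steps, so the `Segment` type, the `group_subtitle_frames` function, the `frame_interval` and `min_duration` parameters, and the assumption that OCR has already produced one text string per sampled frame are all hypothetical.

```python
from dataclasses import dataclass
from typing import Iterable, List, Optional, Tuple


@dataclass
class Segment:
    start: float  # segment start time in seconds
    end: float    # segment end time in seconds
    text: str     # subtitle text, used here as a weak transcript label


def group_subtitle_frames(
    frames: Iterable[Tuple[float, str]],
    frame_interval: float,
    min_duration: float = 0.5,
) -> List[Segment]:
    """Merge consecutive frames that show the same subtitle into one segment.

    `frames` holds (timestamp_in_seconds, ocr_text) pairs sorted by time.
    An empty string means no subtitle was detected in that frame.
    This is an illustrative assumption, not the method from the paper.
    """
    segments: List[Segment] = []
    current: Optional[Segment] = None

    for timestamp, raw_text in frames:
        text = raw_text.strip()

        # Same subtitle as the open segment: extend it to cover this frame.
        if current is not None and text == current.text:
            current.end = timestamp + frame_interval
            continue

        # Subtitle changed or disappeared: close the open segment if long enough.
        if current is not None and current.end - current.start >= min_duration:
            segments.append(current)

        # Open a new segment only when the frame actually shows text.
        current = Segment(timestamp, timestamp + frame_interval, text) if text else None

    # Flush the final open segment.
    if current is not None and current.end - current.start >= min_duration:
        segments.append(current)

    return segments
```

Each resulting segment gives a time span that could be cut from the audio track and paired with its text as a weakly labeled training example. A real system would also need to handle OCR noise and subtitles that are not synchronized with the speech, which this sketch ignores.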