Fast offline Transformer-based end-to-end automatic speech recognition for real-world applications (2101.05600v3)
Abstract: With the recent advances in technology, automatic speech recognition (ASR) has been widely used in real-world applications. The efficiency of converting large amounts of speech into text accurately with limited resources has become more important than ever. This paper proposes a method to rapidly recognize a large speech database via a Transformer-based end-to-end model. Transformers have improved the state-of-the-art performance in many fields. However, they are not easy to use for long sequences. In this paper, various techniques to speed up the recognition of real-world speeches are proposed and tested, including decoding via multiple-utterance batched beam search, detecting end-of-speech based on a connectionist temporal classification (CTC), restricting the CTC prefix score, and splitting long speeches into short segments. Experiments are conducted with the Librispeech English and the real-world Korean ASR tasks to verify the proposed methods. From the experiments, the proposed system can convert 8 hours of speeches spoken at real-world meetings into text in less than 3 minutes with a 10.73% character error rate, which is 27.1% relatively lower than that of conventional systems.
Collections
Sign up for free to add this paper to one or more collections.
Paper Prompts
Sign up for free to create and run prompts on this paper using GPT-5.