Anatomy of Industrial Scale Multilingual ASR

(2404.09841)
Published Apr 15, 2024 in eess.AS , cs.CL , cs.LG , and cs.SD

Abstract

This paper describes AssemblyAI's industrial-scale automatic speech recognition (ASR) system, designed to meet the requirements of large-scale, multilingual ASR serving various application needs. Our system leverages a diverse training dataset comprising unsupervised (12.5M hours), supervised (188k hours), and pseudo-labeled (1.6M hours) data across four languages. We provide a detailed description of our model architecture, consisting of a full-context 600M-parameter Conformer encoder pre-trained with BEST-RQ and an RNN-T decoder fine-tuned jointly with the encoder. Our extensive evaluation demonstrates competitive word error rates (WERs) against larger and more computationally expensive models, such as Whisper large and Canary-1B. Furthermore, our architectural choices yield several key advantages, including an improved code-switching capability, a 5x inference speedup compared to an optimized Whisper baseline, a 30% reduction in hallucination rate on speech data, and a 90% reduction in hallucinations on ambient noise compared to Whisper, along with significantly improved timestamp accuracy. Throughout this work, we adopt a system-centric approach to analyzing various aspects of fully fledged ASR models to gain practically relevant insights useful for real-world services operating at scale.

Figure: Comparison of error reduction in Universal-1 vs. Whisper large-v3 and Canary-1B for various Ns.

Overview

  • AssemblyAI's Universal-1 is a new automatic speech recognition (ASR) system with multilingual capabilities, focused on achieving high accuracy and reduced word error rates (WERs) across English, Spanish, German, and French.

  • The system combines a Conformer encoder with an RNN-T decoder, pre-trained on 12.5M hours of audio data and fine-tuned with 1.8M hours, to deliver remarkable performance improvements over existing models.

  • Universal-1 demonstrates competitive WERs across languages, a 5x inference speedup, a 30% reduction in hallucinations over baselines, and the ability to handle code-switching efficiently.

  • Future directions for the technology include refining multilingual ASR models, extending capabilities to more languages, and improving efficiency and accuracy in code-switching and timestamp estimation.

Exploring AssemblyAI's Multilingual ASR System: Universal-1

Introduction to Universal-1

AssemblyAI's paper describes the development and extensive evaluation of a new automatic speech recognition (ASR) system, named Universal-1. This ASR system is primarily highlighted for its multilingual capabilities, covering English, Spanish, German, and French, with a focus on achieving high accuracy, reduced word error rates (WERs), and efficient performance across various challenging conditions. Universal-1 leverages a Conformer encoder combined with an RNN-T decoder, a setup pre-trained on 12.5M hours of audio data and fine-tuned with an additional 1.8M hours, showcasing remarkable results against competitive models like Whisper and Canary-1B.

Model Architecture and Training

The architecture of Universal-1 uses a carefully chosen mix of unsupervised, supervised, and pseudo-labeled data to address the variety and complexity of real-world speech. It employs a full-context Conformer encoder with 600M parameters and an RNN-T decoder. The training approach is described as a two-stage process, accommodating a vast amount of pre-training audio data to harness the benefits of self-supervised learning (SSL) in conjunction with fine-tuning on labeled datasets. Crucial to its robust performance, the system also implements various strategies for dealing with ambient noise and accurate timestamp estimation.
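
Universal-1's training code is not public, but the core of BEST-RQ pre-training is simple to illustrate: a frozen random projection maps each acoustic frame into a low-dimensional space, and the index of the nearest codebook vector becomes the discrete prediction target for the masked encoder. The following NumPy sketch shows only this target-generation step; the dimensions and codebook size are illustrative, not the paper's.

```python
import numpy as np

rng = np.random.default_rng(0)
feat_dim, proj_dim, codebook_size = 80, 16, 8192  # illustrative sizes

# Frozen at initialization and never trained, per the BEST-RQ recipe.
projection = rng.standard_normal((feat_dim, proj_dim))
codebook = rng.standard_normal((codebook_size, proj_dim))
codebook /= np.linalg.norm(codebook, axis=1, keepdims=True)

def bestrq_targets(features: np.ndarray) -> np.ndarray:
    """Map each frame to the index of its nearest codebook vector."""
    proj = features @ projection
    proj /= np.linalg.norm(proj, axis=1, keepdims=True)
    # With unit-norm vectors, nearest neighbor == largest dot product.
    return np.argmax(proj @ codebook.T, axis=1)

frames = rng.standard_normal((100, feat_dim))  # stand-in for log-mel features
labels = bestrq_targets(frames)
print(labels.shape)  # one discrete target per frame
```

During pre-training, the encoder sees masked frames and is trained to predict these labels; fine-tuning then attaches the RNN-T decoder and optimizes on labeled transcripts.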

Key Findings and Contributions

  • Competitive Performance: Universal-1 achieved competitive WERs across multiple languages and datasets with significantly fewer parameters than its counterparts.
  • Inference Efficiency: The system boasts a 5x inference speedup and a 30% reduction in hallucinations over an optimized Whisper baseline, offering practical benefits for real-time applications.
  • Code-Switching Capability: An emergent capability of handling code-switching efficiently, even without explicit training on code-switched samples, was demonstrated, underscoring the model's versatile linguistic adaptability.
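
WER, the headline metric in these comparisons, is the word-level edit distance between reference and hypothesis divided by the reference length. A minimal self-contained implementation (production evaluations normally add text normalization first):

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level Levenshtein distance / reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = minimum edits to turn ref[:i] into hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i  # deletions
    for j in range(len(hyp) + 1):
        dp[0][j] = j  # insertions
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = dp[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            dp[i][j] = min(sub, dp[i - 1][j] + 1, dp[i][j - 1] + 1)
    return dp[len(ref)][len(hyp)] / max(len(ref), 1)

print(wer("the cat sat", "the cat sat on"))  # one insertion over three words
```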

Practical Implications and Theoretical Insights

The system-centric approach adopted for analyzing ASR models allowed the authors to examine practical aspects that benchmark-centric evaluations often overlook: robustness to ambient noise, accurate timestamp estimation without an additional alignment model, and behavior in code-switching scenarios. The research also underscores the substantial impact of scaling, in both model parameters and dataset size, on ASR performance. Yet it suggests that architectural choices and training methodology can partly offset the need for scale, hinting at a more nuanced relationship between model size, data quantity, and ASR quality than previously understood.

Future Directions in AI and ASR

Universal-1's achievements prompt several avenues for future exploration, particularly in refining multilingual ASR models and extending their capabilities to more languages and dialects. Investigating the implicit learning of code-switching, minimizing hallucinations further, and enhancing timestamp accuracy could lead to more sophisticated and universally applicable ASR systems. Additionally, exploring the diminishing returns of pre-training with massive datasets might provide valuable insights into optimal resource utilization for training state-of-the-art ASR systems.

Conclusion

In sum, Universal-1 represents a significant step forward in the pursuit of highly efficient, accurate, and versatile multilingual ASR systems. By judiciously combining architectural innovations with extensive training data, AssemblyAI has managed to make notable strides in addressing both long-standing and emerging challenges in the field of ASR. As ASR technology continues to evolve, the insights and methodologies shared through Universal-1 will undoubtedly influence future research and development endeavors within the AI community.
