Anatomy of Industrial Scale Multilingual ASR

(2404.09841)
Published Apr 15, 2024 in eess.AS , cs.CL , cs.LG , and cs.SD

Abstract

This paper describes AssemblyAI's industrial-scale automatic speech recognition (ASR) system, designed to meet the requirements of large-scale, multilingual ASR serving various application needs. Our system leverages a diverse training dataset comprising unsupervised (12.5M hours), supervised (188k hours), and pseudo-labeled (1.6M hours) data across four languages. We provide a detailed description of our model architecture, consisting of a full-context 600M-parameter Conformer encoder pre-trained with BEST-RQ and an RNN-T decoder fine-tuned jointly with the encoder. Our extensive evaluation demonstrates competitive word error rates (WERs) against larger and more computationally expensive models, such as Whisper large and Canary-1B. Furthermore, our architectural choices yield several key advantages, including an improved code-switching capability, a 5x inference speedup compared to an optimized Whisper baseline, a 30% reduction in hallucination rate on speech data, and a 90% reduction in hallucinations on ambient noise compared to Whisper, along with significantly improved timestamp accuracy. Throughout this work, we adopt a system-centric approach to analyzing various aspects of fully fledged ASR models to gain practically relevant insights useful for real-world services operating at scale.

Figure: Comparison of error reduction in Universal-1 vs. Whisper large-v3 and Canary-1B for various Ns.

Overview

  • AssemblyAI's Universal-1 is a new automatic speech recognition (ASR) system with multilingual capabilities, focused on achieving high accuracy and reduced word error rates (WERs) across English, Spanish, German, and French.

  • The system combines a Conformer encoder with an RNN-T decoder, pre-trained on 12.5M hours of audio data and fine-tuned with 1.8M hours, to deliver remarkable performance improvements over existing models.

  • Universal-1 demonstrates competitive WERs across languages, a 5x inference speedup, a 30% reduction in hallucinations over baselines, and the ability to handle code-switching efficiently.

  • Future directions for the technology include refining multilingual ASR models, extending capabilities to more languages, and improving efficiency and accuracy in code-switching and timestamp estimation.

Exploring AssemblyAI's Multilingual ASR System: Universal-1

Introduction to Universal-1

AssemblyAI's paper describes the development and extensive evaluation of a new automatic speech recognition (ASR) system, named Universal-1. This ASR system is primarily highlighted for its multilingual capabilities, covering English, Spanish, German, and French, with a focus on achieving high accuracy, reduced word error rates (WERs), and efficient performance across various challenging conditions. Universal-1 leverages a Conformer encoder combined with an RNN-T decoder, a setup pre-trained on 12.5M hours of audio data and fine-tuned with an additional 1.8M hours, showcasing remarkable results against competitive models like Whisper and Canary-1B.

Model Architecture and Training

The architecture of Universal-1 uses a carefully chosen mix of unsupervised, supervised, and pseudo-labeled data to address the variety and complexity of real-world speech. It employs a full-context Conformer encoder with 600M parameters and an RNN-T decoder. The training approach is described as a two-stage process, accommodating a vast amount of pre-training audio data to harness the benefits of self-supervised learning (SSL) in conjunction with fine-tuning on labeled datasets. Crucial to its robust performance, the system also implements various strategies for dealing with ambient noise and accurate timestamp estimation.
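
Universal-1's training code is not public, but the core of BEST-RQ pre-training is simple to illustrate: a frozen random projection maps each acoustic frame into a low-dimensional space, and the index of the nearest codebook vector becomes the discrete prediction target for the masked encoder. The following NumPy sketch shows only this target-generation step; the dimensions and codebook size are illustrative, not the paper's.

```python
import numpy as np

rng = np.random.default_rng(0)
feat_dim, proj_dim, codebook_size = 80, 16, 8192  # illustrative sizes

# Frozen at initialization and never trained, per the BEST-RQ recipe.
projection = rng.standard_normal((feat_dim, proj_dim))
codebook = rng.standard_normal((codebook_size, proj_dim))
codebook /= np.linalg.norm(codebook, axis=1, keepdims=True)

def bestrq_targets(features: np.ndarray) -> np.ndarray:
    """Map each frame to the index of its nearest codebook vector."""
    proj = features @ projection
    proj /= np.linalg.norm(proj, axis=1, keepdims=True)
    # With unit-norm vectors, nearest neighbor == largest dot product.
    return np.argmax(proj @ codebook.T, axis=1)

frames = rng.standard_normal((100, feat_dim))  # stand-in for log-mel features
labels = bestrq_targets(frames)
print(labels.shape)  # one discrete target per frame
```

During pre-training, the encoder sees masked frames and is trained to predict these labels; fine-tuning then attaches the RNN-T decoder and optimizes on labeled transcripts.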

Key Findings and Contributions

  • Competitive Performance: Universal-1 achieved competitive WERs across multiple languages and datasets with significantly fewer parameters than its counterparts.
  • Inference Efficiency: The system boasts a 5x inference speedup and a 30% reduction in hallucinations over an optimized Whisper baseline, offering practical benefits for real-time applications.
  • Code-Switching Capability: An emergent capability of handling code-switching efficiently, even without explicit training on code-switched samples, was demonstrated, underscoring the model's versatile linguistic adaptability.
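
WER, the headline metric in these comparisons, is the word-level edit distance between reference and hypothesis divided by the reference length. A minimal self-contained implementation (production evaluations normally add text normalization first):

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level Levenshtein distance / reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = minimum edits to turn ref[:i] into hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i  # deletions
    for j in range(len(hyp) + 1):
        dp[0][j] = j  # insertions
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = dp[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            dp[i][j] = min(sub, dp[i - 1][j] + 1, dp[i][j - 1] + 1)
    return dp[len(ref)][len(hyp)] / max(len(ref), 1)

print(wer("the cat sat", "the cat sat on"))  # one insertion over three words
```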

Practical Implications and Theoretical Insights

The system-centric approach adopted for analyzing ASR models allowed the authors to examine practical aspects that benchmark-centric evaluations often overlook: robustness to ambient noise, accurate timestamp estimation without an additional alignment model, and behavior in code-switching scenarios. The research also underscores the substantial impact of scaling, in both model parameters and dataset size, on ASR performance. Yet it suggests that architectural choices and training methodology can partly offset the need for scale, hinting at a more nuanced relationship between model size, data quantity, and ASR quality than previously understood.

Future Directions in AI and ASR

Universal-1's achievements prompt several avenues for future exploration, particularly in refining multilingual ASR models and extending their capabilities to more languages and dialects. Investigating the implicit learning of code-switching, minimizing hallucinations further, and enhancing timestamp accuracy could lead to more sophisticated and universally applicable ASR systems. Additionally, exploring the diminishing returns of pre-training with massive datasets might provide valuable insights into optimal resource utilization for training state-of-the-art ASR systems.

Conclusion

In sum, Universal-1 represents a significant step forward in the pursuit of highly efficient, accurate, and versatile multilingual ASR systems. By judiciously combining architectural innovations with extensive training data, AssemblyAI has managed to make notable strides in addressing both long-standing and emerging challenges in the field of ASR. As ASR technology continues to evolve, the insights and methodologies shared through Universal-1 will undoubtedly influence future research and development endeavors within the AI community.
