Google USM: Scaling Automatic Speech Recognition Beyond 100 Languages

Published 2 Mar 2023 in cs.CL, cs.SD, and eess.AS | (2303.01037v3)

Abstract: We introduce the Universal Speech Model (USM), a single large model that performs automatic speech recognition (ASR) across 100+ languages. This is achieved by pre-training the encoder of the model on a large unlabeled multilingual dataset of 12 million (M) hours spanning over 300 languages, and fine-tuning on a smaller labeled dataset. We use multilingual pre-training with random-projection quantization and speech-text modality matching to achieve state-of-the-art performance on downstream multilingual ASR and speech-to-text translation tasks. We also demonstrate that despite using a labeled training set 1/7-th the size of that used for the Whisper model, our model exhibits comparable or better performance on both in-domain and out-of-domain speech recognition tasks across many languages.

Authors (27)

Citations (229)

Summary

USM scales ASR to more than 100 languages by pre-training its encoder on 12 million hours of unlabeled audio spanning over 300 languages.
It combines BEST-RQ pre-training, multi-objective supervised pre-training (MOST) to align speech and text representations, and chunk-wise attention to handle long-form audio.
It reports state-of-the-art results on multilingual benchmarks such as FLEURS (ASR) and CoVoST 2 (speech translation), with notable gains for low-resource languages.

Evaluating the Universal Speech Model: Scaling Automatic Speech Recognition Across 100+ Languages

The Universal Speech Model (USM) is a substantial effort to extend automatic speech recognition (ASR) to more than a hundred languages. It reaches this scale by combining a large amount of unlabeled multilingual audio with a much smaller amount of labeled data, narrowing the gap between resource-rich and resource-poor languages.

Core Techniques and Contributions

USM rests on a set of methodological advances and on very large datasets. The central strategy is to pre-train a large encoder on 12 million hours of untranscribed audio covering over 300 languages. This pre-training lets the model capture diverse speech patterns without the cost of labeling the data.

The model relies on three techniques:

BEST-RQ Pre-training: A BERT-style masked-prediction approach that replaces the conventional learned quantization step with random-projection quantization. With multiple softmax layers, this technique improves stability and efficiency when training large models.
Multi-Objective Supervised Pre-Training (MOST): This approach combines BEST-RQ with text injection, aligning speech and text representations in a shared embedding space. This alignment helps the model generalize across downstream tasks, including both ASR and automatic speech translation (AST).
Chunk-wise Attention: Introduced to address the degradation of ASR quality on long-form audio, this method restricts attention to within individual audio chunks, which makes the model more robust on extended inputs.
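
As a rough illustration of two of the techniques above, the sketch below generates BEST-RQ-style training targets with a frozen random projection and a frozen random codebook, and builds a chunk-wise attention mask. It is not the authors' implementation: the function names, dimensions, Gaussian initialization, and seeds are assumptions chosen for illustration, and it uses only the Python standard library.

```python
import math
import random


def _l2_normalize(vec):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]


def make_quantizer(feat_dim, code_dim, codebook_size, seed=0):
    """Create a random projection and a random codebook; both stay frozen."""
    rng = random.Random(seed)
    # Assumed initialization: independent standard-normal entries.
    projection = [[rng.gauss(0.0, 1.0) for _ in range(code_dim)]
                  for _ in range(feat_dim)]
    codebook = [_l2_normalize([rng.gauss(0.0, 1.0) for _ in range(code_dim)])
                for _ in range(codebook_size)]
    return projection, codebook


def quantize(frames, projection, codebook):
    """Map each feature frame to the index of its nearest codebook entry."""
    code_dim = len(projection[0])
    labels = []
    for frame in frames:
        projected = [sum(frame[i] * projection[i][j] for i in range(len(frame)))
                     for j in range(code_dim)]
        projected = _l2_normalize(projected)
        # For unit vectors, the nearest neighbour has the largest dot product.
        scores = [sum(p * c for p, c in zip(projected, code))
                  for code in codebook]
        labels.append(max(range(len(codebook)), key=scores.__getitem__))
    return labels


def chunk_attention_mask(num_frames, chunk_size):
    """mask[q][k] is True when query frame q may attend to key frame k,
    i.e. when both frames fall inside the same chunk."""
    return [[q // chunk_size == k // chunk_size for k in range(num_frames)]
            for q in range(num_frames)]


if __name__ == "__main__":
    projection, codebook = make_quantizer(feat_dim=4, code_dim=3, codebook_size=8)
    rng = random.Random(1)
    frames = [[rng.gauss(0.0, 1.0) for _ in range(4)] for _ in range(6)]
    print("target labels:", quantize(frames, projection, codebook))
    for row in chunk_attention_mask(num_frames=6, chunk_size=3):
        print("".join("x" if allowed else "." for allowed in row))
```

In pre-training, the labels of masked frames would serve as the prediction targets for the encoder, while the projection and codebook are never updated. The mask would be applied to the attention scores so that each frame attends only to frames in its own chunk.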

Experimental Evaluations

In evaluating ASR performance, the authors highlight several key results. Notably, the USM models achieve state-of-the-art results on multilingual ASR tasks on datasets such as FLEURS and YouTube. They also match or outperform Whisper and other competing architectures despite using less labeled data, which shows how efficiently USM exploits untranscribed data during pre-training.

USM's performance also extends to unseen languages for which little paired data is available. Through adapter-based extensions and residual adaptation methods, the models achieve notable improvements over existing baselines. The AST results on CoVoST 2 further demonstrate the versatility and adaptability of the USM models.
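
A residual adapter of the kind mentioned above can be sketched as a small bottleneck network whose output is added back to its input. The layout here (layer normalization, down-projection, ReLU, up-projection) is a common adapter design assumed for illustration, not a confirmed description of USM's adapters; all names and shapes are hypothetical.

```python
import math


def layer_norm(x, eps=1e-6):
    """Normalize a vector to zero mean and unit variance (no learned scale)."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]


def matvec(weights, x):
    """Multiply a matrix stored as [out_dim][in_dim] by a vector."""
    return [sum(w * v for w, v in zip(row, x)) for row in weights]


def residual_adapter(x, w_down, w_up):
    """Return x + W_up * relu(W_down * layer_norm(x))."""
    hidden = [max(0.0, h) for h in matvec(w_down, layer_norm(x))]
    update = matvec(w_up, hidden)
    return [xi + ui for xi, ui in zip(x, update)]


if __name__ == "__main__":
    x = [0.5, -1.0, 2.0, 0.0]
    w_down = [[0.1, 0.2, -0.3, 0.4], [0.0, -0.1, 0.2, 0.3]]  # 4 -> 2
    w_up = [[0.0, 0.0] for _ in range(4)]                    # 2 -> 4, zeros
    print(residual_adapter(x, w_down, w_up))
```

Initializing the up-projection to zeros makes the adapter an identity function at the start, so inserting it does not change the pre-trained model's outputs until the adapter weights are trained on the new language.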

Implications and Future Directions

USM marks an important milestone in multilingual speech recognition. Its core strategy of leveraging unlabeled data can ease the resource constraints that typically hinder ASR development for underrepresented languages. By establishing a performance baseline across more than 100 languages, USM paves the way for more inclusive speech processing technologies.

Future research should concentrate on choosing the transducer best suited to each downstream task, since different ASR applications may favor different choices. There is also potential in expanding the pre-training corpus to include more diverse audio sources, which could further improve the robustness and generalizability of the models.

Conclusion

USM's methodology and results contribute substantially to the ASR landscape, particularly in scaling speech technologies to accommodate global linguistic diversity. These advances offer a sustainable model for broadening language coverage in speech recognition and strengthen the potential for real-world applications across varied linguistic communities. By continuing to refine these models and their underlying techniques, researchers can progressively dismantle the language barriers present in modern ASR systems.
