Emergent Mind

Less is More: Accurate Speech Recognition & Translation without Web-Scale Data

(2406.19674)
Published Jun 28, 2024 in cs.CL , cs.LG , cs.SD , and eess.AS

Abstract

Recent advances in speech recognition and translation rely on hundreds of thousands of hours of Internet speech data. We argue that state-of-the-art accuracy can be reached without relying on web-scale data. Canary, a multilingual ASR and speech translation model, outperforms current state-of-the-art models (Whisper, OWSM, and Seamless-M4T) on English, French, Spanish, and German, while being trained on an order of magnitude less data than these models. Three key factors enable such a data-efficient model: (1) a FastConformer-based attention encoder-decoder architecture, (2) training on synthetic data generated with machine translation, and (3) advanced training techniques: data balancing, dynamic data blending, dynamic bucketing, and noise-robust fine-tuning. The model, weights, and training code will be open-sourced.

Figure: Word error rate comparison across 12 test sets, with 95% confidence intervals from the bootstrap method.
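Confidence intervals like those in the figure are typically obtained by bootstrap resampling over per-utterance error counts. The sketch below is illustrative only (toy data, not the paper's evaluation code):

```python
# Bootstrap 95% CI for corpus-level WER: resample utterances with
# replacement and recompute the pooled error rate many times.

import random

def bootstrap_wer_ci(errors, ref_words, n_resamples=10_000, seed=0):
    """95% CI for corpus WER from per-utterance (errors, reference-words)."""
    rng = random.Random(seed)
    pairs = list(zip(errors, ref_words))
    stats = []
    for _ in range(n_resamples):
        sample = [rng.choice(pairs) for _ in pairs]  # resample with replacement
        stats.append(sum(e for e, _ in sample) / sum(w for _, w in sample))
    stats.sort()
    return stats[int(0.025 * n_resamples)], stats[int(0.975 * n_resamples)]

# Toy data: 5 utterances with (word errors, reference length in words).
lo, hi = bootstrap_wer_ci([1, 0, 2, 1, 0], [10, 8, 12, 9, 11])
print(round(lo, 3), round(hi, 3))
```

Resampling whole utterances (rather than words) preserves the within-utterance error correlation, which is why it is the standard choice for ASR confidence intervals.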

Overview

  • The paper presents 'Canary', a novel multilingual automatic speech recognition (ASR) and speech translation (AST) model that achieves state-of-the-art performance without relying on web-scale data.

  • Canary employs a FastConformer-based attention encoder-decoder architecture and integrates advanced training techniques like dynamic data blending and noise-robust fine-tuning to enhance efficiency and robustness.

  • Using only 86K hours of speech data, Canary outperforms larger models on established ASR and AST benchmarks, demonstrating lower word error rates (WERs) across multiple languages and competitive BLEU scores in translation tasks.

Less is More: Accurate Speech Recognition and Translation without Web-Scale Data

The paper by Krishna C. Puvvada et al. titled "Less is More: Accurate Speech Recognition and Translation without Web-Scale Data" presents a novel multilingual automatic speech recognition (ASR) and speech translation (AST) model called Canary. This model challenges the prevailing notion that vast amounts of data are essential for achieving state-of-the-art performance. The research demonstrates that it is possible to match or exceed the performance of contemporary large-scale models using significantly less data.

Key Contributions

  1. Model Architecture: Canary leverages a FastConformer-based attention encoder-decoder (AED) architecture. FastConformer, a modification of the Conformer encoder, improves efficiency by increasing the input downsampling factor from 4x to 8x, which shortens the sequence reaching the attention layers. This yields a substantial speedup while maintaining the model's capacity to accurately process speech.
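To make the effect of the larger downsampling factor concrete, the following back-of-the-envelope sketch (not the NeMo implementation; frame rates are typical values, not quoted from the paper) shows how 8x subsampling shrinks the encoder sequence and, because self-attention scales quadratically in length, cuts attention cost roughly 4x versus 4x subsampling:

```python
# Effect of convolutional subsampling on sequence length and attention cost.

def num_frames(audio_seconds: float, hop_ms: float = 10.0) -> int:
    """Feature frames produced by a typical 10 ms-hop front end."""
    return int(audio_seconds * 1000 / hop_ms)

def encoder_frames(frames: int, downsampling: int) -> int:
    """Frames reaching the attention layers after subsampling."""
    return frames // downsampling

def relative_attention_cost(frames_a: int, frames_b: int) -> float:
    """Self-attention is O(T^2) in sequence length."""
    return (frames_a / frames_b) ** 2

t = num_frames(30.0)                  # 30 s utterance -> 3000 feature frames
conformer = encoder_frames(t, 4)      # Conformer: 4x -> 750 frames
fastconformer = encoder_frames(t, 8)  # FastConformer: 8x -> 375 frames
print(conformer, fastconformer)
print(relative_attention_cost(conformer, fastconformer))  # 4.0
```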
  2. Training Techniques: The model's data efficiency is achieved through advanced training methods, which include:

    • Dynamic Data Blending: Ensuring balanced sampling of multiple languages and datasets to avoid overfitting to specific domains.
    • Dynamic Bucketing and Batch Size: Utilizing stratified sampling based on utterance duration to optimize the batching process and reduce padding, resulting in efficient use of computational resources.
    • Noise-Robust Fine-Tuning: Incorporating non-speech audio data to reduce the model's susceptibility to hallucinations, thereby enhancing its robustness.
  3. Training Data and Efficiency: The research showcases that Canary was trained on a mere 86K hours of speech data using a mixture of public and in-house datasets across multiple languages—an order of magnitude less than what is typically used by comparable models like Whisper and SeamlessM4T. Despite this, Canary outperforms these models on established ASR and AST benchmarks.
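Of the techniques above, dynamic bucketing is the easiest to sketch: group utterances of similar duration so each batch wastes little computation on padding, and give longer utterances smaller batches so the total audio per batch stays roughly constant. This toy implementation is illustrative only (all names are hypothetical; it is not NeMo's actual sampler):

```python
# Duration bucketing: batch similar-length utterances together.

import random
from bisect import bisect_right

def make_buckets(durations, boundaries):
    """Assign each utterance index to a duration bucket."""
    buckets = [[] for _ in range(len(boundaries) + 1)]
    for idx, dur in enumerate(durations):
        buckets[bisect_right(boundaries, dur)].append(idx)
    return buckets

def batches(buckets, max_batch_seconds, durations):
    """Yield batches drawn from one bucket at a time, capping the total
    seconds of audio per batch so batch size adapts to utterance length."""
    for bucket in buckets:
        random.shuffle(bucket)  # shuffle within each bucket per epoch
        batch, total = [], 0.0
        for idx in bucket:
            if batch and total + durations[idx] > max_batch_seconds:
                yield batch
                batch, total = [], 0.0
            batch.append(idx)
            total += durations[idx]
        if batch:
            yield batch

durations = [1.2, 14.8, 2.1, 15.5, 1.9, 30.0, 2.4, 16.2]  # seconds
buckets = make_buckets(durations, boundaries=[5.0, 20.0])
for b in batches(buckets, max_batch_seconds=32.0, durations=durations):
    print(b, round(sum(durations[i] for i in b), 1))
```

Note how short utterances land in large batches while the 30-second clip forms a batch of one, keeping padding minimal in both cases.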

Experimental Results

ASR Performance

Canary was evaluated across four languages (English, German, Spanish, and French) using standard test sets such as MCV-16.1, MLS, and VoxPopuli. The model achieved lower WERs than state-of-the-art baselines on most test sets, demonstrating its effectiveness in multilingual ASR. Notably, Canary achieved an average WER of 6.20% on English, 6.27% on German, 4.09% on Spanish, and 5.39% on French, outperforming larger models like SeamlessM4T-large-v2.
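The WER figures above come from the standard word-level edit-distance metric; a minimal reference implementation:

```python
# Word error rate: Levenshtein distance over words, normalized by
# the reference length.

def wer(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    # Single-row dynamic-programming edit distance between word sequences.
    d = list(range(len(hyp) + 1))
    for i in range(1, len(ref) + 1):
        prev, d[0] = d[0], i
        for j in range(1, len(hyp) + 1):
            cur = min(
                prev + (ref[i - 1] != hyp[j - 1]),  # substitution / match
                d[j] + 1,                           # deletion
                d[j - 1] + 1,                       # insertion
            )
            prev, d[j] = d[j], cur
    return d[-1] / len(ref)

print(wer("the cat sat on the mat", "the cat sat on mat"))  # 1/6, one deletion
```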

AST Performance

For speech-to-text translation tasks, Canary was evaluated on datasets like FLEURS, mExpresso, and CoVoST-v2. The results reveal that Canary delivers competitive BLEU scores, often matching or surpassing models of similar size like SeamlessM4T-medium, even though it was trained solely on pseudo-labeled translation data. The model's ability to translate from and to multiple languages highlights its versatility in AST applications.
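BLEU, the metric behind these translation results, combines clipped n-gram precisions with a brevity penalty. The sketch below is a simplified single-reference, single-sentence version for intuition only; real evaluations use a standardized toolkit such as sacrebleu:

```python
# Simplified BLEU: geometric mean of 1..4-gram precisions times a
# brevity penalty that discourages overly short hypotheses.

import math
from collections import Counter

def ngrams(tokens, n):
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def bleu(reference: str, hypothesis: str, max_n: int = 4) -> float:
    ref, hyp = reference.split(), hypothesis.split()
    log_prec = 0.0
    for n in range(1, max_n + 1):
        hyp_ng, ref_ng = ngrams(hyp, n), ngrams(ref, n)
        overlap = sum((hyp_ng & ref_ng).values())  # clipped n-gram matches
        total = max(sum(hyp_ng.values()), 1)
        if overlap == 0:
            return 0.0  # (real BLEU applies smoothing instead)
        log_prec += math.log(overlap / total) / max_n
    bp = min(1.0, math.exp(1 - len(ref) / len(hyp)))  # brevity penalty
    return 100.0 * bp * math.exp(log_prec)
```

An identical hypothesis scores 100, while a hypothesis sharing no words with the reference scores 0; production BLEU additionally standardizes tokenization, which is why toolkit choice matters when comparing papers.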

Robustness to Hallucinations

An interesting aspect of the study is its focus on the robustness of the Canary model to hallucinations, particularly when processing non-speech audio. The results indicate a significant reduction in hallucinated characters when noise-robust training is employed, demonstrating the model's enhanced reliability in real-world applications.
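In the spirit of this evaluation, hallucination robustness can be quantified by counting the characters a model emits on noise-only audio, where the ideal output is empty. The function and outputs below are purely illustrative (hypothetical model transcripts, not the paper's data):

```python
# Hallucination metric: characters emitted on non-speech input, per minute.

def hallucinated_chars_per_minute(transcripts, minutes_of_noise):
    """Total characters emitted across noise-only clips, per minute of audio."""
    chars = sum(len(t.strip()) for t in transcripts)
    return chars / minutes_of_noise

# Hypothetical outputs on four noise-only clips; one spurious phrase.
outputs = ["", "thank you", "", ""]
print(hallucinated_chars_per_minute(outputs, minutes_of_noise=2.0))  # 4.5
```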

Implications and Future Directions

The research presents several practical and theoretical implications:

  • Data Efficiency: The success of Canary suggests that advanced training techniques and efficient model architectures can mitigate the need for vast amounts of training data, potentially lowering the barrier to entry for developing high-performance ASR and AST systems.
  • Model Robustness: The introduction of noise-robust fine-tuning opens avenues for improving the reliability of real-time speech processing systems, making them more resilient to non-speech noise and reducing erroneous outputs.
  • Open-Source Integration: By open-sourcing the model and training code, the research facilitates reproducibility and fosters collaboration within the community, promoting the further development and application of efficient speech models.

Future research could explore extending the language support of Canary and incorporating additional modalities to enhance its multimodal capabilities. Additionally, the integration of advanced streaming mechanisms could further improve the model's performance on long-form audio, extending its applicability in real-time speech recognition and translation scenarios.

Conclusion

This paper provides a comprehensive exploration of achieving high accuracy in speech recognition and translation without the reliance on web-scale data. The introduction of Canary represents a significant advancement in the field, demonstrating that with innovative methodologies and efficient architectures, it is possible to achieve state-of-the-art performance with significantly reduced data and computational resources. This research not only offers immediate practical benefits but also sets the stage for future explorations in optimizing ASR and AST models.
