XLSR-Transducer: Streaming ASR for Self-Supervised Pretrained Models (2407.04439v2)

Published 5 Jul 2024 in eess.AS

Abstract: Self-supervised pretrained models exhibit competitive performance in automatic speech recognition on finetuning, even with limited in-domain supervised data. However, popular pretrained models are not suitable for streaming ASR because they are trained with full attention context. In this paper, we introduce XLSR-Transducer, where the XLSR-53 model is used as encoder in transducer setup. Our experiments on the AMI dataset reveal that the XLSR-Transducer achieves 4% absolute WER improvement over Whisper large-v2 and 8% over a Zipformer transducer model trained from scratch. To enable streaming capabilities, we investigate different attention masking patterns in the self-attention computation of transformer layers within the XLSR-53 model. We validate XLSR-Transducer on AMI and 5 languages from CommonVoice under low-resource scenarios. Finally, with the introduction of attention sinks, we reduce the left context by half while achieving a relative 12% improvement in WER.

Summary

  • The paper introduces XLSR-Transducer, merging XLSR-53 with transducer architecture to enable streaming ASR with reduced word error rates.
  • It leverages innovative attention masking techniques, including chunked and variable left-context patterns, to facilitate efficient real-time decoding.
  • Evaluations on the AMI and CommonVoice datasets show 4-8% absolute WER improvements over strong baselines, with attention sinks providing a further 12% relative gain, demonstrating robustness in low-resource settings.

Overview of "XLSR-Transducer: Streaming ASR for Self-Supervised Pretrained Models"

The paper "XLSR-Transducer: Streaming ASR for Self-Supervised Pretrained Models" presents a noteworthy exploration into the adaptation of self-supervised pretrained models for streaming automatic speech recognition (ASR) using the transducer architecture. The research addresses the inherent limitation of popular pretrained models, such as XLSR-53, which are traditionally trained with full attention context and thus, are unsuitable for real-time ASR applications.

Key Contributions

  1. Introduction of XLSR-Transducer: The paper introduces the XLSR-Transducer, a streaming ASR model that utilizes the XLSR-53 model as an encoder within the transducer architecture. The proposed model achieves a significant 4% absolute improvement in word error rate (WER) over the Whisper large-v2 model and an 8% improvement over a Zipformer transducer model trained from scratch on the AMI dataset.
  2. Attention Masking for Streaming Capability: To enable streaming, the authors explore different attention masking patterns in the self-attention computation of the transformer layers within the XLSR-53 model. These strategies include chunked masking and chunked masking with a variable number of left-context chunks, which enable efficient streaming decoding by limiting the frames visible during self-attention (see the mask-construction sketch after this list).
  3. Evaluation Framework: The performance of the XLSR-Transducer is validated on the AMI dataset and five languages from the CommonVoice dataset under low-resource scenarios. This broad evaluation establishes the model's effectiveness across different languages and data conditions.
  4. Introduction of Attention Sinks in ASR: The research applies attention sinks, a phenomenon in which transformer attention concentrates disproportionately on the initial inputs, to streaming ASR. Keeping these initial frames attendable during streaming inference allows the left context to be halved while achieving a relative 12% improvement in WER.
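
The chunked masking pattern described in contribution 2 can be illustrated with a short mask-construction routine. This is a minimal sketch assuming a boolean convention where True marks allowed attention positions; the function name and parameters (make_chunked_mask, chunk_size, num_left_chunks) are illustrative rather than taken from the paper's code.

```python
import torch

def make_chunked_mask(num_frames: int, chunk_size: int, num_left_chunks: int) -> torch.Tensor:
    """Build a boolean self-attention mask for chunked streaming attention.

    Frames are grouped into contiguous chunks of `chunk_size`. Each frame may
    attend to every frame in its own chunk and to frames in up to
    `num_left_chunks` preceding chunks; all future chunks are masked out.
    Returns a (num_frames, num_frames) tensor where True means "may attend".
    """
    frame_idx = torch.arange(num_frames)
    chunk_idx = frame_idx // chunk_size          # chunk id of each frame
    query_chunk = chunk_idx.unsqueeze(1)         # (T, 1)
    key_chunk = chunk_idx.unsqueeze(0)           # (1, T)
    # Allowed if the key's chunk is not in the future and not farther back
    # than `num_left_chunks` chunks from the query's chunk.
    not_future = key_chunk <= query_chunk
    within_left_context = key_chunk >= (query_chunk - num_left_chunks)
    return not_future & within_left_context

# Example: 8 frames, chunks of 2 frames, 1 left chunk of context.
mask = make_chunked_mask(num_frames=8, chunk_size=2, num_left_chunks=1)
print(mask.int())
```

Each frame attends to all frames in its own chunk plus a configurable number of preceding chunks, which is what bounds latency and computation during streaming decoding.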

Experimental Results

The results on the AMI dataset show that the XLSR-Transducer outperforms competitive models in both streaming and non-streaming ASR settings. For non-streaming ASR, the XLSR-Transducer achieves a WER of 12.7%, outperforming the Whisper large-v2 model by 25% relative and the Zipformer model by 39% relative. For streaming ASR, the model trained with multi-chunk masking achieves a WER of 17.7% with a 320 ms chunk size and 14.2% with a 1280 ms chunk size.

Similarly, the evaluation on the CommonVoice dataset, which includes five non-English languages, demonstrates the model's robustness. The XLSR-Transducer consistently achieves competitive WERs across different chunk sizes. Importantly, the non-streaming decoding of the streaming-trained models shows negligible degradation in performance, highlighting the model's versatility.

Implications and Future Work

The introduction of the XLSR-Transducer model signifies a substantial improvement in leveraging self-supervised pretrained models for streaming ASR. This approach mitigates the dependency on large-scale in-domain supervised data, making it especially advantageous for low-resource languages and applications.

The use of attention sinks in the self-attention mechanism offers a simple way to balance computational efficiency and recognition accuracy; a sketch of the mask augmentation is given below. The technique could be explored further to enhance real-time ASR systems and may extend to other domains where transformer-based models are used in streaming settings.
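
As a rough illustration of the idea, the following sketch augments a streaming attention mask so that the first few frames remain attendable to every query, which is how attention sinks allow the regular left context to be reduced. The function name and the number of sink frames are illustrative assumptions, not details from the paper.

```python
import torch

def add_attention_sinks(mask: torch.Tensor, num_sink_frames: int) -> torch.Tensor:
    """Augment a streaming attention mask so that every query can also attend
    to the first `num_sink_frames` frames of the utterance (the attention
    sink), regardless of how small the regular left context is."""
    sink_mask = mask.clone()
    sink_mask[:, :num_sink_frames] = True  # initial frames stay visible to all queries
    return sink_mask

# Example: a mask with only 4 frames of left context, plus 2 sink frames.
T = 12
idx = torch.arange(T)
keys, queries = idx.unsqueeze(0), idx.unsqueeze(1)
limited_left = (keys <= queries) & (keys > queries - 4)
with_sinks = add_attention_sinks(limited_left, num_sink_frames=2)
print(with_sinks.int())
```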

Future developments could involve expanding this methodology to broader datasets and more diverse linguistic contexts to improve generalizability. Additionally, optimizing the computational aspects of streaming ASR through hardware-aware optimization or model distillation could further enhance the real-time applicability of the XLSR-Transducer.

Conclusion

This paper makes a significant contribution to the field of ASR by adapting self-supervised pretrained models for effective streaming. The combination of XLSR-53 with the transducer architecture, together with streaming-oriented attention masking and attention sinks, yields a robust ASR system that delivers significant WER improvements while maintaining computational efficiency. The findings set a precedent for future research in streaming ASR, particularly in integrating advanced pretrained models into practical, real-time applications.