U2++ MoE: Scaling 4.7x parameters with minimal impact on RTF (2404.16407v2)
Abstract: Scale has opened new frontiers in natural language processing, but at a high cost. In response, Mixture-of-Experts (MoE) models, which learn to activate only a subset of parameters during training and inference, have been proposed as an energy-efficient path to even larger and more capable LLMs, and this shift towards a new generation of foundation models is gaining momentum, particularly within the field of Automatic Speech Recognition (ASR). Recent works that incorporate MoE into ASR models feature complex designs, such as routing frames via a supplementary embedding network, improving the experts' multilingual ability, and using dedicated auxiliary losses for expert load balancing or language-specific handling. We find that such delicate designs are not necessary: an embarrassingly simple substitution of MoE layers for all Feed-Forward Network (FFN) layers is sufficient for the ASR task. Specifically, we benchmark the proposed model on a large-scale in-house dataset (160k hours); the results show that we can scale our baseline Conformer (Dense-225M) to its MoE counterpart (MoE-1B) and achieve Dense-1B-level Word Error Rate (WER) while maintaining a Dense-225M-level Real Time Factor (RTF). Furthermore, by applying the Unified 2-pass (U2++) framework with bidirectional attention decoders, we obtain both streaming and non-streaming decoding modes in a single MoE-based model, which we call U2++ MoE. We hope our study facilitates research on scaling speech foundation models without sacrificing deployment efficiency.
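The substitution the abstract describes can be pictured as a drop-in module: each position-wise FFN in a Conformer block is swapped for a sparsely-gated MoE layer whose router activates only the top-k experts per frame. Below is a minimal PyTorch sketch of that idea, not the authors' released implementation; the module name `MoEPositionwiseFFN`, the hidden sizes, the expert count, and the top-k value are illustrative assumptions.

```python
# Minimal sketch: replace a Conformer position-wise FFN with a top-k routed MoE layer.
# All dimensions and hyperparameters below are illustrative, not taken from the paper.
import torch
import torch.nn as nn
import torch.nn.functional as F


class MoEPositionwiseFFN(nn.Module):
    """Drop-in FFN replacement: each frame is routed to its top-k experts only."""

    def __init__(self, d_model=512, d_ff=2048, num_experts=8, top_k=2):
        super().__init__()
        self.top_k = top_k
        self.router = nn.Linear(d_model, num_experts)  # frame-level gating scores
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(d_model, d_ff), nn.SiLU(), nn.Linear(d_ff, d_model))
            for _ in range(num_experts)
        ])

    def forward(self, x):  # x: (batch, time, d_model)
        B, T, D = x.shape
        flat = x.reshape(B * T, D)
        gates = F.softmax(self.router(flat), dim=-1)           # (B*T, num_experts)
        topk_gates, topk_idx = gates.topk(self.top_k, dim=-1)  # keep k experts per frame
        topk_gates = topk_gates / topk_gates.sum(-1, keepdim=True)
        out = torch.zeros_like(flat)
        for e, expert in enumerate(self.experts):
            mask = (topk_idx == e)                  # frames that routed to expert e
            rows, slot = mask.nonzero(as_tuple=True)
            if rows.numel() == 0:
                continue
            # Weighted contribution of expert e for its assigned frames only
            out[rows] += topk_gates[rows, slot].unsqueeze(-1) * expert(flat[rows])
        return out.reshape(B, T, D)
```

Under this scheme, total parameters grow roughly with the number of experts, but each frame still passes through only k expert FFNs, so per-frame compute, and hence RTF (decoding time divided by audio duration), stays close to that of the dense baseline.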
- Xingchen Song
- Di Wu
- Binbin Zhang
- Dinghao Zhou
- Zhendong Peng
- Bo Dang
- Fuping Pan
- Chao Yang