
U2++ MoE: Scaling 4.7x parameters with minimal impact on RTF

(2404.16407)
Published Apr 25, 2024 in cs.CL and eess.AS

Abstract

Scale has opened new frontiers in natural language processing, but at a high cost. In response, Mixture-of-Experts (MoE) models, which learn to activate only a subset of parameters during training and inference, have been proposed as an energy-efficient path to even larger and more capable language models. This shift towards a new generation of foundation models is gaining momentum, particularly within the field of Automatic Speech Recognition (ASR). Recent works that incorporate MoE into ASR models have complex designs, such as routing frames via a supplementary embedding network, improving the multilingual ability of the experts, and utilizing dedicated auxiliary losses for either expert load balancing or specific language handling. We find that such delicate designs are not necessary; an embarrassingly simple substitution of MoE layers for all Feed-Forward Network (FFN) layers is competent for the ASR task. More specifically, we benchmark the proposed model on a large-scale inner-source dataset (160k hours). The results show that we can scale the baseline Conformer (Dense-225M) to its MoE counterpart (MoE-1B) and achieve Dense-1B-level Word Error Rate (WER) while maintaining a Dense-225M-level Real Time Factor (RTF). Furthermore, by applying the Unified 2-pass framework with bidirectional attention decoders (U2++), we achieve both streaming and non-streaming decoding modes in a single MoE-based model, which we call U2++ MoE. We hope that our study can facilitate research on scaling speech foundation models without sacrificing deployment efficiency.

Figure: U2++ MoE framework, showing the unified two-pass joint CTC/AED design, bidirectional decoders, Mixture-of-Experts layers, and efficient speech frame compression.

Overview

  • The research paper discusses the implementation of Mixture-of-Experts (MoE) layers in place of traditional Feed-Forward Network (FFN) layers within Conformer-based ASR models, achieving efficient scaling and maintaining accuracy.

  • The model builds on the U2++ framework, supporting both streaming and non-streaming operation, and uses dynamic chunk masking during training so a single model can handle variable chunk sizes.

  • Experimental results show that the MoE-1B model matches the Word Error Rate (WER) of a Dense-1B model while retaining the Real Time Factor (RTF) of the much smaller Dense-225M baseline.

Simplified Integration of Mixture-of-Experts in ASR Models Achieves High Efficiency with Scaled Performance

Introduction

The evolution of neural network architectures for Automatic Speech Recognition (ASR) has consistently aimed at enhancing model performance while addressing computational and efficiency challenges. Recent work has incorporated Mixture-of-Experts (MoE) layers to manage the computational demands of scaling models. This research explores a straightforward approach: replacing the traditional Feed-Forward Network (FFN) layers with MoE layers in both the encoder and the decoder of a Conformer-based ASR model. Benchmarking on a substantial dataset totaling 160,000 hours demonstrates that this integration not only simplifies the model architecture compared with prior MoE-ASR designs, but also maintains high efficiency without compromising accuracy.

Model Architecture and Methodology

The core architectural component of the model is the Conformer, used for the encoder, while a Transformer is employed for the decoder. Each conventional FFN within these structures is replaced by an MoE layer consisting of multiple expert FFNs governed by a routing mechanism, which exploits sparsity to save computation while keeping model capacity high. The U2++ framework, known for its dual streaming and non-streaming capability, underpins the proposed model and permits the training strategy to be adjusted dynamically to align with either mode.
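
The sketch below illustrates this kind of drop-in substitution, assuming a softmax router that sends each frame to its top-k expert FFNs; class and parameter names (MoELayer, num_experts, top_k) are illustrative choices, not identifiers from the paper.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class ExpertFFN(nn.Module):
    """One expert: a standard position-wise feed-forward block."""

    def __init__(self, d_model: int, d_ff: int):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(d_model, d_ff),
            nn.SiLU(),
            nn.Linear(d_ff, d_model),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.net(x)


class MoELayer(nn.Module):
    """Drop-in replacement for an FFN: a router sends each frame to top_k experts."""

    def __init__(self, d_model: int, d_ff: int, num_experts: int = 8, top_k: int = 2):
        super().__init__()
        self.router = nn.Linear(d_model, num_experts, bias=False)
        self.experts = nn.ModuleList([ExpertFFN(d_model, d_ff) for _ in range(num_experts)])
        self.top_k = top_k

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, t, d = x.shape
        flat = x.reshape(-1, d)                              # (b*t, d_model)
        gate_logits = self.router(flat)                      # (b*t, num_experts)
        weights, idx = gate_logits.topk(self.top_k, dim=-1)  # pick experts per frame
        weights = F.softmax(weights, dim=-1)                 # normalise over chosen experts
        out = torch.zeros_like(flat)
        for k in range(self.top_k):
            for e, expert in enumerate(self.experts):
                mask = idx[:, k] == e                        # frames routed to expert e in slot k
                if mask.any():
                    out[mask] += weights[mask][:, k:k + 1] * expert(flat[mask])
        return out.reshape(b, t, d)
```

Because only top_k of the num_experts expert FFNs run for any given frame, capacity grows with the number of experts while per-frame compute stays close to that of a single dense FFN, which is the effect the summary describes as Dense-1B-level WER at Dense-225M-level RTF.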

  • Encoder and Decoder Modification: All FFN layers are substituted with MoE layers, which consist of a routing network and several expert networks.
  • Training Losses: A combined loss comprising Connectionist Temporal Classification (CTC) and Attention-based Encoder-Decoder (AED) terms is used, without any auxiliary losses for load balancing or expert routing (see the loss sketch after this list).
  • Dynamic Chunk Masking: For streaming capability, a dynamic chunk masking strategy is employed, allowing the model to handle variable chunk sizes and thus to serve both streaming and non-streaming use cases with a single set of weights (see the masking sketch after this list).
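
As a rough illustration of the joint objective above, a common way to combine these terms in U2++-style systems with bidirectional decoders is a weighted sum over the CTC loss and the two attention-decoder losses; the function name and weight values below are illustrative assumptions, not figures from the paper.

```python
import torch


def joint_ctc_aed_loss(ctc_loss: torch.Tensor,
                       l2r_att_loss: torch.Tensor,
                       r2l_att_loss: torch.Tensor,
                       ctc_weight: float = 0.3,
                       reverse_weight: float = 0.3) -> torch.Tensor:
    """Sketch of a joint CTC/AED objective with left-to-right and right-to-left
    attention-decoder losses; note there is no auxiliary load-balancing term."""
    att_loss = (1.0 - reverse_weight) * l2r_att_loss + reverse_weight * r2l_att_loss
    return ctc_weight * ctc_loss + (1.0 - ctc_weight) * att_loss
```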
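
And below is a minimal sketch of dynamic chunk masking as used in U2++-style training: with some probability a batch sees full context (non-streaming behaviour), otherwise a random chunk size is drawn and each frame may attend only to its own chunk and to earlier chunks. The function name, probability, and chunk range are assumptions for illustration.

```python
import torch


def dynamic_chunk_mask(num_frames: int, max_chunk: int,
                       full_context_prob: float = 0.5) -> torch.Tensor:
    """Build a (num_frames, num_frames) boolean attention mask for one batch."""
    # Some batches keep full context so the model also learns non-streaming behaviour.
    if torch.rand(1).item() < full_context_prob:
        return torch.ones(num_frames, num_frames, dtype=torch.bool)
    # Otherwise sample a chunk size; frame i may attend to frame j only if
    # j's chunk index is not greater than i's chunk index.
    chunk = int(torch.randint(1, max_chunk + 1, (1,)).item())
    idx = torch.arange(num_frames)
    return (idx.unsqueeze(0) // chunk) <= (idx.unsqueeze(1) // chunk)
```

At inference, fixing the chunk size yields streaming recognition, while setting it to the full utterance length reproduces non-streaming decoding with the same weights.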

Experimental Setup and Results

The experiments were conducted on a large-scale dataset that is predominantly Mandarin with a smaller English portion, and results were benchmarked against Dense-225M and Dense-1B models. The MoE-1B model achieved a WER comparable to the Dense-1B model while preserving the real-time decoding efficiency of the Dense-225M setup.

  • WER and Model Efficiency: The MoE-1B model achieves a WER comparable to the Dense-1B model while being significantly more computationally efficient, combining the benefits of scaled performance with practical deployability.
  • Inference Efficiency: In terms of Real Time Factor (RTF), the MoE-1B model essentially matches the Dense-225M model despite having a parameter count close to that of the Dense-1B model, highlighting the efficiency of the MoE substitution (a measurement sketch follows this list).
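
For context, RTF is the ratio of wall-clock decoding time to audio duration, so values below 1 mean faster than real time. A minimal measurement sketch follows, where decode_fn is a placeholder callable rather than an API from the paper.

```python
import time


def real_time_factor(decode_fn, audio, audio_seconds: float) -> float:
    """Measure RTF: wall-clock decoding time divided by audio duration."""
    start = time.perf_counter()
    decode_fn(audio)                        # run recognition on the utterance
    elapsed = time.perf_counter() - start
    return elapsed / audio_seconds
```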

Discussion on Streaming Abilities

A noteworthy aspect of this work is extending MoE integration to support streaming capabilities, a challenge often encountered with large-scale models. By employing a two-stage training approach that first establishes a robust non-streaming base before transitioning to a streaming-compatible configuration, the U2++ MoE successfully supports real-time ASR processing demands without degrading performance.
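
One possible shape of such a two-stage schedule is sketched below; the stage names, flags, and chunk setting are illustrative assumptions rather than values reported in the paper.

```python
# Hypothetical two-stage schedule for a U2++-style MoE model.
TRAINING_STAGES = [
    # Stage 1: train with full-context attention to establish a strong
    # non-streaming base model.
    {"name": "non_streaming_base", "dynamic_chunk": False},
    # Stage 2: continue training with dynamic chunk masking so the same
    # weights also work under the limited context of streaming inference.
    {"name": "streaming_adaptation", "dynamic_chunk": True, "max_chunk": 25},
]
```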

Future Implications and Developments

This research lays foundational work for further exploration into simple yet effective scaling strategies for ASR systems, particularly in how MoE layers can be utilized across different neural network architectures beyond Conformers. The findings encourage the pursuit of MoE models that prioritize not just performance but also operational efficiency and flexibility across different deployment scenarios, possibly extending beyond speech recognition into other domains of AI that require large-scale modeling capabilities.
