PanGu-Σ: Towards Trillion Parameter Language Model with Sparse Heterogeneous Computing

Published 20 Mar 2023 in cs.CL | (2303.10845v1)

Abstract: The scaling of LLMs has greatly improved natural language understanding, generation, and reasoning. In this work, we develop a system that trained a trillion-parameter LLM on a cluster of Ascend 910 AI processors with the MindSpore framework, and present the LLM with 1.085T parameters named PanGu-Σ. With parameters inherited from PanGu-α, we extend the dense Transformer model to a sparse one with Random Routed Experts (RRE), and efficiently train the model over 329B tokens by using Expert Computation and Storage Separation (ECSS). This resulted in a 6.3x increase in training throughput through heterogeneous computing. Our experimental findings show that PanGu-Σ provides state-of-the-art performance in zero-shot learning of various Chinese NLP downstream tasks. Moreover, it demonstrates strong abilities when fine-tuned on application data for open-domain dialogue, question answering, machine translation and code generation.


Authors (17)


Citations (51)


Summary

The paper extends the dense PanGu-α Transformer into a sparse model with 1.085 trillion parameters by adding Random Routed Experts (RRE).
The paper trains the model over 329 billion tokens using Expert Computation and Storage Separation (ECSS) and reports a 6.3x increase in training throughput through heterogeneous computing.
The paper reports state-of-the-art zero-shot performance on a range of Chinese NLP tasks, and strong results after fine-tuning for open-domain dialogue, question answering, machine translation, and code generation.

Overview of PanGu-Σ: Towards Trillion Parameter Language Model with Sparse Heterogeneous Computing

The paper "PanGu-Σ: Towards Trillion Parameter Language Model with Sparse Heterogeneous Computing" introduces PanGu-Σ, a language model with 1.085 trillion parameters trained using sparse heterogeneous computing. The work builds on PanGu-α, extending its dense Transformer architecture into a sparse one with Random Routed Experts (RRE). The model was trained over 329 billion tokens, and the authors report a 6.3-fold increase in training throughput from heterogeneous computing with Expert Computation and Storage Separation (ECSS).

Model Architecture

PanGu-Σ adopts a sparse architecture in which RRE activates only a subset of the model's parameters for each input. As in other mixture-of-experts designs, this keeps the computation per token well below what a dense model with the same parameter count would need. Combined with heterogeneous computing, the architecture is reported to scale training to the trillion-parameter range with substantially higher throughput and without compromising model performance.
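To make the routing idea concrete, here is a minimal sketch of two-level random routing. It is an illustration under stated assumptions, not the paper's implementation: it assumes experts are grouped by domain and that, within a group, each token id is mapped to one expert by a lookup table that is drawn at random once and never trained. The names `build_routing_table`, `route`, and `experts_by_domain`, as well as the domains and sizes used, are hypothetical.

```python
import random


def build_routing_table(vocab_size, experts_by_domain, seed=0):
    """Draw a fixed random expert for every (domain, token id) pair.

    Assumption: the table is sampled once and then frozen, so routing
    needs no learned gating network.
    """
    rng = random.Random(seed)
    return {
        domain: [rng.choice(experts) for _ in range(vocab_size)]
        for domain, experts in experts_by_domain.items()
    }


def route(table, domain, token_ids):
    """Level 1 selects the domain's expert group; level 2 looks up the
    fixed random expert assigned to each token id within that group."""
    per_token = table[domain]
    return [per_token[t] for t in token_ids]


# Hypothetical setup: two domains with four experts each.
experts_by_domain = {"dialogue": [0, 1, 2, 3], "code": [4, 5, 6, 7]}
table = build_routing_table(vocab_size=1000, experts_by_domain=experts_by_domain)

# Token id 5 appears twice and is sent to the same expert both times.
assignments = route(table, "code", [5, 17, 5, 999])
print(assignments)
```

Because each token is sent to a single expert, only that expert's parameters take part in the forward and backward pass for the token, which is where the savings over a dense model come from.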

Dataset and Training Process

PanGu-Σ is trained over 329 billion tokens, chosen to cover a wide range of language use. During training, ECSS separates expert computation from expert storage, which reduces the memory and compute demands that usually come with models of this size.

Performance and Results

Empirical evaluations show that PanGu-Σ achieves state-of-the-art zero-shot performance across various Chinese NLP tasks, covering both natural language understanding and generation. The model also performs strongly when fine-tuned for open-domain dialogue, question answering, machine translation, and code generation.
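The separation of expert computation and storage described under "Dataset and Training Process" can be illustrated with a toy sketch. It assumes that all expert weights live in a large host-side store and that only the experts activated in a step are copied to the accelerator, updated, and written back. Everything here is hypothetical: `ExpertStore`, `train_step`, and the plain SGD update stand in for whatever the actual system does, and Python lists stand in for device tensors.

```python
class ExpertStore:
    """Host-side storage for every expert's weights (stand-in for CPU memory)."""

    def __init__(self, num_experts, dim):
        self.weights = {e: [0.0] * dim for e in range(num_experts)}
        self.transfers = 0  # number of experts copied to the device so far

    def fetch(self, expert_ids):
        """Copy only the requested experts to the (simulated) device."""
        self.transfers += len(expert_ids)
        return {e: list(self.weights[e]) for e in expert_ids}

    def write_back(self, updated):
        """Store the updated experts back on the host."""
        for e, w in updated.items():
            self.weights[e] = w


def train_step(store, active_experts, grads, lr=0.1):
    """Update only the experts activated in this step (plain SGD, assumed)."""
    device = store.fetch(sorted(set(active_experts)))
    for e, w in device.items():
        device[e] = [wi - lr * gi for wi, gi in zip(w, grads[e])]
    store.write_back(device)
    return device


# Hypothetical step: 8 experts in storage, only experts 1 and 3 are active.
store = ExpertStore(num_experts=8, dim=2)
train_step(store, active_experts=[1, 3, 3], grads={1: [1.0, 1.0], 3: [2.0, 2.0]})
print(store.transfers, store.weights[1], store.weights[0])
```

In this sketch the device only ever holds the active experts, so device memory scales with the number of experts used per step rather than with the total parameter count.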

Implications and Future Directions

PanGu-Σ has implications for both AI research and practical applications. On the research side, its use of sparsity and heterogeneous computing could inform future work on scaling AI systems efficiently. On the practical side, its performance across diverse tasks suggests potential deployments wherever language understanding and generation are critical, such as customer support, automated translation services, and software development.

Future work might explore further optimizations of sparsity strategies, possibly extending the approach to multilingual settings or more domain-specific tasks. Refining sparse heterogeneous computing techniques for distributed training could also yield further efficiency gains and make large-scale model training more accessible across different computational infrastructures.

In summary, PanGu-Σ contributes to the landscape of trillion-parameter models by demonstrating effective strategies for scaling up with sparse heterogeneous computing, with relevance to both the study and the deployment of LLMs.
