Turbo Sparse: Achieving LLM SOTA Performance with Minimal Activated Parameters

(2406.05955)
Published Jun 10, 2024 in cs.LG and cs.CL

Abstract

Exploiting activation sparsity is a promising approach to significantly accelerating the inference process of LLMs without compromising performance. However, activation sparsity is determined by activation functions, and commonly used ones like SwiGLU and GeGLU exhibit limited sparsity. Simply replacing these functions with ReLU fails to achieve sufficient sparsity. Moreover, inadequate training data can further increase the risk of performance degradation. To address these challenges, we propose a novel dReLU function, which is designed to improve LLM activation sparsity, along with a high-quality training data mixture ratio to facilitate effective sparsification. Additionally, we leverage sparse activation patterns within the Feed-Forward Network (FFN) experts of Mixture-of-Experts (MoE) models to further boost efficiency. By applying our neuron sparsification method to the Mistral and Mixtral models, only 2.5 billion and 4.3 billion parameters are activated per inference iteration, respectively, while achieving even more powerful model performance. Evaluation results demonstrate that this sparsity achieves a 2-5x decoding speedup. Remarkably, on mobile phones, our TurboSparse-Mixtral-47B achieves an inference speed of 11 tokens per second. Our models are available at https://huggingface.co/PowerInfer.

dReLU-based sparsified models outperform similar models on the Open LLM Leaderboard, especially TurboSparse-Mixtral-47B.

Overview

  • The paper introduces a new activation function called dReLU, which, when integrated with Mixture-of-Experts (MoE) models, significantly reduces computational needs while maintaining the performance of LLMs.

  • The proposed dReLU function increases sparsity by applying ReLU after both the gate and up projections, so a neuron is active only when both projections produce positive values. This reduces the number of parameters activated during inference and yields substantial improvements in computational efficiency and speed.

  • Results demonstrate that models using dReLU, such as TurboSparse-Mistral-7B and TurboSparse-Mixtral-47B, achieve faster inference speeds and higher rankings on the Open LLM Leaderboard compared to non-sparsified models, with notable practical and theoretical implications for efficient AI deployment.

Analysis of Turbo Sparse: Achieving LLM SOTA Performance with Minimal Activated Parameters

The paper "Turbo Sparse: Achieving LLM SOTA Performance with Minimal Activated Parameters" introduces a novel approach to enhance the efficiency of LLMs through activation sparsity. The primary innovation is the development of a new activation function, termed dReLU, and the integration of this function with Mixture-of-Experts (MoE) models to achieve substantial reductions in computational requirements without a corresponding performance degradation.

Introduction and Motivation

As LLMs scale up, the computational demands and associated costs become substantial barriers. Existing dense models activate all parameters during inference, leading to inefficiencies. Conditional computation techniques like MoE address this by activating only parts of the model relevant to specific inputs, but the activation functions commonly used, such as SwiGLU and GeGLU, do not sufficiently leverage activation sparsity.

The authors posit that the inefficiency of common activation functions and the need for extensive training data are major hurdles. Existing methods like ReLUfication, which replaces smoother activation functions with ReLU, have not achieved the desired sparsity and are prone to performance degradation. This paper proposes an innovative solution in the form of a dReLU function combined with a high-quality training data mixture to tackle these challenges effectively.

Methodology

The dReLU Activation Function

The paper introduces dReLU to replace the conventional activation functions used in LLMs. dReLU maximizes sparsity by applying ReLU after both the gate and up projections, so a neuron contributes to the FFN output only when both projections produce positive values. This contrasts with the traditional SwiGLU formulation, whose smooth SiLU gating leaves inactive neurons with small but nonzero outputs and never masks the up projection at all, leading to continued inefficiencies. With dReLU, activation sparsity approaches 90%.
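The contrast can be made concrete with a short PyTorch sketch. The module structure, names, and dimensions below are illustrative assumptions rather than the authors' implementation; the only substantive point is where the ReLU (versus SiLU) nonlinearities sit.

```python
# Sketch contrasting a standard SwiGLU FFN with a dReLU FFN, following the
# paper's description (ReLU applied after both the gate and up projections).
# Module names and dimensions are illustrative, not the authors' code.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SwiGLUFFN(nn.Module):
    """Standard gated FFN: SiLU on the gate, no masking of the up projection."""
    def __init__(self, d_model: int, d_ff: int):
        super().__init__()
        self.gate = nn.Linear(d_model, d_ff, bias=False)
        self.up = nn.Linear(d_model, d_ff, bias=False)
        self.down = nn.Linear(d_ff, d_model, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.down(F.silu(self.gate(x)) * self.up(x))

class DReLUFFN(nn.Module):
    """dReLU FFN: a neuron is active only if BOTH projections are positive."""
    def __init__(self, d_model: int, d_ff: int):
        super().__init__()
        self.gate = nn.Linear(d_model, d_ff, bias=False)
        self.up = nn.Linear(d_model, d_ff, bias=False)
        self.down = nn.Linear(d_ff, d_model, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.down(F.relu(self.gate(x)) * F.relu(self.up(x)))

# Both blocks map (batch, seq, d_model) -> (batch, seq, d_model).
x = torch.randn(2, 8, 512)
print(DReLUFFN(512, 2048)(x).shape)  # torch.Size([2, 8, 512])
```

Because the elementwise product in DReLUFFN is exactly zero whenever either projection is non-positive, the corresponding rows of the down projection can be skipped entirely at inference time.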

Implementation and Validation

The methodology was first tested by applying dReLU during the pretraining of small-scale LLMs on a diverse mix of open-source data (web, code, and mathematical datasets) chosen to make the sparsification effective. The approach was then validated with TurboSparse-Mistral-7B and TurboSparse-Mixtral-47B, which activate only 2.5 billion and 4.3 billion parameters per inference iteration, respectively, yet match or outperform their original counterparts.
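As a rough illustration of how such activation sparsity can be measured, the snippet below counts the fraction of intermediate neurons that a dReLU feed-forward layer zeroes out. Everything here is an assumption for illustration: the layers are untrained and the inputs random, so the ratio lands near 75% (each projection is positive about half the time); the roughly 90% figure reported in the paper is measured on trained models over real data.

```python
# Hedged sketch: estimating dReLU activation sparsity on random inputs.
# Untrained weights and random inputs are stand-ins for illustration only.
import torch
import torch.nn.functional as F

@torch.no_grad()
def drelu_sparsity(gate: torch.nn.Linear, up: torch.nn.Linear,
                   x: torch.Tensor) -> float:
    """Fraction of intermediate neurons whose dReLU output is exactly zero."""
    hidden = F.relu(gate(x)) * F.relu(up(x))
    return (hidden == 0).float().mean().item()

d_model, d_ff = 512, 2048
gate = torch.nn.Linear(d_model, d_ff, bias=False)
up = torch.nn.Linear(d_model, d_ff, bias=False)
x = torch.randn(8, 16, d_model)              # (batch, sequence, d_model)
print(f"zero-activation ratio: {drelu_sparsity(gate, up, x):.2%}")
```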

Results

The evaluation of the dReLU-based models demonstrated a 2 to 5 times decoding speedup. Notably, the TurboSparse-Mixtral-47B model reached an inference speed of 11 tokens per second on mobile phones. The sparsified models also ranked consistently higher on the Open LLM Leaderboard than comparable non-sparsified models.

Practical and Theoretical Implications

Practically, the dReLU-based approach allows high-performing LLMs to be deployed on less powerful hardware, significantly reducing resource requirements and making advanced AI technologies more accessible. Activating fewer parameters per token also lowers energy consumption, an environmentally friendly side benefit.

Theoretically, this research underscores the importance of investigating intrinsic activation sparsity within LLMs and MoE architectures. By effectively leveraging sparsity, the field can move toward more efficient AI models that reconcile the demands of performance and cost. The finding that neuron sparsification extends successfully to MoE models suggests that further research into neuron-level sparse computation within such frameworks could yield additional efficiency gains; a sketch of what that computation looks like follows.
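To make "neuron-level sparse computation" concrete, the sketch below evaluates a single dReLU-style FFN expert using only the weight rows and columns of neurons predicted to be active. The activity mask, tensor names, and shapes are hypothetical; real systems (e.g. PowerInfer-style runtimes) obtain the mask from a learned activation predictor and use specialized kernels rather than this dense gather.

```python
# Illustrative sketch of neuron-level sparse FFN computation for one expert:
# only the weight rows/columns of neurons predicted to be active are used.
# The activity mask is assumed to come from an external predictor; this is
# not the authors' kernel implementation.
import torch
import torch.nn.functional as F

@torch.no_grad()
def sparse_drelu_ffn(x: torch.Tensor,
                     w_gate: torch.Tensor,   # (d_ff, d_model)
                     w_up: torch.Tensor,     # (d_ff, d_model)
                     w_down: torch.Tensor,   # (d_model, d_ff)
                     active: torch.Tensor    # (d_ff,) bool mask of active neurons
                     ) -> torch.Tensor:
    idx = active.nonzero(as_tuple=True)[0]   # indices of predicted-active neurons
    g = F.relu(x @ w_gate[idx].T)            # compute only the active rows
    u = F.relu(x @ w_up[idx].T)
    return (g * u) @ w_down[:, idx].T        # and only the matching columns

# Toy usage: a random mask stands in for a learned activation predictor.
d_model, d_ff = 512, 2048
x = torch.randn(4, d_model)
w_gate, w_up = torch.randn(d_ff, d_model), torch.randn(d_ff, d_model)
w_down = torch.randn(d_model, d_ff)
mask = torch.rand(d_ff) < 0.1                # ~10% of neurons "active"
y = sparse_drelu_ffn(x, w_gate, w_up, w_down, mask)
print(y.shape)                               # torch.Size([4, 512])
```

When only a small fraction of neurons is active, all three matrix multiplications shrink proportionally, which is the mechanism behind the decoding speedups reported above.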

Future Directions

The paper opens the door to several future research trajectories:

  1. Extended Pretraining: The effectiveness of dReLU should be tested with extended pretraining data beyond the 150 billion tokens used in the current study to fully ascertain its robustness across larger datasets.
  2. Deeper Integration with Hardware: There is potential for hardware-software co-design to further optimize the sparse computation patterns identified. This could involve developing new hardware architectures tailored to sparse neural networks.
  3. Broader Activation Functions: Exploring other variations of activation functions that could further enhance sparsity while maintaining or improving model performance.
  4. Scalability: Assessing the performance of dReLU on larger, more complex models and in varied application domains can provide insights into its scalability and generalizability.

Conclusion

The introduction of dReLU and its subsequent application to LLMs and MoE models mark a significant stride towards more efficient AI models. By achieving high sparsity levels without compromising performance, this approach promises to lower the barriers to deploying advanced LLMs, making them more environmentally friendly and accessible to a wider range of users. This work sets a precedent for future research in conditional computation, urging the AI community to continually seek innovative solutions for model efficiency.

The complete findings and the released TurboSparse models are available from PowerInfer on Hugging Face: https://huggingface.co/PowerInfer
