Turbo Sparse: Achieving LLM SOTA Performance with Minimal Activated Parameters

(2406.05955)
Published Jun 10, 2024 in cs.LG and cs.CL

Abstract

Exploiting activation sparsity is a promising approach to significantly accelerating the inference process of LLMs without compromising performance. However, activation sparsity is determined by activation functions, and commonly used ones like SwiGLU and GeGLU exhibit limited sparsity. Simply replacing these functions with ReLU fails to achieve sufficient sparsity. Moreover, inadequate training data can further increase the risk of performance degradation. To address these challenges, we propose a novel dReLU function, which is designed to improve LLM activation sparsity, along with a high-quality training data mixture ratio to facilitate effective sparsification. Additionally, we leverage sparse activation patterns within the Feed-Forward Network (FFN) experts of Mixture-of-Experts (MoE) models to further boost efficiency. By applying our neuron sparsification method to the Mistral and Mixtral models, only 2.5 billion and 4.3 billion parameters are activated per inference iteration, respectively, while achieving even more powerful model performance. Evaluation results demonstrate that this sparsity achieves a 2-5x decoding speedup. Remarkably, on mobile phones, our TurboSparse-Mixtral-47B achieves an inference speed of 11 tokens per second. Our models are available at https://huggingface.co/PowerInfer.

dReLU-based sparsified models outperform similar models on the Open LLM Leaderboard, especially TurboSparse-Mixtral-47B.

Overview

  • The paper introduces a new activation function called dReLU, which, when integrated with Mixture-of-Experts (MoE) models, significantly reduces computational needs while maintaining the performance of LLMs.

  • The proposed dReLU function increases sparsity by applying ReLU after both the gate and up projections, so a neuron is active only when both projections produce positive values. This reduces the number of parameters activated during inference and yields substantial improvements in computational efficiency and speed.

  • Results demonstrate that models using dReLU, such as TurboSparse-Mistral-7B and TurboSparse-Mixtral-47B, achieve faster inference speeds and higher rankings on the Open LLM Leaderboard compared to non-sparsified models, with notable practical and theoretical implications for efficient AI deployment.

Analysis of Turbo Sparse: Achieving LLM SOTA Performance with Minimal Activated Parameters

The paper "Turbo Sparse: Achieving LLM SOTA Performance with Minimal Activated Parameters" introduces a novel approach to enhance the efficiency of LLMs through activation sparsity. The primary innovation is the development of a new activation function, termed dReLU, and the integration of this function with Mixture-of-Experts (MoE) models to achieve substantial reductions in computational requirements without a corresponding performance degradation.

Introduction and Motivation

As LLMs scale up, the computational demands and associated costs become substantial barriers. Existing dense models activate all parameters during inference, leading to inefficiencies. Conditional computation techniques like MoE address this by activating only parts of the model relevant to specific inputs, but the activation functions commonly used, such as SwiGLU and GeGLU, do not sufficiently leverage activation sparsity.

The authors posit that the inefficiency of common activation functions and the need for extensive training data are major hurdles. Existing methods like ReLUfication, which replaces smoother activation functions with ReLU, have not achieved the desired sparsity and are prone to performance degradation. This paper proposes an innovative solution in the form of a dReLU function combined with a high-quality training data mixture to tackle these challenges effectively.

Methodology

The dReLU Activation Function

The paper introduces dReLU to replace the conventional activation functions used in LLMs. dReLU maximizes sparsity by applying ReLU after both the gate and up projections, so a neuron contributes to the FFN output only when both projections produce positive values. This contrasts with the traditional SwiGLU formulation, whose smooth SiLU gating leaves inactive neurons with small but nonzero outputs and never masks the up projection at all, leading to continued inefficiencies. With dReLU, activation sparsity approaches 90%.
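The contrast can be made concrete with a short PyTorch sketch. The module structure, names, and dimensions below are illustrative assumptions rather than the authors' implementation; the only substantive point is where the ReLU (versus SiLU) nonlinearities sit.

```python
# Sketch contrasting a standard SwiGLU FFN with a dReLU FFN, following the
# paper's description (ReLU applied after both the gate and up projections).
# Module names and dimensions are illustrative, not the authors' code.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SwiGLUFFN(nn.Module):
    """Standard gated FFN: SiLU on the gate, no masking of the up projection."""
    def __init__(self, d_model: int, d_ff: int):
        super().__init__()
        self.gate = nn.Linear(d_model, d_ff, bias=False)
        self.up = nn.Linear(d_model, d_ff, bias=False)
        self.down = nn.Linear(d_ff, d_model, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.down(F.silu(self.gate(x)) * self.up(x))

class DReLUFFN(nn.Module):
    """dReLU FFN: a neuron is active only if BOTH projections are positive."""
    def __init__(self, d_model: int, d_ff: int):
        super().__init__()
        self.gate = nn.Linear(d_model, d_ff, bias=False)
        self.up = nn.Linear(d_model, d_ff, bias=False)
        self.down = nn.Linear(d_ff, d_model, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.down(F.relu(self.gate(x)) * F.relu(self.up(x)))

# Both blocks map (batch, seq, d_model) -> (batch, seq, d_model).
x = torch.randn(2, 8, 512)
print(DReLUFFN(512, 2048)(x).shape)  # torch.Size([2, 8, 512])
```

Because the elementwise product in DReLUFFN is exactly zero whenever either projection is non-positive, the corresponding rows of the down projection can be skipped entirely at inference time.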

Implementation and Validation

The methodology was first tested by applying dReLU during the pretraining of small-scale LLMs on a diverse mix of open-source data (web, code, and mathematical datasets) chosen to make the sparsification effective. The approach was then validated with TurboSparse-Mistral-7B and TurboSparse-Mixtral-47B, which activate only 2.5 billion and 4.3 billion parameters per inference iteration, respectively, yet match or outperform their original counterparts.
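As a rough illustration of how such activation sparsity can be measured, the snippet below counts the fraction of intermediate neurons that a dReLU feed-forward layer zeroes out. Everything here is an assumption for illustration: the layers are untrained and the inputs random, so the ratio lands near 75% (each projection is positive about half the time); the roughly 90% figure reported in the paper is measured on trained models over real data.

```python
# Hedged sketch: estimating dReLU activation sparsity on random inputs.
# Untrained weights and random inputs are stand-ins for illustration only.
import torch
import torch.nn.functional as F

@torch.no_grad()
def drelu_sparsity(gate: torch.nn.Linear, up: torch.nn.Linear,
                   x: torch.Tensor) -> float:
    """Fraction of intermediate neurons whose dReLU output is exactly zero."""
    hidden = F.relu(gate(x)) * F.relu(up(x))
    return (hidden == 0).float().mean().item()

d_model, d_ff = 512, 2048
gate = torch.nn.Linear(d_model, d_ff, bias=False)
up = torch.nn.Linear(d_model, d_ff, bias=False)
x = torch.randn(8, 16, d_model)              # (batch, sequence, d_model)
print(f"zero-activation ratio: {drelu_sparsity(gate, up, x):.2%}")
```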

Results

The evaluation of the dReLU-based models demonstrated a 2 to 5 times decoding speedup. Notably, the TurboSparse-Mixtral-47B model reached an inference speed of 11 tokens per second on mobile phones. The sparsified models also ranked consistently higher on the Open LLM Leaderboard than comparable non-sparsified models.

Practical and Theoretical Implications

Practically, the dReLU-based approach allows high-performing LLMs to be deployed on less powerful hardware, significantly reducing resource requirements and making advanced AI technologies more accessible. Activating fewer parameters per token also lowers energy consumption, an environmentally friendly side benefit.

Theoretically, this research underscores the importance of investigating intrinsic activation sparsity within LLMs and MoE architectures. By effectively leveraging sparsity, the field can move toward more efficient AI models that reconcile the demands of performance and cost. The finding that neuron sparsification extends successfully to MoE models suggests that further research into neuron-level sparse computation within such frameworks could yield additional efficiency gains; a sketch of what that computation looks like follows.
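To make "neuron-level sparse computation" concrete, the sketch below evaluates a single dReLU-style FFN expert using only the weight rows and columns of neurons predicted to be active. The activity mask, tensor names, and shapes are hypothetical; real systems (e.g. PowerInfer-style runtimes) obtain the mask from a learned activation predictor and use specialized kernels rather than this dense gather.

```python
# Illustrative sketch of neuron-level sparse FFN computation for one expert:
# only the weight rows/columns of neurons predicted to be active are used.
# The activity mask is assumed to come from an external predictor; this is
# not the authors' kernel implementation.
import torch
import torch.nn.functional as F

@torch.no_grad()
def sparse_drelu_ffn(x: torch.Tensor,
                     w_gate: torch.Tensor,   # (d_ff, d_model)
                     w_up: torch.Tensor,     # (d_ff, d_model)
                     w_down: torch.Tensor,   # (d_model, d_ff)
                     active: torch.Tensor    # (d_ff,) bool mask of active neurons
                     ) -> torch.Tensor:
    idx = active.nonzero(as_tuple=True)[0]   # indices of predicted-active neurons
    g = F.relu(x @ w_gate[idx].T)            # compute only the active rows
    u = F.relu(x @ w_up[idx].T)
    return (g * u) @ w_down[:, idx].T        # and only the matching columns

# Toy usage: a random mask stands in for a learned activation predictor.
d_model, d_ff = 512, 2048
x = torch.randn(4, d_model)
w_gate, w_up = torch.randn(d_ff, d_model), torch.randn(d_ff, d_model)
w_down = torch.randn(d_model, d_ff)
mask = torch.rand(d_ff) < 0.1                # ~10% of neurons "active"
y = sparse_drelu_ffn(x, w_gate, w_up, w_down, mask)
print(y.shape)                               # torch.Size([4, 512])
```

When only a small fraction of neurons is active, all three matrix multiplications shrink proportionally, which is the mechanism behind the decoding speedups reported above.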

Future Directions

The paper opens the door to several future research trajectories:

  1. Extended Pretraining: The effectiveness of dReLU should be tested with extended pretraining data beyond the 150 billion tokens used in the current study to fully ascertain its robustness across larger datasets.
  2. Deeper Integration with Hardware: There is potential for hardware-software co-design to further optimize the sparse computation patterns identified. This could involve developing new hardware architectures tailored to sparse neural networks.
  3. Broader Activation Functions: Exploring other variations of activation functions that could further enhance sparsity while maintaining or improving model performance.
  4. Scalability: Assessing the performance of dReLU on larger, more complex models and in varied application domains can provide insights into its scalability and generalizability.

Conclusion

The introduction of dReLU and its subsequent application to LLMs and MoE models mark a significant stride towards more efficient AI models. By achieving high sparsity levels without compromising performance, this approach promises to lower the barriers to deploying advanced LLMs, making them more environmentally friendly and accessible to a wider range of users. This work sets a precedent for future research in conditional computation, urging the AI community to continually seek innovative solutions for model efficiency.

The complete findings and the released TurboSparse models are available from PowerInfer on Hugging Face: https://huggingface.co/PowerInfer
