ReLU Strikes Back: Exploiting Activation Sparsity in Large Language Models

Published 6 Oct 2023 in cs.LG and cs.AI | (2310.04564v1)

Abstract: LLMs with billions of parameters have drastically transformed AI applications. However, their demanding computation during inference has raised significant challenges for deployment on resource-constrained devices. Despite recent trends favoring alternative activation functions such as GELU or SiLU, known for increased computation, this study strongly advocates for reinstating ReLU activation in LLMs. We demonstrate that using the ReLU activation function has a negligible impact on convergence and performance while significantly reducing computation and weight transfer. This reduction is particularly valuable during the memory-bound inference step, where efficiency is paramount. Exploring sparsity patterns in ReLU-based LLMs, we unveil the reutilization of activated neurons for generating new tokens and leveraging these insights, we propose practical strategies to substantially reduce LLM inference computation up to three times, using ReLU activations with minimal performance trade-offs.

Summary

The paper demonstrates that using ReLU yields significant activation sparsity, reducing FLOPS by up to 32% while maintaining competitive accuracy.
It introduces 'relufication', a two-stage process of replacing activations and adding extra ReLU layers to achieve up to threefold efficiency gains in large models.
Empirical results highlight that leveraging aggregated sparsity and modified ReLU variants can optimize inference in resource-constrained settings.

Exploiting Activation Sparsity in LLMs: A Case for ReLU

LLMs have transformed artificial intelligence applications, but the computational demands during inference create challenges for deployment in resource-constrained environments. This paper investigates the role of activation functions and re-evaluates the potential use of the Rectified Linear Unit (ReLU) in LLMs. The study explores activation sparsity to enhance model efficiency without significantly sacrificing performance, making the case for leveraging ReLU activations over alternatives like GELU and SiLU.

Activation Functions and Computational Load

The paper first challenges the trend favoring smoother activation functions in modern LLMs. Historically, alternatives such as GELU and SiLU have been preferred due to their marginal improvements in convergence and accuracy. However, through an experimental setup comparing these to ReLU, the study finds that the performance differences are negligible when models are trained on substantial datasets. The authors argue that while smoother activation functions may offer slight performance gains, the increased computational cost during inference outweighs these benefits when efficiency is prioritized.

Activation Sparsity: Theoretical Insights and Empirical Results

A key element of this research is the discussion of activation sparsity—a phenomenon where a substantial portion of neurons remains inactive (zeroed-out) during forward passes of the network. The paper illustrates that ReLU induces significant activation sparsity, thereby reducing the number of floating-point operations (FLOPS) during inference. For example, in an OPT model using ReLU, the sparsity in some layers can exceed 90%, translating into a 32% reduction in computation needed for inference compared to baseline models using GELU or SiLU.
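To make the notion of activation sparsity concrete, the following minimal sketch measures the fraction of exactly-zero outputs produced by ReLU and by GELU on the same inputs. The standard-normal pre-activations and the helper name `sparsity` are assumptions made for illustration; they are not taken from the paper, and a trained model's pre-activation distribution (and therefore its sparsity) will differ.

```python
import math
import random


def relu(x: float) -> float:
    return max(0.0, x)


def gelu(x: float) -> float:
    # Exact GELU: x times the standard normal CDF evaluated at x.
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))


def sparsity(values: list[float]) -> float:
    """Fraction of entries that are exactly zero."""
    return sum(1 for v in values if v == 0.0) / len(values)


# Assumption: pre-activations are standard normal, purely for illustration.
rng = random.Random(0)
pre_activations = [rng.gauss(0.0, 1.0) for _ in range(10_000)]

relu_sparsity = sparsity([relu(x) for x in pre_activations])
gelu_sparsity = sparsity([gelu(x) for x in pre_activations])

print(f"ReLU sparsity: {relu_sparsity:.2f}")
print(f"GELU sparsity: {gelu_sparsity:.2f}")
```

ReLU maps every negative input to exactly zero, whereas GELU maps negative inputs to small non-zero values, so only the ReLU outputs can be skipped outright in later computation.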

Practical Efficiency Gains Through "Relufication"

The authors introduce the concept of "relufication," which involves replacing existing activation functions with ReLU in pretrained LLMs and further optimizing the network structure. The paper describes two stages of this process:

Replacement of Activation Functions: Pretrained models that originally use non-ReLU activations are fine-tuned with ReLU in their place, which significantly increases activation sparsity.
Insertion of Additional ReLU Layers: Extra ReLU layers are placed after the normalization layers in both the attention and feed-forward components, which further increases sparsity and decreases FLOPS without notable accuracy loss.
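The two stages can be sketched structurally on a toy feed-forward block. This is only an illustration under stated assumptions: the function `ffn`, its `relu_after_norm` flag, the dimensions, and the random weights are all hypothetical, and the fine-tuning that the paper performs after each change is not shown.

```python
import math
import random

Vector = list[float]
Matrix = list[Vector]  # row-major: matrix[i] is row i


def relu(x: float) -> float:
    return max(0.0, x)


def silu(x: float) -> float:
    return x / (1.0 + math.exp(-x))


def matvec(m: Matrix, v: Vector) -> Vector:
    return [sum(w * x for w, x in zip(row, v)) for row in m]


def layer_norm(v: Vector, eps: float = 1e-5) -> Vector:
    mean = sum(v) / len(v)
    var = sum((x - mean) ** 2 for x in v) / len(v)
    return [(x - mean) / math.sqrt(var + eps) for x in v]


def zero_fraction(v: Vector) -> float:
    return sum(1 for x in v if x == 0.0) / len(v)


def ffn(x, w_up, w_down, act, relu_after_norm=False):
    """Toy feed-forward block; returns (output, projection input, hidden)."""
    h = layer_norm(x)
    if relu_after_norm:  # stage 2: extra ReLU after normalization
        h = [relu(a) for a in h]
    hidden = [act(a) for a in matvec(w_up, h)]
    return matvec(w_down, hidden), h, hidden


# Hypothetical sizes and random weights, for illustration only.
rng = random.Random(0)
d_model, d_ff = 16, 64
x = [rng.gauss(0.0, 1.0) for _ in range(d_model)]
w_up = [[rng.gauss(0.0, 0.25) for _ in range(d_model)] for _ in range(d_ff)]
w_down = [[rng.gauss(0.0, 0.25) for _ in range(d_ff)] for _ in range(d_model)]

_, h0, hid0 = ffn(x, w_up, w_down, silu)                        # original
_, h1, hid1 = ffn(x, w_up, w_down, relu)                        # stage 1
_, h2, hid2 = ffn(x, w_up, w_down, relu, relu_after_norm=True)  # stage 2

print("hidden sparsity, SiLU:", zero_fraction(hid0))
print("hidden sparsity, ReLU:", zero_fraction(hid1))
print("up-projection input sparsity, stage 2:", zero_fraction(h2))
```

Stage 1 makes the input to the down projection sparse; stage 2 additionally makes the input to the up projection sparse, so both matrix multiplications in the block can skip work.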

Models subjected to relufication became substantially more efficient. For large models, relufication reduced FLOPS by up to a factor of three, lowering compute and memory requirements while maintaining competitive performance on standard NLP benchmarks.
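Why zeros translate into saved work can be seen in a small sketch of a down projection that skips inactive neurons. The functions `down_proj_dense` and `down_proj_sparse`, the sizes, and the toy activation distribution are assumptions for illustration and do not reproduce the paper's measurements; the point is only that a zero activation lets both the multiply-accumulates and the read of that neuron's weight row be skipped without changing the result.

```python
import random

Vector = list[float]


def down_proj_dense(neuron_weights: list[Vector], hidden: Vector):
    """Accumulate every neuron's contribution, zeros included."""
    out = [0.0] * len(neuron_weights[0])
    macs = 0
    for weight_row, a in zip(neuron_weights, hidden):
        for i, w in enumerate(weight_row):
            out[i] += a * w
            macs += 1
    return out, macs


def down_proj_sparse(neuron_weights: list[Vector], hidden: Vector):
    """Skip neurons whose activation is exactly zero."""
    out = [0.0] * len(neuron_weights[0])
    macs = 0
    for weight_row, a in zip(neuron_weights, hidden):
        if a == 0.0:
            continue  # this neuron's weight row is never read
        for i, w in enumerate(weight_row):
            out[i] += a * w
            macs += 1
    return out, macs


# Hypothetical sizes; neuron_weights[j] holds the d_model output weights
# of hidden neuron j.
rng = random.Random(0)
d_model, d_ff = 8, 200
neuron_weights = [
    [rng.gauss(0.0, 1.0) for _ in range(d_model)] for _ in range(d_ff)
]
# Assumed toy activations: negative-mean Gaussian pre-activations through ReLU.
hidden = [max(0.0, rng.gauss(-1.0, 1.0)) for _ in range(d_ff)]

dense_out, dense_macs = down_proj_dense(neuron_weights, hidden)
sparse_out, sparse_macs = down_proj_sparse(neuron_weights, hidden)

print(f"dense MACs:  {dense_macs}")
print(f"sparse MACs: {sparse_macs}")
```

In practice, realizing this saving on real hardware depends on kernel and memory-layout support, which this pure-Python sketch does not model.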

Leveraging Aggregated Sparsity and Future Directions

The paper introduces the notion of aggregated sparsity, a measure of neuron utilization across several tokens. It shows that neurons activated while generating one token tend to be reused for subsequent tokens, so the set of neurons needed over a short window of tokens stays small. This creates an opportunity to streamline inference techniques such as speculative decoding, which gains additional speedup by exploiting the shared activations.
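One way to make aggregated sparsity concrete is to track, token by token, the fraction of neurons that have not yet fired for any token seen so far. The sketch below assumes that reading of the measure; the function name `aggregated_sparsity` and the hand-written activations are hypothetical. For comparison it also computes what the value would be if each token's active neurons were chosen independently, in which case a per-token sparsity s would decay to s**t after t tokens.

```python
Vector = list[float]


def aggregated_sparsity(activations: list[Vector]) -> list[float]:
    """After each token, the fraction of neurons unused by all tokens so far."""
    n = len(activations[0])
    used = [False] * n
    result = []
    for token_acts in activations:
        for j, a in enumerate(token_acts):
            if a != 0.0:
                used[j] = True
        result.append((n - sum(used)) / n)
    return result


# Hand-written toy activations: 3 tokens, 6 neurons (illustrative only).
tokens = [
    [1.0, 0.0, 0.0, 2.0, 0.0, 0.0],
    [3.0, 0.0, 0.0, 1.0, 0.0, 0.0],
    [0.0, 1.0, 0.0, 2.0, 0.0, 0.0],
]

observed = aggregated_sparsity(tokens)

# Baseline under an independence assumption: per-token sparsity s gives s**t.
per_token_sparsity = 4 / 6
independent_baseline = [per_token_sparsity ** t for t in range(1, 4)]

print("observed:", observed)
print("independent baseline:", independent_baseline)
```

In this toy example the observed values are 4/6, 4/6, and 3/6, which decay more slowly than the independent baseline because the same neurons keep being reused across tokens.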

The authors also explore the potential of modified ReLU activations, such as shifted ReLU, to further increase sparsity without compromising model performance. This direction suggests that performance optimization might be achieved through strategic manipulation of activation thresholds.
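The effect of moving the activation threshold can be sketched as follows, assuming the shifted form max(0, x - b) for a shift b. The shift values and the standard-normal inputs are illustrative assumptions rather than settings from the paper, and the sketch says nothing about accuracy, which has to be checked empirically.

```python
import random


def shifted_relu(x: float, shift: float) -> float:
    # Assumed form: inputs at or below `shift` are zeroed out.
    return max(0.0, x - shift)


# Assumption: standard-normal pre-activations, for illustration only.
rng = random.Random(0)
pre_activations = [rng.gauss(0.0, 1.0) for _ in range(10_000)]

shifts = [0.0, 0.5, 1.0]
sparsity_by_shift = {
    b: sum(1 for x in pre_activations if shifted_relu(x, b) == 0.0)
    / len(pre_activations)
    for b in shifts
}

for b, s in sparsity_by_shift.items():
    print(f"shift={b:.1f}  sparsity={s:.2f}")
```

A shift of zero recovers plain ReLU, and a larger shift zeroes a larger share of the inputs, which is the sense in which the threshold trades activation mass for sparsity.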

Conclusion

The research advocates for a reassessment of activation function preferences in LLMs, emphasizing activation sparsity as a means to reconcile robust performance with computational efficiency. By reviving ReLU, the study provides a practical pathway to more resource-efficient LLMs, potentially broadening deployment across various hardware environments. The insights into activation patterns and strategies to exploit them pave the way for future research aimed at enhancing the efficiency of AI systems through architectural innovations.
