
Abstract

Autoregressive LLMs (e.g., LLaMa, GPTs) are omnipresent, achieving remarkable success in language understanding and generation. However, such impressive capability typically comes with a substantial model size, which presents significant challenges for autoregressive token-by-token generation. To mitigate the computation overhead incurred during generation, several early-exit and layer-dropping strategies have been proposed. Despite some promising success, owing to the redundancy across LLM layers, on metrics like ROUGE-L/BLEU, our careful knowledge-intensive evaluation unveils issues such as generation collapse, hallucination of wrong facts, and noticeable performance drops even at the trivial exit ratio of 10-15% of layers. We attribute these errors primarily to ineffective handling of the KV cache through state copying during early exit. In this work, we observe the saturation of the computationally expensive feed-forward blocks of LLM layers and propose FFN-SkipLLM, a novel fine-grained skipping strategy for autoregressive LLMs. More specifically, FFN-SkipLLM is an input-adaptive feed-forward skipping strategy that can skip 25-30% of the FFN blocks of an LLM with a marginal change in performance on knowledge-intensive generation tasks, without any requirement to handle the KV cache. Our extensive experiments and ablations across benchmarks such as MT-Bench, Factoid-QA, and variable-length text summarization illustrate how our simple, easy-to-use method can facilitate faster autoregressive decoding.

Figure: Comparison of baseline performances with various skip ratios against FFN-SkipLLM in multi-turn conversations over eight categories.

Overview

  • FFN-SkipLLM introduces an innovative strategy for reducing the computational load of autoregressive LLMs by skipping 25-30% of Feed-Forward Network (FFN) blocks without significantly affecting performance.

  • The method leverages the redundancy observed in FFN blocks, especially in middle layers, and the 'attention sink' phenomenon to enable selective skipping of computations.

  • Extensive testing across various benchmarks showed that FFN-SkipLLM maintains near-full-model performance while outperforming traditional layer-skipping techniques.

  • Future research directions include integrating FFN-SkipLLM with other compression techniques and exploring the limits of skipping ratios without compromising output quality.

FFN-SkipLLM: Adaptive Feed-Forward Skipping Strategy for Enhanced Autoregressive Decoding in LLMs

Introduction

The rapid growth in the capabilities of autoregressive LLMs has been accompanied by mounting deployment challenges stemming from their substantial computational demands. While several strategies focusing on early exits and layer dropping have been proposed to mitigate this, they often encounter limitations such as generation collapse and hallucination, due to ineffective handling of the Key-Value (KV) cache. This paper introduces FFN-SkipLLM, a novel strategy that targets the computationally expensive Feed-Forward Network (FFN) blocks within LLM layers. By allowing fine-grained, input-adaptive skipping of approximately 25-30% of FFN blocks, FFN-SkipLLM incurs only marginal performance changes on knowledge-intensive generation tasks while avoiding the KV cache issues that hamper existing approaches.

Motivation

The observation that motivates this work is two-fold. First, significant redundancy exists in the computation performed by FFN blocks within LLMs, particularly in the middle layers. Second, the "attention sink" phenomenon, whereby attention is disproportionately concentrated on the earliest tokens in a sequence, allows a portion of the model's computation to be bypassed without substantially degrading performance. This approach departs from traditional layer-skipping methodologies by focusing on FFN block skipping, thereby circumventing the complexities of KV cache handling.

FFN-SkipLLM: An Approach to FFN Block Skipping

Preliminaries

Analysis reveals that FFN blocks, which constitute approximately two-thirds of the parameters in a given layer (as demonstrated for LLaMa-7B), exhibit a high degree of computational redundancy. This redundancy is primarily observed in the middle layers of LLMs, with cosine similarity analyses indicating that the tensors before and after FFN blocks undergo minimal change. Consequently, FFN blocks within these "non-cold" regions emerge as prime candidates for skipping, promising substantial computational savings with negligible impact on output quality.
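As an illustration of how such saturation can be measured, the sketch below compares, for each decoder layer, the residual-stream hidden state entering the FFN sub-block with the state leaving the layer, using token-wise cosine similarity. It assumes a Hugging Face LLaMA-style implementation (attribute names such as `model.model.layers` and `post_attention_layernorm`, and a tuple return from each decoder layer); these are assumptions about that code base, not details taken from the paper.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def ffn_saturation_profile(model, input_ids):
    """Per-layer cosine similarity between the hidden state entering the FFN
    sub-block and the hidden state leaving the layer. Values close to 1.0
    suggest a saturated FFN block, i.e., a candidate for skipping."""
    pre_ffn, post_ffn, hooks = {}, {}, []

    for idx, layer in enumerate(model.model.layers):
        # The input of post_attention_layernorm is the residual stream just before the FFN.
        hooks.append(layer.post_attention_layernorm.register_forward_hook(
            lambda m, inp, out, i=idx: pre_ffn.__setitem__(i, inp[0].detach())))
        # The first element of the decoder layer's output is the residual stream after the FFN.
        hooks.append(layer.register_forward_hook(
            lambda m, inp, out, i=idx: post_ffn.__setitem__(i, out[0].detach())))

    model(input_ids)          # single forward pass to populate the caches
    for h in hooks:
        h.remove()

    # Mean token-wise cosine similarity per layer.
    return [F.cosine_similarity(pre_ffn[i], post_ffn[i], dim=-1).mean().item()
            for i in range(len(model.model.layers))]
```

Profiles like this are what make the middle ("non-cold") layers stand out: their similarities sit near 1.0, while the first and last few layers change the representation substantially.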

Methodology

FFN-SkipLLM employs a dynamic strategy that adapts FFN block skipping according to input-specific characteristics. This strategy is detailed in an algorithm that selectively bypasses FFN blocks within non-cold regions based on the cosine similarity between input and output tensors of these blocks. By maintaining the computation in the initial and final layers (cold regions) and employing a warm-up mechanism that temporarily foregoes skipping for the initial tokens, FFN-SkipLLM preserves the integrity of the KV cache and ensures a stable generation process.
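The paper specifies this procedure as an algorithm; as an illustration only, the following sketch shows one plausible realization of the decision logic under the constraints described above: cold layers and a warm-up window of initial tokens always run the full model, and FFN blocks in the non-cold band are bypassed once their measured input/output cosine similarity stays above a threshold. The class name, the moving-average update, and the default threshold are assumptions for this sketch, not the authors' exact algorithm.

```python
import torch.nn.functional as F

class FFNSkipController:
    """Illustrative controller for input-adaptive FFN skipping (a sketch, not
    the authors' exact method)."""

    def __init__(self, num_layers, cold_layers, warmup_tokens=10, threshold=0.99):
        self.cold = set(cold_layers)          # early/late layers: never skipped
        self.warmup_tokens = warmup_tokens    # initial tokens decoded with the full model
        self.threshold = threshold            # saturation level that triggers skipping
        self.seen_tokens = 0
        self.similarity = [None] * num_layers # running per-layer saturation estimate

    def should_skip(self, layer_idx):
        """Decide, before running a layer's FFN, whether to bypass it."""
        if layer_idx in self.cold or self.seen_tokens < self.warmup_tokens:
            return False
        sim = self.similarity[layer_idx]
        return sim is not None and sim >= self.threshold

    def observe(self, layer_idx, h_before, h_after):
        """Update the saturation estimate for an FFN block that did run."""
        sim = F.cosine_similarity(h_before, h_after, dim=-1).mean().item()
        prev = self.similarity[layer_idx]
        # Exponential moving average, initialised on the first observation,
        # keeps the skipping decision adaptive to the current input.
        self.similarity[layer_idx] = sim if prev is None else 0.9 * prev + 0.1 * sim

    def step(self):
        """Call once per generated token."""
        self.seen_tokens += 1

# Hypothetical integration point inside a modified LLaMA-style decoder layer,
# where `hidden` is the residual stream after the attention sub-block:
#
#   if not controller.should_skip(layer_idx):
#       ffn_out = mlp(post_attention_layernorm(hidden))
#       controller.observe(layer_idx, hidden, hidden + ffn_out)
#       hidden = hidden + ffn_out
#   # else: the FFN block is bypassed and `hidden` passes through unchanged;
#   # attention (and hence the KV cache) is always computed as usual.
```

Because attention is always executed, the KV cache grows exactly as in the full model, which is what lets this style of skipping avoid the state-copying problems that affect early-exit and layer-dropping approaches.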

Experimental Evaluation

Extensive experiments across benchmarks such as MT-Bench, Factoid-QA, and variable-length text summarization demonstrate the efficacy of FFN-SkipLLM. Notably, the model can skip a significant portion of FFN blocks while retaining nearly full model performance across a range of knowledge-intensive tasks. This capability starkly contrasts with the performance drops and inaccuracies observed in existing layer-skipping approaches, affirming the potential of FFN-SkipLLM as a more robust and efficient alternative.

Implications and Future Directions

The introduction of FFN-SkipLLM opens up new avenues for enhancing the performance and efficiency of autoregressive LLMs. By sidestepping the challenges associated with KV cache management inherent in layer-skipping strategies, this approach paves the way for more sustainable and accessible deployment of LLMs across various applications. Moving forward, integrating FFN-SkipLLM with other model compression techniques, such as sparsity and quantization, may yield further improvements in computational efficiency. Additionally, addressing the current limitations related to the scaling of skip ratios beyond 35% without performance degradation remains an area ripe for future research.

Conclusion

FFN-SkipLLM represents a significant stride toward mitigating the computational demands of deploying state-of-the-art autoregressive LLMs. By leveraging insights into the redundancy of FFN blocks and the strategic skipping of these components, this approach achieves a delicate balance between computational efficiency and model performance, heralding a new era of more accessible and performant language models.
