
A deeper look at depth pruning of LLMs

(2407.16286)
Published Jul 23, 2024 in cs.LG and cs.AI

Abstract

LLMs are not only resource-intensive to train but even more costly to deploy in production. Therefore, recent work has attempted to prune blocks of LLMs based on cheap proxies for estimating block importance, effectively removing 10% of blocks in well-trained LLaMa-2 and Mistral 7b models without any significant degradation of downstream metrics. In this paper, we explore different block importance metrics by considering adaptive metrics such as Shapley value in addition to the static ones explored in prior work. We show that adaptive metrics exhibit a trade-off in performance between tasks, i.e., improvement on one task may degrade performance on another due to differences in the computed block influences. Furthermore, we extend this analysis from a complete block to individual self-attention and feed-forward layers, highlighting the propensity of the self-attention layers to be more amenable to pruning, even allowing removal of up to 33% of the self-attention layers without incurring any performance degradation on MMLU for Mistral 7b (a significant reduction in costly KV-cache maintenance). Finally, we look at simple performance recovery techniques to emulate the pruned layers by training a lightweight additive bias or low-rank linear adapters. Performance recovery using emulated updates avoids performance degradation for the initial blocks (up to 5% absolute improvement on MMLU), and is either competitive with or superior to learning-based techniques.

Figure: Comparison of block influence metrics for pruning and their effect on MMLU accuracy in LLaMa-2 and Mistral.

Overview

  • The paper by Shoaib Ahmed Siddiqui et al. presents a detailed analysis of depth pruning techniques for LLMs, focusing on reducing resource usage while maintaining performance.

  • Key contributions include evaluating block importance metrics like Shapley values, layer-specific pruning insights showing self-attention layers' resilience, and effective performance recovery methods like emulated updates.

  • Experiments on LLaMa-2 7b and Mistral 7b show how these techniques improve model efficiency, offering both practical and theoretical guidance for deploying LLMs at scale.

An Expert Perspective on Depth Pruning of LLMs

The paper "A deeper look at depth pruning of LLMs" by Shoaib Ahmed Siddiqui et al. presents an in-depth analysis of depth pruning methodologies for LLMs, particularly focusing on minimizing resource consumption without sacrificing model performance. This work builds upon prior research by introducing advanced metrics for block importance and exploring fine-grained pruning strategies within model layers.

Core Contributions

The study is structured around several key contributions:

  1. Evaluation of Block Importance Metrics: The authors critically assess block influence metrics beyond the cosine similarity used in previous work. They introduce adaptive metrics such as the Shapley value and evaluate their efficacy for pruning decisions. The analysis underscores a trade-off inherent to adaptive metrics, where optimizing for one task can inadvertently degrade performance on another (a minimal sketch of both kinds of metric follows this list).

  2. Layer-Specific Pruning: Extending beyond whole-block pruning, the paper dissects transformer blocks into their self-attention and feed-forward layers. The findings indicate that self-attention layers tolerate pruning far better than feed-forward layers: up to 33% of the self-attention layers in the Mistral 7b model can be removed without significant performance degradation on the MMLU benchmark.

  3. Performance Recovery Techniques: To address the performance drop from pruning, the authors propose a simple yet effective technique: emulated updates based on the empirical mean of a pruned block's update. Compared against learned low-rank linear adapters, this straightforward average update achieves competitive, if not superior, results.
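To make the two families of metrics concrete, here is a minimal sketch of a static influence score (token-wise cosine similarity between a block's input and output hidden states) and a Monte Carlo estimate of a Shapley-style adaptive score. The `run_model` callable, the permutation-sampling budget, and the function names are illustrative assumptions for exposition, not the authors' implementation.

```python
import random
import torch
import torch.nn.functional as F

def cosine_block_influence(h_in: torch.Tensor, h_out: torch.Tensor) -> float:
    """Static proxy: 1 - mean cosine similarity between the hidden states
    entering and leaving a block. A block that barely changes the
    representation scores low and becomes a pruning candidate."""
    cos = F.cosine_similarity(h_in, h_out, dim=-1)   # (batch, seq)
    return float(1.0 - cos.mean())

def shapley_block_influence(num_blocks: int, run_model, num_permutations: int = 32):
    """Adaptive metric: Monte Carlo Shapley estimate of each block's
    contribution to a scalar utility (e.g. negative validation loss).

    `run_model(active)` is assumed to evaluate the network with only the
    block indices in `active` enabled (the rest skipped via their residual
    connections) and return the utility on a calibration set.
    """
    values = [0.0] * num_blocks
    for _ in range(num_permutations):
        order = list(range(num_blocks))
        random.shuffle(order)                        # random block ordering
        active = set()
        prev_utility = run_model(active)             # utility with no blocks enabled
        for idx in order:                            # add blocks one at a time
            active.add(idx)
            utility = run_model(active)
            values[idx] += utility - prev_utility    # marginal contribution
            prev_utility = utility
    return [v / num_permutations for v in values]
```

Blocks with the lowest scores under either metric would be the first candidates for removal; the task trade-off noted above arises because the adaptive score depends on which utility (loss, MMLU accuracy, etc.) `run_model` measures.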

Experimental Design and Findings

The experiments are comprehensive, utilizing two notable models, LLaMa-2 7b and Mistral 7b. The evaluation spans multiple metrics and tasks, providing nuanced insights into the impact of depth pruning. Key findings include:

  • Block Influence Metrics: The paper demonstrates that static metrics like cosine similarity offer stable performance across broad tasks. The Shapley value, an adaptive metric, yields a lower model loss but can hurt performance on specific tasks such as MMLU. This suggests potential for task-specific pruning strategies.
  • Layer Pruning: When layers are evaluated individually, self-attention layers can be pruned with minimal impact on overall performance, in contrast to feed-forward layers. This insight matters for efficiency, since self-attention contributes heavily to computational overhead and is what the KV-cache must be maintained for.
  • Performance Recovery: The study highlights that simple techniques like emulated updates, which add the average block update back in place of the pruned block, effectively mitigate performance drops. This approach is on par with, or outperforms, more complex learning-based techniques like low-rank adapters, offering a pragmatic way to maintain accuracy post-pruning (see the sketch after this list).
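As a rough illustration of the recovery idea, the sketch below replaces a pruned residual block with a constant additive bias equal to the empirical mean of that block's update on a small calibration set. The `EmulatedBlock` module, the assumption that each block computes `h + f(h)` and returns a plain tensor, and the calibration loop are simplifications for exposition, not the paper's code.

```python
import torch
from torch import nn

class EmulatedBlock(nn.Module):
    """Stand-in for a pruned residual block: instead of computing h + f(h),
    it adds a fixed bias equal to the empirical mean of f(h), estimated once
    on calibration data (the 'emulated update' recovery idea)."""

    def __init__(self, hidden_dim: int):
        super().__init__()
        self.register_buffer("mean_update", torch.zeros(hidden_dim))

    @torch.no_grad()
    def calibrate(self, block: nn.Module, calib_inputs):
        """`calib_inputs` yields hidden-state tensors of shape (batch, seq, dim)
        captured at this block's position; `block(h)` is assumed to return the
        updated hidden states h + f(h)."""
        total = torch.zeros_like(self.mean_update)
        count = 0
        for h in calib_inputs:
            update = block(h) - h                       # recover the update f(h)
            total += update.reshape(-1, update.shape[-1]).sum(dim=0)
            count += update.shape[0] * update.shape[1]  # tokens seen
        self.mean_update.copy_(total / count)

    def forward(self, h: torch.Tensor) -> torch.Tensor:
        return h + self.mean_update                     # broadcasts over batch and seq
```

A low-rank adapter variant would instead train a small bottleneck (e.g. a linear down-projection to rank r followed by an up-projection) to regress f(h) on the same calibration data; the paper's observation is that the simple mean update is often competitive with such learned recovery. Pruning only the self-attention sub-layer can be emulated analogously, with the added benefit that the removed layer no longer needs a KV-cache.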

Implications and Future Directions

The implications of this research are multifaceted, impacting both theoretical understanding and practical deployment of LLMs. Practically, the insights into block and layer-specific pruning can lead to significant reductions in computational and memory requirements, making LLMs more accessible for deployment at scale.
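To put the memory claim in perspective, a back-of-the-envelope KV-cache calculation is shown below. The Mistral 7b shapes (32 layers, 8 grouped-query KV heads of dimension 128) are public model parameters; the sequence length, batch size, fp16 storage, and the assumption that pruning roughly a third of the self-attention layers removes their caches entirely are illustrative assumptions.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len, batch, bytes_per_elem=2):
    """Memory for keys + values across all retained attention layers."""
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * batch * bytes_per_elem

# Mistral 7b-style configuration with grouped-query attention, fp16 cache.
full = kv_cache_bytes(n_layers=32, n_kv_heads=8, head_dim=128,
                      seq_len=4096, batch=8)
# Dropping roughly a third of the self-attention layers (32 -> 21) removes
# their key/value caches along with their attention compute.
pruned = kv_cache_bytes(n_layers=21, n_kv_heads=8, head_dim=128,
                        seq_len=4096, batch=8)

print(f"full KV-cache:   {full / 2**30:.1f} GiB")   # ~4.0 GiB under these assumptions
print(f"pruned KV-cache: {pruned / 2**30:.1f} GiB") # ~2.6 GiB under these assumptions
```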

Theoretically, the work opens avenues for further exploration of adaptive metrics tailored to specific tasks, potentially leveraging the nuances of Shapley values. The trade-offs identified between different metrics and tasks underline the need for more sophisticated, perhaps hybrid, pruning strategies that can balance performance across varied applications.

Future research may focus on dynamically adjusting model architecture based on real-time performance feedback, further optimizing efficiency without a priori fixed pruning schedules. Additionally, enhancing the robustness of simple performance recovery techniques could provide more reliable fallback mechanisms, ensuring models maintain high utility even with significant structural modifications.

In conclusion, the paper "A deeper look at depth pruning of LLMs" provides a rigorous and detailed examination of pruning strategies, presenting actionable insights and laying the groundwork for future advancements in the efficient deployment of LLMs.
