
Shortened LLaMA: A Simple Depth Pruning for Large Language Models

(2402.02834)
Published Feb 5, 2024 in cs.LG and cs.CL

Abstract

Structured pruning of modern LLMs has emerged as a way of decreasing their high computational needs. Width pruning reduces the size of projection weight matrices (e.g., by removing attention heads) while maintaining the number of layers. Depth pruning, in contrast, removes entire layers or blocks, while keeping the size of the remaining weights unchanged. Most current research focuses on either width-only or a blend of width and depth pruning, with little comparative analysis between the two units (width vs. depth) concerning their impact on LLM inference efficiency. In this work, we show that a simple depth pruning approach can compete with recent width pruning methods in terms of zero-shot task performance. Our pruning method boosts inference speeds, especially under memory-constrained conditions that require limited batch sizes for running LLMs, where width pruning is ineffective. We hope this work can help deploy LLMs on local and edge devices.

Figure: Comparison between width pruning and depth pruning in terms of matrix size and computational operations.

Overview

  • The paper presents a simple depth pruning strategy for deploying LLMs efficiently on devices with limited memory, such as local and edge devices.

  • Depth pruning, which removes entire Transformer blocks, is shown to improve inference speed without significantly impacting zero-shot task performance (a minimal code sketch of block removal follows this list).

  • The study uses criteria such as Taylor-expansion-based importance scores and perplexity-based analyses to decide which blocks can be removed with the least damage to model quality.

  • The resulting pruned models reduce GPU memory usage and accelerate inference, particularly at the small batch sizes imposed by memory-constrained hardware.
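
A minimal sketch of what block-level depth pruning looks like in practice is shown below, assuming a Hugging Face LLaMA-style checkpoint loaded with the `transformers` library. The block indices are illustrative placeholders, not the blocks actually selected in the paper.

```python
# Sketch: drop entire Transformer blocks from a LLaMA-style model.
# The indices in `blocks_to_drop` are hypothetical, not the paper's selection.
import torch
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf", torch_dtype=torch.float16
)

blocks_to_drop = {21, 22, 25, 27}  # hypothetical "unimportant" block indices

# Keep only the remaining blocks; the weights inside each kept block are
# untouched, which is what distinguishes depth pruning from width pruning.
kept = [
    layer for i, layer in enumerate(model.model.layers) if i not in blocks_to_drop
]
model.model.layers = torch.nn.ModuleList(kept)
model.config.num_hidden_layers = len(kept)

n_params = sum(p.numel() for p in model.parameters()) / 1e9
print(f"Pruned model: {len(kept)} blocks, {n_params:.2f}B parameters")
```

Because entire blocks disappear, every remaining weight matrix keeps its original shape, so the pruned model runs the same per-block kernels as the original model, just fewer of them.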

Abstract

The paper introduces an approach to deploying LLMs on local and edge devices by exploring depth pruning as an alternative to the more widely studied width pruning. While width pruning reduces the network's width by eliminating components such as attention heads or neurons, depth pruning removes entire layers or blocks. The authors propose a simple yet effective depth pruning strategy that boosts inference speed under the memory-constrained conditions common on local or small-scale GPU devices.

Introduction

The prominence of LLMs in achieving state-of-the-art results across varied language tasks is well documented. However, their deployment remains hampered by high computational demands. Increasing the batch size to improve GPU utilization is constrained by the limited memory of lower-specification GPUs, as the rough estimate below illustrates. This work concentrates on structured pruning as a means of making LLMs more accessible, enabling their deployment even on devices with stringent memory constraints.
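
To make the memory argument concrete, the back-of-envelope estimate below shows how quickly the key-value cache alone grows with batch size for a LLaMA-7B-like configuration. The configuration values are generic assumptions for illustration, not numbers reported in the paper.

```python
# Back-of-envelope KV-cache size for a LLaMA-7B-like model (FP16).
# All configuration values are generic assumptions, not from the paper.
def kv_cache_gib(batch, seq_len, n_layers=32, n_heads=32,
                 head_dim=128, bytes_per_elem=2):
    # Keys and values: 2 tensors per layer, each of shape
    # [batch, n_heads, seq_len, head_dim].
    total_bytes = 2 * n_layers * batch * n_heads * seq_len * head_dim * bytes_per_elem
    return total_bytes / 1024**3

for batch in (1, 8, 32):
    print(f"batch={batch:2d}: ~{kv_cache_gib(batch, seq_len=2048):.0f} GiB of KV cache")
```

With roughly 13 GiB of FP16 weights on top of this cache, larger batches quickly exceed the memory of a single consumer GPU; removing Transformer blocks shrinks both the weights and the cache proportionally.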

Methodology

The authors' methodology centers on the efficiency of depth pruning, which they argue is commonly underestimated. Whereas traditional width pruning may fail to improve, or may even worsen, generation speed, depth pruning removes entire Transformer blocks, reducing the number of costly memory operations and matrix computations and thereby improving inference latency. The paper selects blocks to remove using criteria such as Taylor-expansion-based importance scores and perplexity-based analyses, and prioritizes empirical results over theoretical analysis, demonstrating that block-level pruning not only accelerates inference but also maintains zero-shot task performance comparable to that of width-pruned models.
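
A hedged sketch of the perplexity-style criterion is given below: each candidate block is temporarily skipped and the resulting calibration perplexity is recorded, with larger degradation indicating a more important block. Function and variable names are placeholders, and the authors' exact scoring (including the Taylor-expansion variant) may differ in detail.

```python
# Sketch: rank decoder blocks by how much calibration perplexity rises
# when each one is skipped. Approximates a perplexity-based criterion;
# it is not the authors' exact implementation.
import math
import torch


@torch.no_grad()
def calibration_ppl(model, calib_batches):
    """Average perplexity over pre-tokenized calibration batches."""
    losses = []
    for input_ids in calib_batches:  # each: LongTensor of shape [batch, seq_len]
        out = model(input_ids=input_ids, labels=input_ids, use_cache=False)
        losses.append(out.loss.item())
    return math.exp(sum(losses) / len(losses))


def block_importance(model, calib_batches):
    """Score each block by the perplexity increase caused by skipping it."""
    layers = model.model.layers
    base_ppl = calibration_ppl(model, calib_batches)
    scores = {}
    for i in range(len(layers)):
        model.model.layers = torch.nn.ModuleList(
            [layer for j, layer in enumerate(layers) if j != i]
        )                                   # temporarily drop block i
        scores[i] = calibration_ppl(model, calib_batches) - base_ppl
        model.model.layers = layers         # restore the full model
    return scores  # low score => candidate for removal
```

Blocks with the lowest scores are the ones whose removal barely affects the calibration loss, so they are pruned first; a light retraining step can then recover much of the remaining quality gap.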

Results and Discussion

Supported by extensive experiments, the paper demonstrates the advantages of depth pruning. The reported inference efficiencies show considerable speed improvements, especially at the limited batch sizes typical of memory-constrained deployment. The authors also compare the pruned network's generation quality to that of the original model, showing no significant degradation. From a resource-usage perspective, the pruned models reduce GPU memory requirements across various batch sizes.
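
For readers who want to reproduce the efficiency comparison on their own hardware, a simple harness like the one below (generic PyTorch/`transformers` benchmarking code, not the paper's evaluation setup) measures generation throughput and peak GPU memory at a given batch size.

```python
# Sketch: compare generation throughput and peak GPU memory of two models.
# Generic benchmarking code; not the paper's evaluation harness.
import time
import torch


def benchmark(model, tokenizer, prompt, batch_size=1, max_new_tokens=128):
    inputs = tokenizer([prompt] * batch_size, return_tensors="pt").to(model.device)
    torch.cuda.reset_peak_memory_stats()
    torch.cuda.synchronize()
    start = time.perf_counter()
    model.generate(**inputs, max_new_tokens=max_new_tokens, do_sample=False)
    torch.cuda.synchronize()
    latency = time.perf_counter() - start
    tokens_per_s = batch_size * max_new_tokens / latency
    peak_gib = torch.cuda.max_memory_allocated() / 1024**3
    return tokens_per_s, peak_gib


# Usage, assuming `base_model` and `pruned_model` are already loaded on the GPU:
# for name, m in [("base", base_model), ("pruned", pruned_model)]:
#     tps, mem = benchmark(m, tokenizer, "Depth pruning removes whole blocks.")
#     print(f"{name}: {tps:.1f} tok/s, peak {mem:.1f} GiB")
```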

Conclusion

In conclusion, the paper offers a compelling argument for adopting depth pruning in LLMs, especially for deployment in memory-constrained environments. The authors substantiate their claim that even a simple depth pruning approach can deliver substantial improvements in inference speed without compromising zero-shot task performance. They note the need for further investigation into retraining methods and calibration data, highlight the importance of hardware-agnostic model compression, and position their work as a counterpoint to the mainstream focus on width pruning.
