
Shortened LLaMA: A Simple Depth Pruning for Large Language Models

(2402.02834)
Published Feb 5, 2024 in cs.LG and cs.CL

Abstract

Structured pruning of modern LLMs has emerged as a way of decreasing their high computational needs. Width pruning reduces the size of projection weight matrices (e.g., by removing attention heads) while maintaining the number of layers. Depth pruning, in contrast, removes entire layers or blocks, while keeping the size of the remaining weights unchanged. Most current research focuses on either width-only or a blend of width and depth pruning, with little comparative analysis between the two units (width vs. depth) concerning their impact on LLM inference efficiency. In this work, we show that a simple depth pruning approach can compete with recent width pruning methods in terms of zero-shot task performance. Our pruning method boosts inference speeds, especially under memory-constrained conditions that require limited batch sizes for running LLMs, where width pruning is ineffective. We hope this work can help deploy LLMs on local and edge devices.

Figure: Comparison between width pruning and depth pruning in terms of matrix size and computational operations.

Overview

  • The paper presents a simple depth pruning strategy for deploying LLMs efficiently on devices with limited memory, such as local and edge devices.

  • Depth pruning, which removes entire Transformer blocks, is shown to improve inference speed without significantly impacting zero-shot task performance (a minimal code sketch of block removal follows this list).

  • The study uses criteria such as Taylor-expansion-based importance scores and perplexity-based analyses to decide which blocks can be removed with the least damage to model quality.

  • The resulting pruned models reduce GPU memory usage and accelerate inference, particularly at the small batch sizes imposed by memory-constrained hardware.
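
A minimal sketch of what block-level depth pruning looks like in practice is shown below, assuming a Hugging Face LLaMA-style checkpoint loaded with the `transformers` library. The block indices are illustrative placeholders, not the blocks actually selected in the paper.

```python
# Sketch: drop entire Transformer blocks from a LLaMA-style model.
# The indices in `blocks_to_drop` are hypothetical, not the paper's selection.
import torch
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf", torch_dtype=torch.float16
)

blocks_to_drop = {21, 22, 25, 27}  # hypothetical "unimportant" block indices

# Keep only the remaining blocks; the weights inside each kept block are
# untouched, which is what distinguishes depth pruning from width pruning.
kept = [
    layer for i, layer in enumerate(model.model.layers) if i not in blocks_to_drop
]
model.model.layers = torch.nn.ModuleList(kept)
model.config.num_hidden_layers = len(kept)

n_params = sum(p.numel() for p in model.parameters()) / 1e9
print(f"Pruned model: {len(kept)} blocks, {n_params:.2f}B parameters")
```

Because entire blocks disappear, every remaining weight matrix keeps its original shape, so the pruned model runs the same per-block kernels as the original model, just fewer of them.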

Abstract

The paper introduces an approach to deploying LLMs on local and edge devices by exploring depth pruning as an alternative to the more widely studied width pruning. While width pruning reduces the network's width by eliminating components such as attention heads or neurons, depth pruning removes entire layers or blocks. The authors propose a simple yet effective depth pruning strategy that boosts inference speed under the memory-constrained conditions common on local or small-scale GPU devices.

Introduction

The prominence of LLMs in achieving state-of-the-art results across varied language tasks is well documented. However, their deployment remains hampered by high computational demands. Increasing the batch size to improve GPU utilization is constrained by the limited memory of lower-specification GPUs, as the rough estimate below illustrates. This work concentrates on structured pruning as a means of making LLMs more accessible, enabling their deployment even on devices with stringent memory constraints.
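
To make the memory argument concrete, the back-of-envelope estimate below shows how quickly the key-value cache alone grows with batch size for a LLaMA-7B-like configuration. The configuration values are generic assumptions for illustration, not numbers reported in the paper.

```python
# Back-of-envelope KV-cache size for a LLaMA-7B-like model (FP16).
# All configuration values are generic assumptions, not from the paper.
def kv_cache_gib(batch, seq_len, n_layers=32, n_heads=32,
                 head_dim=128, bytes_per_elem=2):
    # Keys and values: 2 tensors per layer, each of shape
    # [batch, n_heads, seq_len, head_dim].
    total_bytes = 2 * n_layers * batch * n_heads * seq_len * head_dim * bytes_per_elem
    return total_bytes / 1024**3

for batch in (1, 8, 32):
    print(f"batch={batch:2d}: ~{kv_cache_gib(batch, seq_len=2048):.0f} GiB of KV cache")
```

With roughly 13 GiB of FP16 weights on top of this cache, larger batches quickly exceed the memory of a single consumer GPU; removing Transformer blocks shrinks both the weights and the cache proportionally.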

Methodology

The authors' methodology centers on the efficiency of depth pruning, which they argue is commonly underestimated. Whereas traditional width pruning may fail to improve, or may even worsen, generation speed, depth pruning removes entire Transformer blocks, reducing the number of costly memory operations and matrix computations and thereby improving inference latency. The paper selects blocks to remove using criteria such as Taylor-expansion-based importance scores and perplexity-based analyses, and prioritizes empirical results over theoretical analysis, demonstrating that block-level pruning not only accelerates inference but also maintains zero-shot task performance comparable to that of width-pruned models.
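
A hedged sketch of the perplexity-style criterion is given below: each candidate block is temporarily skipped and the resulting calibration perplexity is recorded, with larger degradation indicating a more important block. Function and variable names are placeholders, and the authors' exact scoring (including the Taylor-expansion variant) may differ in detail.

```python
# Sketch: rank decoder blocks by how much calibration perplexity rises
# when each one is skipped. Approximates a perplexity-based criterion;
# it is not the authors' exact implementation.
import math
import torch


@torch.no_grad()
def calibration_ppl(model, calib_batches):
    """Average perplexity over pre-tokenized calibration batches."""
    losses = []
    for input_ids in calib_batches:  # each: LongTensor of shape [batch, seq_len]
        out = model(input_ids=input_ids, labels=input_ids, use_cache=False)
        losses.append(out.loss.item())
    return math.exp(sum(losses) / len(losses))


def block_importance(model, calib_batches):
    """Score each block by the perplexity increase caused by skipping it."""
    layers = model.model.layers
    base_ppl = calibration_ppl(model, calib_batches)
    scores = {}
    for i in range(len(layers)):
        model.model.layers = torch.nn.ModuleList(
            [layer for j, layer in enumerate(layers) if j != i]
        )                                   # temporarily drop block i
        scores[i] = calibration_ppl(model, calib_batches) - base_ppl
        model.model.layers = layers         # restore the full model
    return scores  # low score => candidate for removal
```

Blocks with the lowest scores are the ones whose removal barely affects the calibration loss, so they are pruned first; a light retraining step can then recover much of the remaining quality gap.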

Results and Discussion

Supported by extensive experiments, the paper demonstrates the advantages of depth pruning. The reported inference efficiencies show considerable speed improvements, especially at the limited batch sizes typical of memory-constrained deployment. The authors also compare the pruned network's generation quality to that of the original model, showing no significant degradation. From a resource-usage perspective, the pruned models reduce GPU memory requirements across various batch sizes.
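
For readers who want to reproduce the efficiency comparison on their own hardware, a simple harness like the one below (generic PyTorch/`transformers` benchmarking code, not the paper's evaluation setup) measures generation throughput and peak GPU memory at a given batch size.

```python
# Sketch: compare generation throughput and peak GPU memory of two models.
# Generic benchmarking code; not the paper's evaluation harness.
import time
import torch


def benchmark(model, tokenizer, prompt, batch_size=1, max_new_tokens=128):
    inputs = tokenizer([prompt] * batch_size, return_tensors="pt").to(model.device)
    torch.cuda.reset_peak_memory_stats()
    torch.cuda.synchronize()
    start = time.perf_counter()
    model.generate(**inputs, max_new_tokens=max_new_tokens, do_sample=False)
    torch.cuda.synchronize()
    latency = time.perf_counter() - start
    tokens_per_s = batch_size * max_new_tokens / latency
    peak_gib = torch.cuda.max_memory_allocated() / 1024**3
    return tokens_per_s, peak_gib


# Usage, assuming `base_model` and `pruned_model` are already loaded on the GPU:
# for name, m in [("base", base_model), ("pruned", pruned_model)]:
#     tps, mem = benchmark(m, tokenizer, "Depth pruning removes whole blocks.")
#     print(f"{name}: {tps:.1f} tok/s, peak {mem:.1f} GiB")
```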

Conclusion

In conclusion, the paper offers a compelling argument for adopting depth pruning in LLMs, especially for deployment in memory-constrained environments. The authors substantiate their claim that even a simple depth pruning approach can deliver substantial improvements in inference speed without compromising zero-shot task performance. They note the need for further investigation into retraining methods and calibration data, highlight the importance of hardware-agnostic model compression, and position their work as a counterpoint to the mainstream focus on width pruning.
