
The Unreasonable Ineffectiveness of the Deeper Layers

(2403.17887)
Published Mar 26, 2024 in cs.CL, cs.LG, and stat.ML

Abstract

We empirically study a simple layer-pruning strategy for popular families of open-weight pretrained LLMs, finding minimal degradation of performance on different question-answering benchmarks until after a large fraction (up to half) of the layers are removed. To prune these models, we identify the optimal block of layers to prune by considering similarity across layers; then, to "heal" the damage, we perform a small amount of finetuning. In particular, we use parameter-efficient finetuning (PEFT) methods, specifically quantization and Low Rank Adapters (QLoRA), such that each of our experiments can be performed on a single A100 GPU. From a practical perspective, these results suggest that layer pruning methods can complement other PEFT strategies to further reduce computational resources of finetuning on the one hand, and can improve the memory and latency of inference on the other hand. From a scientific perspective, the robustness of these LLMs to the deletion of layers implies either that current pretraining methods are not properly leveraging the parameters in the deeper layers of the network or that the shallow layers play a critical role in storing knowledge.

Figure: Impact of varying LoRA rank on layer pruning performance, showing accuracy and validation loss trends.

Overview

  • The paper introduces a pruning strategy for large-scale pretrained LLMs that shows significant fractions of deeper layers can be removed with minimal performance degradation, challenging the assumption that these layers are necessary for strong performance.

  • The methodology involves computing angular distances between representations at different layers to identify and prune redundant layers, followed by parameter-efficient fine-tuning using quantization and Low-Rank Adapters (QLoRA).

  • Evaluation on various LLMs across benchmarks shows robust performance even after substantial pruning, highlighting practical implications for model efficiency and theoretical insights into parameter utilization and model architecture.

An Analysis of "The Unreasonable Ineffectiveness of the Deeper Layers"

The paper "The Unreasonable Ineffectiveness of the Deeper Layers" by Andrey Gromov, Kushal Tirumala, Hassan Shapourian, Paolo Glorioso, and Daniel A. Roberts investigates a layer-pruning strategy for large-scale open-weight pretrained LLMs. Their primary contribution is the empirical finding that significant fractions of model layers, particularly the deeper ones, can be pruned with minimal degradation in performance across various question-answering (QA) benchmarks. The implications of their work span both practical efficiency improvements and theoretical insights into the architecture and robustness of modern LLMs.

Summary of Findings

The key finding of this study is that models such as Llama-2-70B can tolerate the removal of up to roughly half of their layers before performance degrades sharply. This robustness is observed across multiple models and benchmarks, challenging the prevailing assumption that the deeper layers of LLMs are critical for maintaining high performance.

Methodology

To prune the models, the authors propose a method in which the angular distance between representations at different layers,

$$
d\left(x^{(\ell)}, x^{(\ell+n)}\right) = \frac{1}{\pi} \arccos\left( \frac{x^{(\ell)}_T \cdot x^{(\ell+n)}_T}{\left\lVert x^{(\ell)}_T \right\rVert \, \left\lVert x^{(\ell+n)}_T \right\rVert} \right),
$$

is computed across the network. Here, $x^{(\ell)}_T$ denotes the activation of the final token $T$ of the input at layer $\ell$, and $n$ is the number of consecutive layers under consideration. The block of $n$ layers whose endpoints are most similar (i.e., with the smallest angular distance) is identified as the most redundant and removed. To mitigate the resulting performance drop, the authors apply parameter-efficient fine-tuning (PEFT), specifically quantization combined with Low-Rank Adapters (QLoRA). This combined strategy allows the researchers to perform significant pruning experiments on a single A100 GPU.
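To make the selection step concrete, the following is a minimal sketch of the layer-similarity scan described above, assuming a Hugging Face decoder-only model that returns per-layer hidden states. The model name, calibration texts, and block size `n` are illustrative placeholders rather than the paper's exact setup.

```python
# A minimal sketch of the layer-similarity scan, assuming a Hugging Face
# decoder-only model that returns per-layer hidden states. The model name,
# calibration texts, and block size n are illustrative placeholders.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "meta-llama/Llama-2-7b-hf"  # placeholder; any decoder-only LM works
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name, torch_dtype=torch.bfloat16, device_map="auto"
)
model.eval()

n = 8  # number of consecutive layers considered for removal (assumption)
texts = ["The quick brown fox jumps over the lazy dog."]  # stand-in for a calibration set

def angular_distance(a: torch.Tensor, b: torch.Tensor) -> torch.Tensor:
    """d(x, y) = arccos( x.y / (||x|| ||y||) ) / pi, along the last dimension."""
    cos = torch.nn.functional.cosine_similarity(a, b, dim=-1).clamp(-1.0, 1.0)
    return torch.arccos(cos) / torch.pi

# Average, over the calibration texts, the angular distance between the
# final-token activation at layer l and the one n layers deeper.
num_layers = model.config.num_hidden_layers
distances = torch.zeros(num_layers - n + 1)
with torch.no_grad():
    for text in texts:
        inputs = tokenizer(text, return_tensors="pt").to(model.device)
        hidden = model(**inputs, output_hidden_states=True).hidden_states
        for l in range(num_layers - n + 1):
            x_l = hidden[l][:, -1, :].float()       # final-token activation entering block l
            x_ln = hidden[l + n][:, -1, :].float()  # same activation n layers deeper
            distances[l] += angular_distance(x_l, x_ln).mean() / len(texts)

start = int(distances.argmin())
print(f"Most redundant block: layers {start}..{start + n - 1} "
      f"(angular distance {distances[start]:.3f})")
```

In practice the calibration set would be a sample of pretraining-like text, and the scan would be repeated for each candidate block size $n$ of interest.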

Evaluation

The effectiveness of this pruning strategy is evaluated on several LLMs, including the Llama-2, Qwen, Mistral, and Phi-2 models, using benchmarks such as MMLU (Massive Multitask Language Understanding) and BoolQ (Boolean Questions). Their experiments reveal:

  1. Performance Robustness: Models retain high performance on QA tasks up to pruning fractions of 20-55%, depending on the model family and size. For instance, Llama-2-70B retains robustness until approximately 50% of its layers are pruned.
  2. Healing Efficacy: After pruning, a small amount of fine-tuning (termed "healing") recovers much of the lost performance. Healing is especially important for the autoregressive (next-token prediction) loss, which otherwise increases sharply; a sketch of this step follows the list below.
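
As a concrete illustration of healing, here is a hedged sketch that drops an identified block of decoder layers and attaches 4-bit QLoRA adapters using the transformers, bitsandbytes, and peft libraries. The layer indices, LoRA rank, and target modules below are assumptions for illustration, not the paper's exact hyperparameters.

```python
# A hedged sketch of the "healing" step: drop the identified block of decoder
# layers, then fine-tune briefly with QLoRA (4-bit quantization + low-rank
# adapters). Layer indices, LoRA rank, and target modules are assumptions.
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model

model_name = "meta-llama/Llama-2-7b-hf"  # placeholder
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)
model = AutoModelForCausalLM.from_pretrained(
    model_name, quantization_config=bnb_config, device_map="auto"
)

# Remove the block selected by the similarity scan (indices assumed here).
start, n = 20, 8
model.model.layers = torch.nn.ModuleList(
    layer for i, layer in enumerate(model.model.layers)
    if not (start <= i < start + n)
)
model.config.num_hidden_layers = len(model.model.layers)

# Attach LoRA adapters; only these small matrices are trained during healing.
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()
# Healing then proceeds as standard next-token-prediction fine-tuning
# (e.g., with transformers.Trainer or an SFT loop) on a small corpus.
```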

Key Insights and Implications

Several theoretical and practical insights can be derived from these findings:

  1. Parameter Utilization: The robustness of LLMs to layer pruning suggests a potential inefficiency in the current utilization of deeper layers. Either current pretraining methods are not optimizing these parameters effectively, or the shallow layers are playing a disproportionately significant role in storing and processing information.
  2. Design of Efficient Models: Understanding that deeper layers can be pruned without severe performance loss opens pathways for designing more compute and memory-efficient models. This could significantly reduce the resource requirements for running large models, making them more accessible for practical applications such as real-time inference on consumer-grade hardware.
  3. Implications for Theoretical Research: By sharpening the picture of which layers matter, these results motivate a deeper investigation into the design and training procedures of LLMs. Whether different tasks require different depths for optimal performance, and how layer-wise similarity metrics can guide further architectural refinements, remain open questions for future research.

Future Directions

The paper concludes by suggesting several directions for future research, such as exploring better layer-pruning and healing strategies, understanding the decoupling of QA performance from next-token prediction loss, and investigating how different pretraining methods and datasets influence the ability to prune. A particularly intriguing direction is examining the effective use of deeper layers, potentially leading to more advanced training paradigms that leverage all model parameters more efficiently.

In summary, this study significantly contributes to the understanding and practical handling of LLMs by demonstrating that substantial layer pruning is feasible and beneficial. This finding not only aids in resource optimization but also prompts a reevaluation of how these models are architecturally and functionally understood.
