
The Unreasonable Ineffectiveness of the Deeper Layers

(2403.17887)
Published Mar 26, 2024 in cs.CL, cs.LG, and stat.ML

Abstract

We empirically study a simple layer-pruning strategy for popular families of open-weight pretrained LLMs, finding minimal degradation of performance on different question-answering benchmarks until after a large fraction (up to half) of the layers are removed. To prune these models, we identify the optimal block of layers to prune by considering similarity across layers; then, to "heal" the damage, we perform a small amount of finetuning. In particular, we use parameter-efficient finetuning (PEFT) methods, specifically quantization and Low Rank Adapters (QLoRA), such that each of our experiments can be performed on a single A100 GPU. From a practical perspective, these results suggest that layer pruning methods can complement other PEFT strategies to further reduce computational resources of finetuning on the one hand, and can improve the memory and latency of inference on the other hand. From a scientific perspective, the robustness of these LLMs to the deletion of layers implies either that current pretraining methods are not properly leveraging the parameters in the deeper layers of the network or that the shallow layers play a critical role in storing knowledge.

Figure: Impact of varying LoRA rank on layer pruning performance, showing accuracy and validation loss trends.

Overview

  • The paper introduces a pruning strategy for large-scale pretrained LLMs that shows significant fractions of deeper layers can be removed with minimal performance degradation, challenging the assumption that these layers are necessary for strong performance.

  • The methodology involves computing angular distances between representations at different layers to identify and prune redundant layers, followed by parameter-efficient fine-tuning using quantization and Low-Rank Adapters (QLoRA).

  • Evaluation on various LLMs across benchmarks shows robust performance even after substantial pruning, highlighting practical implications for model efficiency and theoretical insights into parameter utilization and model architecture.

An Analysis of "The Unreasonable Ineffectiveness of the Deeper Layers"

The paper "The Unreasonable Ineffectiveness of the Deeper Layers" by Andrey Gromov, Kushal Tirumala, Hassan Shapourian, Paolo Glorioso, and Daniel A. Roberts investigates a layer-pruning strategy for large-scale open-weight pretrained LLMs. Their primary contribution is the empirical finding that significant fractions of model layers, particularly the deeper ones, can be pruned with minimal degradation in performance across various question-answering (QA) benchmarks. The implications of their work span both practical efficiency improvements and theoretical insights into the architecture and robustness of modern LLMs.

Summary of Findings

The key finding of this study is that models such as Llama-2-70B can tolerate the removal of up to roughly half of their layers before performance degrades sharply. This robustness is observed across multiple models and benchmarks, challenging the prevailing assumption that the deeper layers of LLMs are critical for maintaining high performance.

Methodology

To prune the models, the authors propose a method in which the angular distance between representations at different layers,

$$
d\left(x^{(\ell)}, x^{(\ell+n)}\right) = \frac{1}{\pi} \arccos\left( \frac{x^{(\ell)}_T \cdot x^{(\ell+n)}_T}{\left\lVert x^{(\ell)}_T \right\rVert \, \left\lVert x^{(\ell+n)}_T \right\rVert} \right),
$$

is computed across the network. Here, $x^{(\ell)}_T$ denotes the activation of the final token $T$ of the input at layer $\ell$, and $n$ is the number of consecutive layers under consideration. The block of $n$ layers whose endpoints are most similar (i.e., with the smallest angular distance) is identified as the most redundant and removed. To mitigate the resulting performance drop, the authors apply parameter-efficient fine-tuning (PEFT), specifically quantization combined with Low-Rank Adapters (QLoRA). This combined strategy allows the researchers to perform significant pruning experiments on a single A100 GPU.
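To make the selection step concrete, the following is a minimal sketch of the layer-similarity scan described above, assuming a Hugging Face decoder-only model that returns per-layer hidden states. The model name, calibration texts, and block size `n` are illustrative placeholders rather than the paper's exact setup.

```python
# A minimal sketch of the layer-similarity scan, assuming a Hugging Face
# decoder-only model that returns per-layer hidden states. The model name,
# calibration texts, and block size n are illustrative placeholders.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "meta-llama/Llama-2-7b-hf"  # placeholder; any decoder-only LM works
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name, torch_dtype=torch.bfloat16, device_map="auto"
)
model.eval()

n = 8  # number of consecutive layers considered for removal (assumption)
texts = ["The quick brown fox jumps over the lazy dog."]  # stand-in for a calibration set

def angular_distance(a: torch.Tensor, b: torch.Tensor) -> torch.Tensor:
    """d(x, y) = arccos( x.y / (||x|| ||y||) ) / pi, along the last dimension."""
    cos = torch.nn.functional.cosine_similarity(a, b, dim=-1).clamp(-1.0, 1.0)
    return torch.arccos(cos) / torch.pi

# Average, over the calibration texts, the angular distance between the
# final-token activation at layer l and the one n layers deeper.
num_layers = model.config.num_hidden_layers
distances = torch.zeros(num_layers - n + 1)
with torch.no_grad():
    for text in texts:
        inputs = tokenizer(text, return_tensors="pt").to(model.device)
        hidden = model(**inputs, output_hidden_states=True).hidden_states
        for l in range(num_layers - n + 1):
            x_l = hidden[l][:, -1, :].float()       # final-token activation entering block l
            x_ln = hidden[l + n][:, -1, :].float()  # same activation n layers deeper
            distances[l] += angular_distance(x_l, x_ln).mean() / len(texts)

start = int(distances.argmin())
print(f"Most redundant block: layers {start}..{start + n - 1} "
      f"(angular distance {distances[start]:.3f})")
```

In practice the calibration set would be a sample of pretraining-like text, and the scan would be repeated for each candidate block size $n$ of interest.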

Evaluation

The effectiveness of this pruning strategy is evaluated on several LLMs, including the Llama-2, Qwen, Mistral, and Phi-2 models, using benchmarks such as MMLU (Massive Multitask Language Understanding) and BoolQ (Boolean Questions). Their experiments reveal:

  1. Performance Robustness: Models retain high performance on QA tasks up to pruning fractions of 20-55%, depending on the model family and size. For instance, Llama-2-70B retains robustness until approximately 50% of its layers are pruned.
  2. Healing Efficacy: After pruning, a small amount of fine-tuning (termed "healing") recovers much of the lost performance. Healing is especially important for the autoregressive (next-token prediction) loss, which otherwise increases sharply; a sketch of this step follows the list below.
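
As a concrete illustration of healing, here is a hedged sketch that drops an identified block of decoder layers and attaches 4-bit QLoRA adapters using the transformers, bitsandbytes, and peft libraries. The layer indices, LoRA rank, and target modules below are assumptions for illustration, not the paper's exact hyperparameters.

```python
# A hedged sketch of the "healing" step: drop the identified block of decoder
# layers, then fine-tune briefly with QLoRA (4-bit quantization + low-rank
# adapters). Layer indices, LoRA rank, and target modules are assumptions.
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model

model_name = "meta-llama/Llama-2-7b-hf"  # placeholder
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)
model = AutoModelForCausalLM.from_pretrained(
    model_name, quantization_config=bnb_config, device_map="auto"
)

# Remove the block selected by the similarity scan (indices assumed here).
start, n = 20, 8
model.model.layers = torch.nn.ModuleList(
    layer for i, layer in enumerate(model.model.layers)
    if not (start <= i < start + n)
)
model.config.num_hidden_layers = len(model.model.layers)

# Attach LoRA adapters; only these small matrices are trained during healing.
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()
# Healing then proceeds as standard next-token-prediction fine-tuning
# (e.g., with transformers.Trainer or an SFT loop) on a small corpus.
```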

Key Insights and Implications

Several theoretical and practical insights can be derived from these findings:

  1. Parameter Utilization: The robustness of LLMs to layer pruning suggests a potential inefficiency in the current utilization of deeper layers. Either current pretraining methods are not optimizing these parameters effectively, or the shallow layers are playing a disproportionately significant role in storing and processing information.
  2. Design of Efficient Models: Understanding that deeper layers can be pruned without severe performance loss opens pathways for designing more compute and memory-efficient models. This could significantly reduce the resource requirements for running large models, making them more accessible for practical applications such as real-time inference on consumer-grade hardware.
  3. Implications for Theoretical Research: By sharpening the picture of which layers matter, these results motivate a deeper investigation into the design and training procedures of LLMs. Whether different tasks require different depths for optimal performance, and how layer-wise similarity metrics can guide further architectural refinements, remain open questions for future research.

Future Directions

The paper concludes by suggesting several directions for future research, such as exploring better layer-pruning and healing strategies, understanding the decoupling of QA performance from next-token prediction loss, and investigating how different pretraining methods and datasets influence the ability to prune. A particularly intriguing direction is examining the effective use of deeper layers, potentially leading to more advanced training paradigms that leverage all model parameters more efficiently.

In summary, this study significantly contributes to the understanding and practical handling of LLMs by demonstrating that substantial layer pruning is feasible and beneficial. This finding not only aids in resource optimization but also prompts a reevaluation of how these models are architecturally and functionally understood.
