LISA: Layerwise Importance Sampling for Memory-Efficient Large Language Model Fine-Tuning (2403.17919v4)

Published 26 Mar 2024 in cs.LG, cs.AI, cs.CL, and math.OC

Abstract: The machine learning community has witnessed impressive advancements since LLMs first appeared. Yet, their massive memory consumption has become a significant roadblock to large-scale training. For instance, a 7B model typically requires at least 60 GB of GPU memory with full parameter training, which presents challenges for researchers without access to high-resource environments. Parameter Efficient Fine-Tuning techniques such as Low-Rank Adaptation (LoRA) have been proposed to alleviate this problem. However, in most large-scale fine-tuning settings, their performance does not reach the level of full parameter training because they confine the parameter search to a low-rank subspace. Attempting to complement this deficiency, we investigate the layerwise properties of LoRA on fine-tuning tasks and observe an unexpected but consistent skewness of weight norms across different layers. Utilizing this key observation, a surprisingly simple training strategy is discovered, which outperforms both LoRA and full parameter training in a wide range of settings with memory costs as low as LoRA. We name it Layerwise Importance Sampled AdamW (LISA), a promising alternative for LoRA, which applies the idea of importance sampling to different layers in LLMs and randomly freezes most middle layers during optimization. Experimental results show that with similar or less GPU memory consumption, LISA surpasses LoRA or even full parameter tuning in downstream fine-tuning tasks, where LISA consistently outperforms LoRA by over 10%-35% in terms of MT-Bench score while achieving on-par or better performance in MMLU, AGIEval and WinoGrande. On large models, specifically LLaMA-2-70B, LISA surpasses LoRA on MT-Bench, GSM8K, and PubMedQA, demonstrating its effectiveness across different domains.


Summary

  • The paper introduces LISA, a novel approach that selectively freezes less critical layers to enhance memory efficiency during LLM fine-tuning.
  • It employs layerwise importance sampling with AdamW to achieve significant memory savings while maintaining or improving model performance compared to LoRA.
  • Empirical results demonstrate up to 36% performance gains on benchmarks like MT-Bench, confirming LISA's effectiveness in balancing memory and accuracy.

LISA: Layerwise Importance Sampling for Memory-Efficient LLM Fine-Tuning

Introduction

The paper "LISA: Layerwise Importance Sampling for Memory-Efficient LLM Fine-Tuning" (2403.17919) introduces a novel approach named Layerwise Importance Sampled AdamW (LISA). The motivation behind LISA is to address the significant memory consumption issues faced by LLMs during fine-tuning. While techniques like Low-Rank Adaptation (LoRA) offer memory savings, they often fall short of achieving performance comparable to full parameter tuning. LISA proposes an alternative that surpasses both LoRA and full parameter tuning in various tasks with similar or reduced memory costs.

Methodology

Observations and Motivation

In exploring LoRA's performance, the paper identifies a skewed distribution of weight norms across layers during training. Specifically, certain layers, such as the bottom and top layers, dominate weight updates, while intermediate layers experience minimal changes (Figure 1).

Figure 1: Layer-wise weight norms during training of GPT2 and LLaMA-2-7B with LoRA and full-parameter training.

This observation drives the key insight behind LISA: not all layers are equally important during updates. It motivates an importance sampling strategy that selectively freezes layers deemed less critical during optimization, emulating LoRA's skewed update pattern without confining updates to a low-rank subspace.
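The skewness can be inspected directly by comparing each layer's weights before and after fine-tuning. The snippet below is a minimal sketch of such a measurement, not the paper's exact protocol: it assumes a LLaMA-style Hugging Face checkpoint whose Transformer blocks appear as `...layers.<idx>...` in parameter names, and it buckets embedding, head, and final-norm parameters under index -1.

```python
import torch

@torch.no_grad()
def layerwise_update_norms(model_before, model_after):
    """Per-layer Frobenius norm of the weight change between two checkpoints."""
    before = dict(model_before.named_parameters())
    sq_norms = {}
    for name, p_after in model_after.named_parameters():
        delta_sq = (p_after.float() - before[name].float()).norm().item() ** 2
        # crude layer-index extraction: parameters outside the block stack
        # (embeddings, lm_head, final norm) are grouped under index -1
        parts = name.split(".")
        idx = int(parts[parts.index("layers") + 1]) if "layers" in parts else -1
        sq_norms[idx] = sq_norms.get(idx, 0.0) + delta_sq
    return {idx: sq ** 0.5 for idx, sq in sorted(sq_norms.items())}
```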

Layerwise Importance Sampled AdamW (LISA)

LISA applies importance sampling at the level of layers: during optimization, only a small set of layers is unfrozen and updated with AdamW, while the rest remain frozen. Consistent with the observed weight-norm skew, the bottom (embedding) and top (head) layers are always kept trainable, and a small number of middle layers is re-sampled at random at fixed intervals. Because frozen layers require no gradients or optimizer states, the total memory footprint aligns closely with LoRA's, while the sampled layers still receive full-rank updates.
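A minimal sketch of this schedule is given below. It is an illustration under assumptions rather than the authors' released code: a LLaMA-style model whose blocks live in `model.model.layers`, a Hugging Face-style forward that returns `.loss` when labels are in the batch, `gamma` middle layers re-drawn uniformly at random every `K` steps, and embedding/head parameters kept trainable throughout; names such as `lisa_set_trainable` and `lisa_finetune` are hypothetical.

```python
import random
from torch.optim import AdamW

def lisa_set_trainable(model, gamma: int):
    """Unfreeze `gamma` randomly chosen Transformer blocks; freeze the rest.

    Embedding, final-norm, and LM-head parameters stay trainable, mirroring
    the observation that bottom and top layers dominate weight updates.
    """
    blocks = model.model.layers                       # LLaMA-style block list (assumed)
    active = set(random.sample(range(len(blocks)), gamma))
    for i, block in enumerate(blocks):
        for p in block.parameters():
            p.requires_grad = i in active
    for name, p in model.named_parameters():
        if ".layers." not in name:                    # everything outside the block stack
            p.requires_grad = True
    return active

def lisa_finetune(model, dataloader, gamma=2, K=50, lr=1e-5, device="cuda"):
    model.to(device).train()
    optimizer = None
    for step, batch in enumerate(dataloader):
        if step % K == 0:                             # re-draw the active layers every K steps
            lisa_set_trainable(model, gamma)
            # AdamW over the currently trainable parameters only, so gradients
            # and optimizer moments exist just for the active layers
            optimizer = AdamW([p for p in model.parameters() if p.requires_grad], lr=lr)
        batch = {k: v.to(device) for k, v in batch.items()}
        loss = model(**batch).loss                    # assumes labels are included in `batch`
        loss.backward()
        optimizer.step()
        optimizer.zero_grad(set_to_none=True)
```

Recreating the AdamW instance at each re-sampling point is a design choice of this sketch: it keeps optimizer states only for the currently active layers, which is what makes the memory footprint comparable to LoRA, at the cost of discarding momentum for layers that rotate out.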

Experimental Results

The empirical analysis demonstrates LISA's efficacy across various LLMs and tasks. Experiments cover moderate-scale fine-tuning datasets such as Alpaca-GPT4, with evaluation on benchmarks including MT-Bench, GSM8K, and PubMedQA.

Memory Efficiency

LISA achieves considerable memory savings, comparable or superior to LoRA. The reduction comes from updating only the sampled layers: frozen layers require neither gradients nor AdamW optimizer states, so most of the per-parameter training overhead disappears.
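As a rough back-of-envelope (my own illustration, not the paper's accounting), the sketch below estimates AdamW training memory under the assumption of 2-byte weights and gradients and fp32 first/second moments, ignoring activations; exact numbers depend on the precision policy and on how many layers LISA keeps active.

```python
def estimate_adamw_memory_gb(n_params_total, n_params_trainable,
                             weight_bytes=2, grad_bytes=2, moment_bytes=8):
    """Rough training-memory estimate in GB, excluding activations.

    Assumes every parameter stores its weight, and every *trainable*
    parameter additionally stores a gradient plus two AdamW moments.
    """
    weights = n_params_total * weight_bytes
    overhead = n_params_trainable * (grad_bytes + moment_bytes)
    return (weights + overhead) / 1e9

# 7B model: full-parameter tuning vs. a LISA-like setting where roughly
# 1/8 of the parameters (a few blocks plus embeddings/head) are trainable
print(estimate_adamw_memory_gb(7e9, 7e9))      # ~84 GB
print(estimate_adamw_memory_gb(7e9, 7e9 / 8))  # ~23 GB
```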

Fine-Tuning Performance

In the evaluation on instruction-following tasks, LISA consistently exceeds LoRA by 8%-36% in MT-Bench scores. On large models like LLaMA-2-70B, it performs on par with or better than LoRA on both domain-specific and general fine-tuning tasks (Figure 2).

Figure 2: Loss curves for LoRA, LISA and full-parameter tuning on the Alpaca-GPT4 dataset across different models.

Ablation Studies

The paper explores the impact of LISA's hyperparameters, namely the number of sampled layers (gamma) and the sampling period (K). It concludes that a higher number of sampled layers and a longer sampling period substantially enhance performance, though proper tuning is critical for maximizing efficiency (Figures 3 and 4).

Figure 3: Comparison of loss curves for the gamma (number of sampling layers) ablation experiment.

Figure 4: Comparison of loss curves for the sampling period K ablation experiment.
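One back-of-envelope way to read the interplay of these two hyperparameters (my own illustration, not an analysis from the paper): if gamma out of L middle layers are re-drawn uniformly at random every K steps, the expected number of steps a given layer spends unfrozen over a run of T steps is T * gamma / L, independent of K. Gamma thus sets each layer's total training budget, while K only controls how often the active set (and its optimizer state) rotates.

```python
def expected_active_steps(total_steps: int, gamma: int, n_middle_layers: int) -> float:
    """Expected optimizer steps a given middle layer spends unfrozen, assuming
    `gamma` of `n_middle_layers` layers are re-sampled uniformly at random at
    fixed intervals. Note the expectation does not depend on the period K."""
    return total_steps * gamma / n_middle_layers

# e.g. a 10,000-step run on a 32-block model
for gamma in (1, 2, 4, 8):
    print(gamma, expected_active_steps(10_000, gamma, 32))
```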

Conclusion

LISA emerges as an effective alternative to existing parameter-efficient fine-tuning strategies, offering substantial benefits in both memory efficiency and task performance. By leveraging layerwise importance sampling, LISA substantially reduces memory consumption while matching or surpassing the performance of full-parameter training and LoRA. Future work could refine the importance sampling strategy to further improve optimization efficiency and explore integration with other advanced memory-reduction techniques.
