
LISA: Layerwise Importance Sampling for Memory-Efficient Large Language Model Fine-Tuning

(2403.17919)
Published Mar 26, 2024 in cs.LG, cs.AI, cs.CL, and math.OC

Abstract

The machine learning community has witnessed impressive advancements since the first appearance of LLMs, yet their huge memory consumption has become a major roadblock to large-scale training. Parameter-Efficient Fine-Tuning techniques such as Low-Rank Adaptation (LoRA) have been proposed to alleviate this problem, but their performance still fails to match full parameter training in most large-scale fine-tuning settings. Attempting to remedy this deficiency, we investigate the layerwise properties of LoRA on fine-tuning tasks and observe an unusual skewness of weight norms across different layers. Utilizing this key observation, we discover a surprisingly simple training strategy that outperforms both LoRA and full parameter training in a wide range of settings with memory costs as low as LoRA's. We name it Layerwise Importance Sampled AdamW (LISA), a promising alternative to LoRA, which applies the idea of importance sampling to the layers of LLMs and randomly freezes most middle layers during optimization. Experimental results show that with similar or less GPU memory consumption, LISA surpasses LoRA and even full parameter tuning in downstream fine-tuning tasks, consistently outperforming LoRA by $11\%$-$37\%$ in terms of MT-Bench scores. On large models, specifically LLaMA-2-70B, LISA achieves on-par or better performance than LoRA on MT-Bench, GSM8K, and PubMedQA, demonstrating its effectiveness across different domains.

Figure: Comparison of LoRA, LISA, and full-parameter tuning loss curves on the Alpaca-GPT4 dataset.

Overview

  • LISA introduces a novel Layerwise Importance Sampled AdamW approach for efficient fine-tuning of LLMs, aiming to reduce memory consumption while maintaining or enhancing training performance.

  • This method is inspired by an analysis of weight norm distributions across LLM layers when using LoRA, which revealed that layers vary in importance; LISA capitalizes on this by selectively updating the crucial layers.

  • Experimental results demonstrate LISA's ability to outperform both LoRA and full parameter training in various settings, showing significant improvements in MT-Bench scores and performance on large models.

  • LISA's memory-efficient training strategy marks an advancement in LLM fine-tuning, offering potential for future research and applications in AI by enabling the training of models up to 70B parameters with reduced memory costs.

LISA: A Novel Approach for Efficient Large Language Model Fine-Tuning

Introduction to LISA

The quest for enhancing the efficiency of fine-tuning LLMs has led to the development of Layerwise Importance Sampled AdamW (LISA). This approach targets a significant hurdle in the utilization of LLMs: the excessive memory consumption during large-scale training. While existing Parameter-Efficient Fine-Tuning (PEFT) techniques, notably Low-Rank Adaptation (LoRA), have made strides in addressing this issue, they have not consistently matched full parameter training across all settings. LISA emerges as a strategic alternative, building on an analysis of LoRA's layerwise behavior to optimize memory usage and training performance.

Motivation and Key Observations

The motivation behind LISA stems from an insightful analysis of LoRA's performance across different layers of LLMs. A notable skewness was observed in the weight norms across layers when employing LoRA for fine-tuning tasks. This uneven distribution of weight norms suggests a varied importance of layers in the training process—a foundational observation that inspired the development of LISA. By applying the concept of importance sampling strategically to LLM layers, LISA selectively updates only crucial layers, thereby significantly reducing memory consumption while enhancing or maintaining training effectiveness.
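As an illustration, per-layer skewness of this kind can be inspected directly. The sketch below is not the paper's exact measurement (the paper tracks norms during LoRA fine-tuning); it simply shows one way to read off the mean parameter norm of each transformer block. The checkpoint name and the attribute paths (`model.model.layers`, `model.lm_head`) are assumptions that hold for Hugging Face LLaMA-style models but may differ for other architectures.

```python
import torch
from transformers import AutoModelForCausalLM

# Illustrative checkpoint; any decoder-only model that exposes its
# transformer blocks under `model.model.layers` (LLaMA-style) works.
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf", torch_dtype=torch.float16
)

def mean_weight_norm(module: torch.nn.Module) -> float:
    """Average Frobenius norm over all parameter tensors in a module."""
    norms = [p.detach().float().norm() for p in module.parameters()]
    return torch.stack(norms).mean().item()

# Per the paper's analysis, the embedding and LM-head layers dominate the
# middle blocks in norm during LoRA fine-tuning; here we simply print the
# per-layer norms of a loaded model.
print("embed  :", mean_weight_norm(model.get_input_embeddings()))
for i, block in enumerate(model.model.layers):
    print(f"layer {i:2d}:", mean_weight_norm(block))
print("lm_head:", mean_weight_norm(model.lm_head))
```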

The LISA Algorithm

LISA operates by applying AdamW optimization selectively across layers based on predetermined probabilities, thereby freezing a majority of the middle layers during optimization. This selective updating process is designed to closely emulate LoRA's skewed updating pattern but without the inherent limitations tied to LoRA's low-rank space. Experimental results have bolstered LISA's potential, demonstrating its capability to outperform both LoRA and full parameter training across various settings with lower or similar memory costs.
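A minimal sketch of this selective-update loop is shown below, assuming a Hugging Face LLaMA-style model, uniform sampling of the middle transformer blocks, and always-trainable embedding and LM-head layers. The helper names and hyperparameters (`num_active`, `switch_every`, the learning rate) are illustrative and are not taken from the official implementation.

```python
import random
import torch
from torch.optim import AdamW

def select_active_layers(model, num_active: int):
    """Unfreeze `num_active` randomly chosen transformer blocks; keep the
    embedding and LM-head layers trainable and freeze everything else."""
    blocks = list(model.model.layers)  # LLaMA-style attribute path (assumed)
    active = set(random.sample(range(len(blocks)), num_active))

    for p in model.parameters():
        p.requires_grad = False
    for always_on in (model.get_input_embeddings(), model.lm_head):
        for p in always_on.parameters():
            p.requires_grad = True
    for i in active:
        for p in blocks[i].parameters():
            p.requires_grad = True
    return active

def lisa_finetune(model, dataloader, num_active=2, switch_every=20, lr=1e-5):
    """Minimal LISA-style loop: re-sample the active layers every
    `switch_every` steps and rebuild AdamW over the trainable parameters."""
    optimizer = None
    for step, batch in enumerate(dataloader):
        if step % switch_every == 0:
            select_active_layers(model, num_active)
            optimizer = AdamW(
                (p for p in model.parameters() if p.requires_grad), lr=lr
            )
        loss = model(**batch).loss  # batch is assumed to include labels
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()
```

Rebuilding the optimizer whenever the active set changes is what keeps the AdamW state small: first- and second-moment buffers exist only for the embedding, the head, and the few currently sampled blocks, which is where the memory savings relative to full parameter training come from.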

Experimental Evaluation and Results

Extensive evaluations reveal LISA's impressive performance in fine-tuning tasks for modern LLMs. It consistently outperformed LoRA by 11%-37% in terms of MT-Bench scores and exhibited superior performance on large models such as LLaMA-2-70B across different domains, including instruction following, medical QA, and math problem solving. Furthermore, LISA showed remarkable memory efficiency, enabling the training of models with up to 70B parameters at GPU memory costs similar to or lower than LoRA's.

Implications and Future Directions

The introduction of LISA marks a significant advancement in the field of LLM fine-tuning. Its memory-efficient training strategy offers a practical solution to the challenges associated with large-scale LLM training. The strong numerical results and the ability to surpass existing PEFT techniques underscore LISA's potential as a promising tool for future research and applications involving LLMs. Looking ahead, further exploration of optimized layerwise importance sampling strategies and the extension of LISA to even larger models are promising directions for broadening its utility in AI.

In summary, LISA's innovative approach to layerwise importance sampling represents a notable leap forward in the efficient and effective fine-tuning of LLMs. Its ability to conserve memory while delivering improved performance metrics opens new avenues for research and practical applications of LLMs across various domains.
