Abstract

The rapid advancements in LLMs have revolutionized various natural language processing tasks. However, the substantial size of LLMs presents significant challenges in training or fine-tuning. While parameter-efficient approaches such as low-rank adaptation (LoRA) have gained popularity, they often compromise performance compared to full-rank fine-tuning. In this paper, we propose Outlier-weighed Layerwise Sampled Low-Rank Projection (OwLore), a new memory-efficient fine-tuning approach inspired by the layerwise outlier distribution of LLMs, which dynamically samples pre-trained layers to fine-tune instead of adding extra adapters. We first interpret the outlier phenomenon through the lens of Heavy-Tailed Self-Regularization (HT-SR) theory, discovering that layers with more outliers tend to be more heavy-tailed and consequently better trained. Inspired by this finding, OwLore strategically assigns higher sampling probabilities to layers with more outliers to better leverage the knowledge stored in pre-trained LLMs. To further mitigate the memory demands of fine-tuning, we integrate gradient low-rank projection into our approach, which allows each layer to be trained efficiently in a low-rank manner. By combining the efficiency of low-rank training with outlier-weighed layerwise sampling, OwLore significantly improves the memory-performance trade-off in LLM fine-tuning. Our extensive experiments across various architectures, including LLaMa2, LLaMa3, and Mistral, demonstrate that OwLore consistently outperforms baseline approaches, including full fine-tuning. Specifically, it achieves up to a 1.1% average accuracy gain on the Commonsense Reasoning benchmark, a 3.0% improvement on MMLU, and a notable 10% boost on MT-Bench, while being more memory efficient. OwLore allows us to fine-tune LLaMa2-7B with only 21GB of memory.

Figure: Fine-tuning LLaMa2-7B with OwLore on the GSM-8K dataset with different numbers of sampled layers.

Overview

  • The paper introduces OwLore, a memory-efficient fine-tuning method for LLMs that leverages Heavy-Tailed Self-Regularization (HT-SR) theory for layerwise outlier sampling and low-rank gradient projection.

  • Key contributions include the development of an outlier-weighed sampling strategy and a low-rank gradient update method, which improve training efficiency and reduce memory usage without sacrificing performance.

  • Experimental results show OwLore's superior performance on benchmarks such as Commonsense Reasoning, MT-Bench, and MMLU, demonstrating its robustness and efficiency across architectures including LLaMa2, LLaMa3, and Mistral.

A Comprehensive Analysis of Outlier-weighed Layerwise Sampled Low-Rank Projection (OwLore) for Large Language Model Fine-tuning

The substantial capabilities of LLMs have propelled advances across NLP tasks. However, the sheer size of these models poses considerable challenges for training and fine-tuning, particularly in terms of memory. This paper introduces Outlier-weighed Layerwise Sampled Low-Rank Projection (OwLore), a novel, memory-efficient fine-tuning approach that uses insights from Heavy-Tailed Self-Regularization (HT-SR) theory and the layerwise outlier distribution to decide which layers to sample, and then trains the sampled layers in a low-rank fashion.

Key Contributions

This work makes several important contributions to the field of LLM fine-tuning:

  1. Outlier Distribution and Heavy-Tailed Self-Regularization Theory: The authors interpret the layerwise outlier distribution of LLMs through the lens of HT-SR theory, revealing that layers with more outliers exhibit a more heavy-tailed empirical spectral density (ESD) and are therefore better trained. This observation forms the basis for the layerwise sampling strategy (an illustrative sketch of the ESD diagnostic follows this list).

  2. Outlier-weighed Sampling: Inspired by the non-uniform distribution of outliers, OwLore's sampling strategy assigns higher probabilities to layers with more outliers. This principle efficiently utilizes the well-trained layers in pre-trained LLMs, improving the performance of sampling-based fine-tuning methods.

  3. Gradient Low-Rank Projection: To address the memory demands of full-rank training, OwLore integrates gradient low-rank projection. This allows each layer to be efficiently trained within a low-rank subspace, thus mitigating memory costs without compromising performance.
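
To make the HT-SR diagnostic concrete, here is a minimal sketch, assuming a plain Hill estimator over the top-k eigenvalues of W^T W; the function name, the choice of k, and the estimator itself are illustrative assumptions, not the authors' exact fitting procedure.

```python
import torch

def esd_heavy_tail_alpha(weight: torch.Tensor, k: int = 50) -> float:
    """Estimate a heavy-tail exponent for one layer's empirical spectral density (ESD).

    The ESD is the set of eigenvalues of W^T W, i.e. the squared singular values
    of W. Under HT-SR theory, a smaller exponent means a heavier tail, which the
    paper associates with better-trained (more outlier-rich) layers. The Hill
    estimator over the top-k eigenvalues is an illustrative choice only.
    """
    eigs = torch.linalg.svdvals(weight.float()) ** 2   # eigenvalues of W^T W
    eigs, _ = torch.sort(eigs, descending=True)
    top = eigs[:k]
    # Hill estimator: alpha = 1 + k / sum_i log(lambda_i / lambda_k)
    return 1.0 + k / torch.log(top / top[-1]).sum().item()
```

Running such a diagnostic over every transformer layer yields a per-layer heavy-tailedness score that can be compared against that layer's outlier count.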

Methodology

OwLore innovates by combining two primary strategies: outlier-weighed sampling and low-rank gradient updates.

  • Outlier-weighed Sampling: The authors compute the Layerwise Outlier Distribution (LOD) and allocate sampling probabilities proportional to the density of outliers in each layer. This creates a "rich-get-richer" dynamic in which well-trained layers are sampled and fine-tuned more frequently (see the first sketch after this list).

  • Low-Rank Gradient Updates: Adopting the GaLore mechanism, OwLore projects gradients into a low-rank subspace, significantly reducing memory overhead. Optimizer states are maintained within this subspace, and the gradient subspace is refreshed periodically to track the changing training dynamics (see the second sketch after this list).
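
As a minimal sketch of the sampling step, assuming per-layer outlier scores are already available (the paper's exact LOD metric is not reproduced here), the snippet below converts those scores into sampling probabilities and draws the subset of layers to unfreeze at a given iteration. The `sample_layers` helper and the purely proportional allocation are illustrative choices.

```python
import torch

def sample_layers(lod: torch.Tensor, n_sampled: int, generator=None) -> list[int]:
    """Sample layer indices with probability proportional to their outlier score.

    lod       : 1-D tensor, one non-negative outlier score per transformer layer
                (the Layerwise Outlier Distribution).
    n_sampled : number of layers to unfreeze and fine-tune this iteration.
    """
    probs = lod / lod.sum()                      # proportional allocation
    idx = torch.multinomial(probs, n_sampled,
                            replacement=False,   # do not pick the same layer twice
                            generator=generator)
    return idx.tolist()

# Example: 8 layers; outlier-heavy layers (larger scores) are favored.
lod = torch.tensor([0.5, 2.0, 1.0, 3.5, 0.2, 1.8, 0.9, 2.6])
active = sample_layers(lod, n_sampled=3)
# Freeze all layers, then mark the sampled ones trainable before the update step.
```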
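The low-rank update can be sketched as follows: the gradient of a single weight matrix is projected onto a rank-r subspace spanned by its top left singular vectors, and the projector is refreshed every `update_gap` steps. This is a simplified, single-matrix illustration of the GaLore-style mechanism described above, not the library's implementation; the plain SGD-style step stands in for the actual optimizer, whose states would live in the r-dimensional subspace.

```python
import torch

class LowRankGradProjector:
    """Project gradients of one weight matrix (m x n) into a rank-r subspace."""

    def __init__(self, rank: int, update_gap: int = 200):
        self.rank, self.update_gap = rank, update_gap
        self.P = None       # m x r orthonormal projector
        self.step = 0

    def project(self, grad: torch.Tensor) -> torch.Tensor:
        # Periodically refresh the subspace from the current gradient's top-r
        # left singular vectors, so it tracks the changing training dynamics.
        if self.P is None or self.step % self.update_gap == 0:
            U, _, _ = torch.linalg.svd(grad.float(), full_matrices=False)
            self.P = U[:, : self.rank]           # m x r
        self.step += 1
        return self.P.T @ grad                   # r x n low-rank gradient

    def project_back(self, low_rank_update: torch.Tensor) -> torch.Tensor:
        return self.P @ low_rank_update          # back to m x n for the weight update

# Usage with a stand-in SGD step (optimizer states would be kept at rank r):
proj = LowRankGradProjector(rank=8)
W = torch.randn(64, 64, requires_grad=True)
loss = (W @ torch.randn(64)).pow(2).sum()
loss.backward()
with torch.no_grad():
    W -= 1e-3 * proj.project_back(proj.project(W.grad))
```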

Experimental Results

The empirical evaluation of OwLore demonstrates its robustness and efficiency across multiple architectures and benchmarks, including LLaMa2, LLaMa3, and Mistral. Noteworthy results include:

  • Commonsense Reasoning: OwLore achieves up to a 1.1% average accuracy gain on the Commonsense Reasoning benchmark and consistently outperforms other fine-tuning approaches, including full fine-tuning.

  • MT-Bench: OwLore records a 10% improvement in the MT-Bench evaluation, particularly excelling in multi-turn question-answering and instruction-following tasks.

  • MMLU: OwLore achieves a 3.0% improvement on the MMLU benchmark, highlighting its robustness across diverse knowledge domains.

Additionally, OwLore allows fine-tuning LLaMa2-7B with only 21GB of memory, significantly lower than other methods.

Implications and Future Work

The introduction of OwLore advances the field of LLM fine-tuning by offering a method that balances performance and memory efficiency. Theoretically, it builds on HT-SR theory to provide a principled approach to layerwise sampling. Practically, its memory-efficient design makes it suitable for deploying large-scale language models in resource-constrained environments.

Future developments could explore further optimization of the low-rank subspace updating mechanisms and their impacts on training dynamics. Additionally, extending OwLore's principles to other domains such as computer vision and multi-modal models could prove beneficial, given the increasing prevalence of large, multi-task models in these fields.

In summary, OwLore represents a significant step forward in parameter-efficient fine-tuning of LLMs, setting a new benchmark in both memory usage and model performance. The insights derived from its development offer a fertile ground for future research aiming to optimize the fine-tuning process of large-scale neural networks.
