Outlier-weighed Layerwise Sampling for LLM Fine-tuning (2405.18380v3)
Abstract: The rapid advancements in LLMs have revolutionized various natural language processing tasks. However, the substantial size of LLMs presents significant challenges in training or fine-tuning. While parameter-efficient approaches such as low-rank adaptation (LoRA) have gained popularity, they often compromise performance compared to full-rank fine-tuning. In this paper, we propose Outlier-weighed Layerwise Sampling (OWS), a new memory-efficient fine-tuning approach, inspired by the layerwise outlier distribution of LLMs. Unlike LoRA, which adds extra adapters to all layers, OWS strategically assigns higher sampling probabilities to layers with more outliers, selectively sampling only a few layers and fine-tuning their pre-trained weights. To further increase the number of fine-tuned layers without a proportional rise in memory costs, we incorporate gradient low-rank projection, further boosting the approach's performance. Our extensive experiments across various architectures, including LLaMa2 and Mistral, demonstrate that OWS consistently outperforms baseline approaches, including full fine-tuning. Specifically, it achieves up to a 1.1% average accuracy gain on the Commonsense Reasoning benchmark, a 3.0% improvement on MMLU, and a notable 10% boost on MT-Bench, while being more memory efficient. OWS allows us to fine-tune 7B LLMs with only 21GB of memory. Our code is available at https://github.com/pixeli99/OWS.
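The following is a minimal, illustrative sketch of the layerwise sampling idea described in the abstract; it is not the authors' implementation. It assumes the outlier ratio of a layer is the fraction of weight magnitudes, scaled by input activation norms, that exceed M times the layer mean (as in the OWL pruning metric the method is inspired by), and that layers are sampled with probability proportional to that ratio. The threshold `m`, the toy layer shapes, and the helper names are assumptions for illustration only.

```python
# Illustrative sketch of outlier-weighed layerwise sampling (not the paper's code).
# Assumptions: outlier ratio = fraction of |W_ij| * ||X_j|| entries above m * mean
# (OWL-style metric); sampling probability proportional to the outlier ratio.
import torch

def layer_outlier_ratio(weight: torch.Tensor, act_norm: torch.Tensor, m: float = 5.0) -> float:
    """Fraction of |W_ij| * ||X_j|| entries exceeding m times their mean (assumed metric)."""
    scores = weight.abs() * act_norm           # (out, in) scaled column-wise by activation norms
    return (scores > m * scores.mean()).float().mean().item()

def sample_layers(outlier_ratios: torch.Tensor, n_layers_to_tune: int) -> torch.Tensor:
    """Sample layer indices without replacement, proportionally to their outlier ratios."""
    probs = outlier_ratios / outlier_ratios.sum()
    return torch.multinomial(probs, n_layers_to_tune, replacement=False)

# Toy usage: 8 layers with random weights and activation norms.
torch.manual_seed(0)
ratios = torch.tensor([
    layer_outlier_ratio(torch.randn(64, 64), torch.rand(64)) for _ in range(8)
])
active = sample_layers(ratios, n_layers_to_tune=2)
print("layers selected for fine-tuning this period:", active.tolist())
```

In the method described above, the pre-trained weights of the sampled layers are then updated directly while the rest of the model stays frozen, and gradient low-rank projection (as in GaLore) can be applied to those updates to keep memory costs low.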
- Palm 2 technical report. arXiv preprint arXiv:2305.10403, 2023.
- Lora learns less and forgets less. arXiv preprint arXiv:2405.09673, 2024.
- Piqa: Reasoning about physical commonsense in natural language. In Proceedings of the AAAI conference on artificial intelligence, volume 34, pages 7432–7439, 2020.
- Freezeout: Accelerate training by progressively freezing layers. arXiv preprint arXiv:1706.04983, 2017.
- Language models are few-shot learners. Advances in neural information processing systems, 33:1877–1901, 2020.
- Boolq: Exploring the surprising difficulty of natural yes/no questions. arXiv preprint arXiv:1905.10044, 2019.
- Think you have solved question answering? try arc, the ai2 reasoning challenge. arXiv preprint arXiv:1803.05457, 2018.
- LLM.int8(): 8-bit matrix multiplication for transformers at scale. Advances in Neural Information Processing Systems, 35, 2022.
- Qlora: Efficient finetuning of quantized llms. Advances in Neural Information Processing Systems, 36, 2024.
- Black-box prompt learning for pre-trained language models. arXiv preprint arXiv:2201.08531, 2022.
- Warp: Word-level adversarial reprogramming. arXiv preprint arXiv:2101.00121, 2021.
- Towards a unified view of parameter-efficient transfer learning. arXiv preprint arXiv:2110.04366, 2021.
- Measuring massive multitask language understanding. arXiv preprint arXiv:2009.03300, 2020.
- B. M. Hill. A simple general approach to inference about the tail of a distribution. The annals of statistics, pages 1163–1174, 1975.
- Parameter-efficient transfer learning for nlp. In International conference on machine learning, pages 2790–2799. PMLR, 2019.
- Lora: Low-rank adaptation of large language models. arXiv preprint arXiv:2106.09685, 2021.
- Llm-adapters: An adapter family for parameter-efficient fine-tuning of large language models. arXiv preprint arXiv:2304.01933, 2023.
- Mistral 7b. arXiv preprint arXiv:2310.06825, 2023.
- Is chatgpt a good translator? yes with gpt-4 as the engine. arXiv preprint arXiv:2301.08745, 2023.
- T. Kloek and H. K. Van Dijk. Bayesian estimates of equation system parameters: an application of integration by monte carlo. Econometrica: Journal of the Econometric Society, pages 1–19, 1978.
- Chatgpt: Jack of all trades, master of none. Information Fusion, 99:101861, 2023.
- Vera: Vector-based random matrix adaptation. arXiv preprint arXiv:2310.11454, 2023.
- Bert busters: Outlier dimensions that disrupt transformers. arXiv preprint arXiv:2105.06990, 2021.
- The power of scale for parameter-efficient prompt tuning. arXiv preprint arXiv:2104.08691, 2021.
- Smartfrz: An efficient training framework using attention-based layer freezing. arXiv preprint arXiv:2401.16720, 2024.
- X. L. Li and P. Liang. Prefix-tuning: Optimizing continuous prompts for generation. arXiv preprint arXiv:2101.00190, 2021.
- Relora: High-rank training through low-rank updates. In Workshop on Advancing Neural Network Training: Computational Efficiency, Scalability, and Resource Optimization (WANT@ NeurIPS 2023), 2023.
- Stack more layers differently: High-rank training through low-rank updates. arXiv preprint arXiv:2307.05695, 2023.
- Dora: Weight-decomposed low-rank adaptation. arXiv preprint arXiv:2402.09353, 2024.
- Gpt understands, too. arXiv preprint arXiv:2103.10385, 2021.
- Autofreeze: Automatically freezing model blocks to accelerate fine-tuning. arXiv preprint arXiv:2102.01386, 2021.
- Parameter-efficient multi-task fine-tuning for transformers via shared hypernetworks. arXiv preprint arXiv:2106.04489, 2021.
- Traditional and heavy-tailed self regularization in neural network models. arXiv preprint arXiv:1901.08276, 2019.
- Heavy-tailed universality predicts trends in test accuracies for very large pre-trained deep neural networks. In Proceedings of the 2020 SIAM International Conference on Data Mining, pages 505–513. SIAM, 2020.
- Implicit self-regularization in deep neural networks: Evidence from random matrix theory and implications for learning. Journal of Machine Learning Research, 22(165):1–73, 2021.
- Meta. Llama3. https://github.com/meta-llama/llama3, 2024.
- Can a suit of armor conduct electricity? a new dataset for open book question answering. arXiv preprint arXiv:1809.02789, 2018.
- Lisa: Layerwise importance sampling for memory-efficient large language model fine-tuning. arXiv preprint arXiv:2403.17919, 2024.
- Instruction tuning with gpt-4. arXiv preprint arXiv:2304.03277, 2023.
- Outlier dimensions that disrupt transformers are driven by frequency. arXiv preprint arXiv:2205.11380, 2022.
- Tied-lora: Enhancing parameter efficiency of lora with weight tying. arXiv preprint arXiv:2311.09578, 2023.
- Winogrande: An adversarial winograd schema challenge at scale. Communications of the ACM, 64(9):99–106, 2021.
- Socialiqa: Commonsense reasoning about social interactions. arXiv preprint arXiv:1904.09728, 2019.
- S-lora: Serving thousands of concurrent lora adapters. arXiv preprint arXiv:2311.03285, 2023.
- A simple and effective pruning approach for large language models. arXiv preprint arXiv:2306.11695, 2023.
- Use chat gpt to solve programming bugs. International Journal of Information Technology and Computer Engineering, (31):17–22, 2023.
- Is chatgpt the ultimate programming assistant–how far is it? arXiv preprint arXiv:2304.11938, 2023.
- Llama 2: Open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288, 2023.
- Chain of lora: Efficient fine-tuning of language models via residual learning. arXiv preprint arXiv:2401.04151, 2024.
- Smoothquant: Accurate and efficient post-training quantization for large language models. In International Conference on Machine Learning, pages 38087–38099. PMLR, 2023.
- Heavy-tailed regularization of weight matrices in deep neural networks. In International Conference on Artificial Neural Networks, pages 236–247. Springer, 2023.
- Test accuracy vs. generalization gap: Model selection in nlp without accessing training or testing data. In Proceedings of the 29th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, pages 3011–3021, 2023.
- Outlier weighed layerwise sparsity (owl): A missing secret sauce for pruning llms to high sparsity. In International Conference on Machine Learning. PMLR, 2024.
- Hellaswag: Can a machine really finish your sentence? arXiv preprint arXiv:1905.07830, 2019.
- Adaptive budget allocation for parameter-efficient fine-tuning. In The Eleventh International Conference on Learning Representations, 2023.
- Galore: Memory-efficient llm training by gradient low-rank projection. arXiv preprint arXiv:2403.03507, 2024.
- P. Zhao and T. Zhang. Stochastic optimization with importance sampling for regularized loss minimization. In international conference on machine learning, pages 1–9. PMLR, 2015.
- Judging llm-as-a-judge with mt-bench and chatbot arena. Advances in Neural Information Processing Systems, 36, 2024.
- Factual probing is [mask]: Learning vs. learning to recall. arXiv preprint arXiv:2104.05240, 2021.
- Temperature balancing, layer-wise weight analysis, and neural network training. Advances in Neural Information Processing Systems, 36, 2024.