
GaLore: Memory-Efficient LLM Training by Gradient Low-Rank Projection (2403.03507v2)

Published 6 Mar 2024 in cs.LG

Abstract: Training LLMs presents significant memory challenges, predominantly due to the growing size of weights and optimizer states. Common memory-reduction approaches, such as low-rank adaptation (LoRA), add a trainable low-rank matrix to the frozen pre-trained weight in each layer, reducing trainable parameters and optimizer states. However, such approaches typically underperform training with full-rank weights in both pre-training and fine-tuning stages since they limit the parameter search to a low-rank subspace and alter the training dynamics, and further, may require full-rank warm start. In this work, we propose Gradient Low-Rank Projection (GaLore), a training strategy that allows full-parameter learning but is more memory-efficient than common low-rank adaptation methods such as LoRA. Our approach reduces memory usage by up to 65.5% in optimizer states while maintaining both efficiency and performance for pre-training on LLaMA 1B and 7B architectures with C4 dataset with up to 19.7B tokens, and on fine-tuning RoBERTa on GLUE tasks. Our 8-bit GaLore further reduces optimizer memory by up to 82.5% and total training memory by 63.3%, compared to a BF16 baseline. Notably, we demonstrate, for the first time, the feasibility of pre-training a 7B model on consumer GPUs with 24GB memory (e.g., NVIDIA RTX 4090) without model parallel, checkpointing, or offloading strategies.

References (34)
  1. Memory Efficient Adaptive Optimization.
  2. Continual Learning in Low-rank Orthogonal Subspaces. In Advances in Neural Information Processing Systems, volume 33, pp.  9900–9911. Curran Associates, Inc., 2020.
  3. Non-Convex Projected Gradient Descent for Generalized Low-Rank Tensor Regression. Journal of Machine Learning Research, 20(5):1–37, 2019. ISSN 1533-7928.
  4. Training Deep Nets with Sublinear Memory Cost, April 2016.
  5. Fast low-rank estimation by projected gradient descent: General statistical and algorithmic guarantees, September 2015.
  6. PaLM: Scaling Language Modeling with Pathways, October 2022.
  7. 8-bit Optimizers via Block-wise Quantization. arXiv:2110.02861 [cs], October 2021.
  8. 8-bit Optimizers via Block-wise Quantization, June 2022.
  9. QLoRA: Efficient Finetuning of Quantized LLMs, May 2023.
  10. Delta Tuning: A Comprehensive Study of Parameter Efficient Methods for Pre-trained Language Models, March 2022.
  11. Gradient Descent Happens in a Tiny Subspace, December 2018.
  12. LoRA: Low-Rank Adaptation of Large Language Models, October 2021.
  13. Exploring Low Rank Training of Deep Neural Networks, September 2022.
  14. Adam: A Method for Stochastic Optimization. arXiv:1412.6980 [cs], December 2014.
  15. How many degrees of freedom do we need to train deep networks: A loss landscape perspective, February 2022.
  16. Gradient-Based Meta-Learning with Learned Layerwise Metric and Subspace, June 2018.
  17. Memory Efficient Optimizers with 4-bit States. https://arxiv.org/abs/2309.01507v3, September 2023.
  18. ReLoRA: High-Rank Training Through Low-Rank Updates, December 2023.
  19. Dynamic Mini-batch SGD for Elastic Distributed Training: Learning in the Limbo of Resources. URL http://arxiv.org/abs/1904.12043.
  20. Decoupled Weight Decay Regularization, January 2019.
  21. Full Parameter Fine-tuning for Large Language Models with Limited Resources, June 2023.
  22. Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer, September 2023.
  23. ZeRO: Memory Optimizations Toward Training Trillion Parameter Models, May 2020.
  24. Tied-LoRA: Enhancing parameter efficiency of LoRA with weight tying, November 2023.
  25. Shazeer, N. GLU Variants Improve Transformer, February 2020.
  26. Adafactor: Adaptive Learning Rates with Sublinear Memory Cost.
  27. S-LoRA: Serving Thousands of Concurrent LoRA Adapters, November 2023.
  28. Understanding self-supervised learning with dual deep networks. arXiv preprint arXiv:2010.00578, 2020.
  29. JoMA: Demystifying Multilayer Transformers via Joint Dynamics of MLP and Attention. ICLR, 2024.
  30. Llama 2: Open Foundation and Fine-Tuned Chat Models, July 2023.
  31. GLUE: A Multi-Task Benchmark and Analysis Platform for Natural Language Understanding, February 2019.
  32. MultiLoRA: Democratizing LoRA for Better Multi-Task Learning, November 2023.
  33. Chain of LoRA: Efficient Fine-tuning of Language Models via Residual Learning, January 2024.
  34. Root Mean Square Layer Normalization, October 2019.

Summary

  • The paper demonstrates that gradient matrices become increasingly low-rank during training, motivating the projection approach for significant memory efficiency.
  • It introduces GaLore, which projects gradients into a low-rank subspace to enable full-parameter learning while sharply reducing memory overhead compared to traditional methods.
  • Experimental results confirm that GaLore maintains competitive performance and makes it feasible to pre-train large LLMs, including a 7B model, on consumer GPUs with 24GB memory.

Memory-Efficient LLM Training with Gradient Low-Rank Projection

Introduction to Gradient Low-Rank Projection (GaLore)

Training LLMs poses significant memory challenges due to the large size of weights and optimizer states. Existing memory-reduction techniques often rely on low-rank adaptation methods such as Low-Rank Adaptation (LoRA), which reparameterizes each layer's weight matrix as the sum of its frozen pre-trained weight and a trainable low-rank matrix. Although these methods reduce the number of trainable parameters and the associated optimizer states, they often underperform full-rank training in both pre-training and fine-tuning, mainly because the low-rank parameterization restricts the parameter search to a low-rank subspace and alters the training dynamics.
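
For concreteness, the standard LoRA reparameterization of a single weight matrix can be written as follows; the dimensions m, n and rank r are generic illustrative notation, not taken from this paper:

```latex
% Standard LoRA reparameterization of one layer (illustrative notation).
W = W_0 + BA, \qquad
W_0 \in \mathbb{R}^{m \times n}\ \text{(frozen)}, \quad
B \in \mathbb{R}^{m \times r}, \quad
A \in \mathbb{R}^{r \times n}, \quad
r \ll \min(m, n)
```

Only $A$ and $B$, i.e. $r(m+n)$ parameters per layer, are trainable, which shrinks the optimizer states correspondingly; the cost is that weight updates are confined to a rank-$r$ subspace of the full parameter space.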

To address these challenges, we introduce Gradient Low-Rank Projection (GaLore), a training strategy designed for both pre-training and fine-tuning LLMs that is more memory-efficient than traditional low-rank methods. Unlike LoRA, which directly imposes a low-rank structure on the model weights, GaLore capitalizes on the inherently low-rank structure of gradient updates during training. This strategy enables full-parameter learning while significantly reducing memory consumption.

Theoretical Insights and Methodology

Our work starts with demonstrating theoretically that the backpropagated gradient matrix becomes increasingly low-rank as training progresses. This insight leads to the core idea of GaLore: projecting gradients into a low-rank subspace before applying optimizer updates. Specifically, for a weight update at time step $t$, GaLore projects the gradient $G_t \in \mathbb{R}^{m \times n}$ through matrices $P_t \in \mathbb{R}^{m \times r}$ and $Q_t \in \mathbb{R}^{n \times r}$, yielding a low-rank gradient form. Consequently, only the low-rank projections of the gradients need to be stored in the optimizer states, resulting in substantial memory savings.
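
As a minimal sketch of this idea, the PyTorch-style step below keeps the Adam moments only for the projected (r x n) gradient and projects the normalized update back before applying it. The function name, the use of a single left projection $P_t$, and the `scale` factor are illustrative assumptions, not the authors' reference implementation.

```python
import torch

def galore_adam_step(W, G, P, exp_avg, exp_avg_sq, step,
                     lr=1e-3, scale=0.25, betas=(0.9, 0.999), eps=1e-8):
    """One illustrative GaLore-style update for a single weight matrix.

    W: (m, n) weight, G: (m, n) gradient, P: (m, r) projection matrix.
    exp_avg, exp_avg_sq: (r, n) Adam moments kept only in the low-rank subspace.
    step: 1-based optimizer step counter (used for bias correction).
    """
    # Project the full gradient into the rank-r subspace: (r, n) instead of (m, n).
    R = P.T @ G

    # Standard Adam moment updates, but on the projected gradient only.
    beta1, beta2 = betas
    exp_avg.mul_(beta1).add_(R, alpha=1 - beta1)
    exp_avg_sq.mul_(beta2).addcmul_(R, R, value=1 - beta2)
    bias_c1 = 1 - beta1 ** step
    bias_c2 = 1 - beta2 ** step
    N = (exp_avg / bias_c1) / ((exp_avg_sq / bias_c2).sqrt() + eps)

    # Project the normalized update back to full rank and apply it to W.
    W.add_(P @ N, alpha=-lr * scale)
```

Because the moments are stored at shape (r, n) rather than (m, n), the optimizer-state memory for this layer shrinks roughly by a factor of m/r, while the weight W itself is still updated at full rank.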

Moreover, we provide a convergence analysis of GaLore under certain gradient update forms, ensuring its effectiveness in both theoretical and practical settings. Importantly, GaLore allows for dynamic adjustment of the projection matrices during training, thus supporting full-parameter learning without increasing memory load.
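
This dynamic adjustment can be realized, for example, by periodically recomputing the projection from an SVD of the current gradient. The sketch below, with an assumed refresh interval `update_gap`, shows one simple way to do this; it is not tied to the paper's exact schedule.

```python
import torch

def maybe_update_projection(G, P, step, rank, update_gap=200):
    """Refresh the rank-r projection every `update_gap` steps from the current gradient.

    Between refreshes the previous P is reused, so the optimizer states keep a
    consistent subspace while the weights themselves remain full-rank.
    """
    if P is None or step % update_gap == 0:
        # Top-r left singular vectors of the gradient span the new subspace.
        U, _, _ = torch.linalg.svd(G, full_matrices=False)
        P = U[:, :rank].contiguous()
    return P
```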

Experimental Results

We thoroughly evaluate GaLore on LLaMA-based models across different sizes, from 60M to 7B parameters, using the C4 dataset for pre-training. Our findings indicate that GaLore closely matches the performance of full-rank training while significantly reducing memory usage, and it outperforms low-rank adaptation methods such as LoRA and ReLoRA. In particular, for a 7B-parameter model, GaLore combined with 8-bit optimizer techniques and layer-wise weight updates is substantially more memory-efficient than full-rank training without sacrificing training effectiveness.
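
The layer-wise weight update trick can be approximated with per-parameter gradient hooks, so each layer is updated and its gradient freed as soon as that gradient is accumulated, rather than holding all gradients until the end of the backward pass. The sketch below is an illustrative setup, assuming PyTorch 2.1+ (for `register_post_accumulate_grad_hook`) and a user-supplied `make_optimizer` factory; it is not the authors' implementation.

```python
import torch

def attach_layerwise_updates(model, make_optimizer):
    """Step each parameter's own optimizer inside backward, then free its gradient.

    make_optimizer(param) should return an optimizer over that single parameter
    (e.g., an 8-bit or projected-gradient optimizer). Requires PyTorch >= 2.1.
    """
    for p in model.parameters():
        if not p.requires_grad:
            continue
        opt = make_optimizer(p)

        def hook(param, opt=opt):
            opt.step()
            param.grad = None  # release this layer's gradient immediately
        p.register_post_accumulate_grad_hook(hook)
```

With this setup, loss.backward() both computes gradients and applies the updates layer by layer, so no global optimizer.step() call is needed and peak gradient memory drops to roughly one layer's worth.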

Notably, the memory savings enabled by 8-bit GaLore make it feasible to pre-train a 7B-parameter model on a consumer GPU such as the NVIDIA RTX 4090 (24GB), demonstrating its practical utility for large-scale LLM training in memory-constrained environments.

Concluding Thoughts and Future Directions

GaLore exemplifies a novel approach to memory-efficient training of LLMs by exploiting the low-rank structure of gradient updates. Its effectiveness in both pre-training and fine-tuning marks a notable advance toward reducing the computational and environmental costs of LLM training. Looking forward, further optimizations of GaLore, including more memory-efficient projection matrices and its application to other model architectures and optimization strategies, present promising avenues for continued research.
