GaLore: Memory-Efficient LLM Training by Gradient Low-Rank Projection (2403.03507v2)
Abstract: Training LLMs presents significant memory challenges, predominantly due to the growing size of weights and optimizer states. Common memory-reduction approaches, such as low-rank adaptation (LoRA), add a trainable low-rank matrix to the frozen pre-trained weight in each layer, reducing trainable parameters and optimizer states. However, such approaches typically underperform training with full-rank weights in both pre-training and fine-tuning stages, since they limit the parameter search to a low-rank subspace and alter the training dynamics; they may also require a full-rank warm start. In this work, we propose Gradient Low-Rank Projection (GaLore), a training strategy that allows full-parameter learning but is more memory-efficient than common low-rank adaptation methods such as LoRA. Our approach reduces memory usage by up to 65.5% in optimizer states while maintaining both efficiency and performance for pre-training LLaMA 1B and 7B architectures on the C4 dataset with up to 19.7B tokens, and for fine-tuning RoBERTa on GLUE tasks. Our 8-bit GaLore further reduces optimizer memory by up to 82.5% and total training memory by 63.3%, compared to a BF16 baseline. Notably, we demonstrate, for the first time, the feasibility of pre-training a 7B model on consumer GPUs with 24GB memory (e.g., NVIDIA RTX 4090) without model parallelism, checkpointing, or offloading strategies.