GaLore: Memory-Efficient LLM Training by Gradient Low-Rank Projection (2403.03507v2)
Abstract: Training LLMs presents significant memory challenges, predominantly due to the growing size of weights and optimizer states. Common memory-reduction approaches, such as low-rank adaptation (LoRA), add a trainable low-rank matrix to the frozen pre-trained weight in each layer, reducing trainable parameters and optimizer states. However, such approaches typically underperform training with full-rank weights in both the pre-training and fine-tuning stages, since they limit the parameter search to a low-rank subspace and alter the training dynamics; they may further require a full-rank warm start. In this work, we propose Gradient Low-Rank Projection (GaLore), a training strategy that allows full-parameter learning but is more memory-efficient than common low-rank adaptation methods such as LoRA. Our approach reduces memory usage by up to 65.5% in optimizer states while maintaining both efficiency and performance for pre-training LLaMA 1B and 7B architectures on the C4 dataset with up to 19.7B tokens, and for fine-tuning RoBERTa on GLUE tasks. Our 8-bit GaLore further reduces optimizer memory by up to 82.5% and total training memory by 63.3%, compared to a BF16 baseline. Notably, we demonstrate, for the first time, the feasibility of pre-training a 7B model on consumer GPUs with 24GB of memory (e.g., NVIDIA RTX 4090) without model parallelism, checkpointing, or offloading strategies.
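To make the idea above concrete, below is a minimal sketch (assuming PyTorch) of gradient low-rank projection as described in the abstract: the full-rank gradient of a weight matrix is projected onto a low-rank subspace spanned by its top left singular vectors, Adam-style optimizer states are kept only in that small subspace, and the resulting update is projected back to full rank. Names such as `rank`, `update_proj_gap`, and `scale`, and the choice to re-initialize moments when the projector is refreshed, are illustrative assumptions, not the paper's reference implementation.

```python
# Minimal sketch of gradient low-rank projection (illustrative only).
# W is treated as a plain tensor, not an nn.Parameter.
import torch

class GaLoreAdamSketch:
    def __init__(self, rank=128, update_proj_gap=200, scale=0.25,
                 lr=1e-3, betas=(0.9, 0.999), eps=1e-8):
        self.rank, self.gap, self.scale = rank, update_proj_gap, scale
        self.lr, self.betas, self.eps = lr, betas, eps
        self.P = None                # low-rank projector, shape (m, r)
        self.m1 = self.m2 = None     # Adam moments kept in the small (r, n) subspace
        self.t = 0                   # global step counter
        self.k = 0                   # steps since the projector was last refreshed

    def step(self, W, G):
        """Apply one update to weight W (m x n) given its full-rank gradient G."""
        self.t += 1
        # Periodically refresh the projector from the current gradient's
        # top-r left singular vectors.
        if self.P is None or (self.t - 1) % self.gap == 0:
            U, _, _ = torch.linalg.svd(G, full_matrices=False)
            self.P = U[:, :self.rank]
            # Re-initialize moments in the new subspace (a simplification;
            # carrying state across subspace switches is another option).
            self.m1 = self.m2 = None
            self.k = 0
        self.k += 1
        R = self.P.T @ G             # projected gradient, shape (r, n)
        if self.m1 is None:
            self.m1 = torch.zeros_like(R)
            self.m2 = torch.zeros_like(R)
        b1, b2 = self.betas
        # Adam statistics live on the small (r, n) matrices; this is where the
        # optimizer-state memory saving comes from.
        self.m1 = b1 * self.m1 + (1 - b1) * R
        self.m2 = b2 * self.m2 + (1 - b2) * R * R
        m_hat = self.m1 / (1 - b1 ** self.k)
        v_hat = self.m2 / (1 - b2 ** self.k)
        N = m_hat / (v_hat.sqrt() + self.eps)
        # Project the low-rank update back to the full weight space.
        W -= self.lr * self.scale * (self.P @ N)
        return W

# Toy usage on a random matrix (a stand-in for a real layer and its gradient).
W = torch.randn(1024, 1024)
opt = GaLoreAdamSketch(rank=64)
for _ in range(5):
    G = torch.randn_like(W)
    W = opt.step(W, G)
```

The memory saving follows from the shapes: for an m x n weight, full Adam keeps two m x n moment matrices, while this sketch keeps two r x n matrices plus an m x r projector, which is much smaller when r is well below m.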