VeLoRA: Memory Efficient Training using Rank-1 Sub-Token Projections

(2405.17991)
Published May 28, 2024 in cs.CV and cs.AI

Abstract

LLMs have recently emerged as powerful tools for tackling many language-processing tasks. Despite their success, training and fine-tuning these models is still far too computationally and memory intensive. In this paper, we identify and characterise the important components needed for effective model convergence using gradient descent. In doing so we find that the intermediate activations used to implement backpropagation can be excessively compressed without incurring any degradation in performance. This result leads us to a cheap and memory-efficient algorithm for both fine-tuning and pre-training LLMs. The proposed algorithm simply divides the tokens up into smaller sub-tokens before projecting them onto a fixed 1-dimensional subspace during the forward pass. These features are then coarsely reconstructed during the backward pass to implement the update rules. We confirm the effectiveness of our algorithm as being complementary to many state-of-the-art PEFT methods on the VTAB-1k fine-tuning benchmark. Furthermore, we outperform QLoRA for fine-tuning LLaMA and show competitive performance against other memory-efficient pre-training methods on the large-scale C4 dataset.

PEFT methods and VeLoRA reduce the memory overhead of backpropagation by cutting down the storage required for activations and weights.

Overview

  • VeLoRA (Vector Projected LoRA) is introduced as a method to address the memory consumption issues in training LLMs by compressing intermediate activations onto a one-dimensional subspace during the forward pass and reconstructing them during the backward pass.

  • Experimental results demonstrate significant memory savings without compromising performance, validated across various benchmarks including VTAB-1k, GLUE, and MMLU, as well as vision transformers and LLMs like LLaMA.

  • The adoption of VeLoRA could democratize access to advanced AI research by reducing hardware requirements, although future exploration is needed for its application beyond Transformer-based models and its effect on total training time.

Vector Projected LoRA (VeLoRA): A Novel Approach for Efficient Training of LLMs

Introduction

The exponential growth in the size of LLMs has posed significant challenges in terms of computational expense and memory consumption during training. Recent advancements in natural language processing have showcased the potential of LLMs, but their practical implementation often encounters bottlenecks due to the substantial resources required for storing intermediate activations and computing gradients. Several techniques, such as GaLore, gradient checkpointing, and activation offloading, have been developed to mitigate these memory constraints; however, they either introduce notable computational overhead, offer only limited memory savings, or require specialized hardware.

Motivation and Objective

Given the primary role of compute power in advancing machine learning, and the expectation that LLM sizes will continue to grow, developing methods that are both efficient and scalable is imperative. This paper introduces Vector Projected LoRA (VeLoRA), a novel approach designed to address the memory consumption issue without compromising model performance. VeLoRA exploits the observation that intermediate activations can be effectively compressed and reconstructed using a fixed one-dimensional projection vector, significantly reducing the memory required for backward propagation.

Proposed Method: VeLoRA

VeLoRA achieves memory efficiency by projecting intermediate activations onto a lower-dimensional subspace during the forward pass and reconstructing them during the backward pass. The process involves the following key steps:

  1. Grouping: Dividing input tokens into smaller sub-tokens.
  2. Projection: Using a single, fixed projection vector initialized with first-order batch statistics to compress these sub-tokens into a one-dimensional subspace.
  3. Reconstruction: Reconstructing the original tokens during the backward pass using the same projection vector.

This compression method is computationally light: it avoids costly operations such as Singular Value Decomposition (SVD) and the recomputation required by gradient checkpointing. Furthermore, because the projection vector is fixed, it never needs to be updated during training, further reducing computational overhead.
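The mechanics are simple enough to sketch in a few lines. The PyTorch example below is a minimal illustration of the three steps above for a single linear layer, assuming the token dimension divides evenly into sub-tokens and that the projection vector v has unit norm; the class and helper names (VeLoRALinearFn, init_projection) are illustrative and not taken from the paper's code.

```python
import torch

class VeLoRALinearFn(torch.autograd.Function):
    """Sketch of a linear layer that caches rank-1 compressed activations.

    Instead of saving the full input x for the backward pass, each token is
    split into sub-tokens that are projected onto a fixed vector v; only the
    resulting scalars are stored, and x is coarsely reconstructed as
    x_hat = (x_sub . v) v when the weight gradient is computed.
    """

    @staticmethod
    def forward(ctx, x, weight, v):
        # x: (batch, tokens, d), weight: (d_out, d), v: (sub_dim,) with ||v|| = 1
        out = x @ weight.t()
        b, t, d = x.shape
        sub_dim = v.shape[0]
        # 1. Grouping: split each d-dim token into d // sub_dim sub-tokens.
        x_sub = x.reshape(b, t, d // sub_dim, sub_dim)
        # 2. Projection: keep only one scalar per sub-token.
        coeffs = x_sub @ v                      # (b, t, d // sub_dim)
        ctx.save_for_backward(coeffs, weight, v)
        ctx.x_shape = (b, t, d)
        return out

    @staticmethod
    def backward(ctx, grad_out):
        coeffs, weight, v = ctx.saved_tensors
        b, t, d = ctx.x_shape
        # 3. Reconstruction: coarse rank-1 estimate of the saved activations.
        x_hat = (coeffs.unsqueeze(-1) * v).reshape(b, t, d)
        grad_x = grad_out @ weight              # exact, does not need x
        grad_w = grad_out.reshape(-1, grad_out.shape[-1]).t() @ x_hat.reshape(-1, d)
        return grad_x, grad_w, None


# The paper initialises the projection vector from first-order batch
# statistics; one simple choice consistent with that description (an
# assumption, not the paper's exact recipe) is the normalised mean
# sub-token of the first batch.
def init_projection(first_batch, sub_dim):
    x_sub = first_batch.reshape(-1, sub_dim)
    v = x_sub.mean(dim=0)
    return v / (v.norm() + 1e-8)
```

In this sketch only one scalar per sub-token is cached between the forward and backward passes, so the activation memory for that layer shrinks by roughly a factor of the sub-token dimension, at the cost of computing the weight gradient from a coarse rank-1 reconstruction rather than the exact activations.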

Experimental Results

The efficacy of VeLoRA was validated across different benchmarks, including VTAB-1k, GLUE, and MMLU, as well as tasks involving both moderate-size vision transformers and LLMs such as LLaMA. VeLoRA demonstrated substantial memory reductions without sacrificing performance:

  1. Vision Experiments (VTAB-1k): Combined with various PEFT methods like SSF, Hydra, and LoRA, VeLoRA lowered memory requirements while either maintaining or improving accuracy.
  2. RoBERTa Experiments: On the GLUE benchmark, VeLoRA reduced memory consumption by up to 45% compared to full fine-tuning, with only a minor decrease in performance.
  3. Scaling to LLaMA Models: When applied with QLoRA to LLMs, VeLoRA offered significant memory savings (roughly 14.4–15%) while improving performance when fine-tuning on Alpaca and evaluating on benchmarks such as MMLU.
  4. Pre-training on C4: VeLoRA outperformed competing methods in pre-training scenarios, showing lower validation perplexity and reducing on-device GPU memory usage.

Implications and Future Directions

The results indicate that VeLoRA can significantly alleviate memory constraints in training LLMs, thereby enabling the use of larger models on existing hardware. This method's implications are vast, potentially democratizing access to advanced AI research by lowering the hardware barrier. Furthermore, VeLoRA's compatibility with existing PEFT methods enables more efficient fine-tuning, aligning well with current trends toward resource-efficient AI development.

Conclusion

VeLoRA represents a significant step forward in addressing the memory bottlenecks associated with training large-scale LLMs. By compressing intermediate activations into a fixed low-dimensional space, VeLoRA offers a practical solution that enhances memory efficiency while maintaining model performance. This method's ability to integrate seamlessly with other PEFT techniques and its broad applicability across various model sizes and tasks underscore its potential to become a standard approach in the efficient training of neural networks.

Limitations and Broader Impact

The applicability of VeLoRA beyond Transformer-based models remains to be explored, and although the method significantly alleviates memory constraints, it does not address the total training time. As VeLoRA facilitates access to high-quality research for institutions with limited resources, it simultaneously raises concerns about the misuse of advanced AI technologies. Ensuring responsible use and continued assessment of the socio-ethical impact remains crucial as the field progresses.
