
Abstract

We propose a simple approach for memory-efficient adaptation of pretrained language models. Our approach uses an iterative algorithm to decompose each pretrained matrix into a high-precision low-rank component and a memory-efficient quantized component. During finetuning, the quantized component remains fixed and only the low-rank component is updated. We present an integer linear programming formulation of the quantization component which enables dynamic configuration of quantization parameters (e.g., bit-width, block size) for each matrix given an overall target memory budget. We further explore a data-aware version of the algorithm which uses an approximation of the Fisher information matrix to weight the reconstruction objective during matrix decomposition. Experiments on finetuning RoBERTa and LLaMA-2 (7B and 70B) demonstrate that our low-rank plus quantized matrix decomposition approach (LQ-LoRA) outperforms strong QLoRA and GPTQ-LoRA baselines and enables aggressive quantization to sub-3 bits with only minor performance degradations. When finetuned on a language modeling calibration dataset, LQ-LoRA can also be used for model compression; in this setting our 2.75-bit LLaMA-2-70B model (which has 2.85 bits on average when including the low-rank components and requires 27GB of GPU memory) performs respectably compared to the 16-bit baseline.

Figure: LQ-LoRA LLaMA-2 models' performance on C4/Wikipedia/MMLU; Vicuna eval based on the OpenAssistant dataset.

Overview

  • LQ-LoRA introduces a memory-efficient method for fine-tuning LLMs by decomposing weight matrices into a low-rank component for updates and a fixed quantized component.

  • The method employs randomized Singular Value Decomposition (SVD) and NormalFloat quantization, along with a mixed-quantization strategy optimized via integer linear programming to balance storage constraints and quantization error.

  • Evaluations on tasks like language modeling, instruction tuning, and the GLUE benchmark demonstrate that LQ-LoRA outperforms existing methods, especially under aggressive quantization settings.

An Analytical Summary of LQ-LoRA: Low-rank Plus Quantized Matrix Decomposition for Efficient Language Model Finetuning

The paper presents LQ-LoRA, a novel approach for memory-efficient adaptation of pretrained LLMs. The method decomposes each pretrained weight matrix into a low-rank component, which is updated during fine-tuning, and a quantized component, which remains fixed. This decomposition is aimed at significantly reducing the memory footprint of the fine-tuning phase.

Methodology and Key Components

1. Low-Rank Plus Quantized Matrix Decomposition

LQ-LoRA's primary innovation is the iterative decomposition of a pretrained matrix W into a quantized matrix Q and a low-rank product L1L2 via a simple yet effective algorithm:

  1. Initialization: Q is initialized to zero.
  2. Low-Rank Approximation: randomized Singular Value Decomposition (SVD) is used to approximate W - Q with the low-rank product L1L2.
  3. Quantization: NormalFloat (NF) quantization is applied to the residual matrix W - L1L2 to obtain Q.

Steps 2 and 3 alternate until a stopping criterion is met, driving down the reconstruction error and ensuring that the high-variance subspaces of W are captured by L1L2. A minimal sketch of this procedure is given below.
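The following is a minimal, illustrative PyTorch sketch of this loop. The names `lq_decompose` and `quantize_nf` are placeholders rather than the paper's code, and the NF quantizer is stood in for by a simple per-block absmax quantize/dequantize round trip (in practice one would use an NF implementation such as the one in bitsandbytes).

```python
# Minimal sketch of the iterative low-rank + quantized decomposition.
# `quantize_nf` is a placeholder for an NF quantize/dequantize round trip.
import torch

def lq_decompose(W: torch.Tensor, rank: int, num_iters: int = 10, bits: int = 4):
    """Decompose W ≈ Q + L1 @ L2 with Q quantized and L1 @ L2 low-rank."""
    Q = torch.zeros_like(W)          # step 1: initialize Q to zero
    L1 = L2 = None
    for _ in range(num_iters):
        # step 2: low-rank approximation of the residual W - Q via randomized SVD
        U, S, V = torch.svd_lowrank(W - Q, q=rank)
        L1 = U * S.sqrt()            # shape (d_out, r)
        L2 = (V * S.sqrt()).T        # shape (r, d_in)
        # step 3: quantize (and dequantize) the residual W - L1 @ L2
        Q = quantize_nf(W - L1 @ L2, bits=bits)
    error = torch.norm(W - Q - L1 @ L2)
    return Q, L1, L2, error

def quantize_nf(X: torch.Tensor, bits: int = 4, block_size: int = 64) -> torch.Tensor:
    """Placeholder: simulate block-wise quantization with per-block absmax rounding."""
    flat = X.flatten()
    pad = (-flat.numel()) % block_size
    flat = torch.cat([flat, flat.new_zeros(pad)]).view(-1, block_size)
    scale = flat.abs().amax(dim=1, keepdim=True).clamp_min(1e-8)
    levels = 2 ** (bits - 1) - 1
    deq = torch.round(flat / scale * levels) / levels * scale
    return deq.flatten()[: X.numel()].view_as(X)
```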

2. Mixed-Configuration Quantization via Integer Linear Programming

Because different layers and matrices vary in how sensitive they are to quantization, the authors propose a mixed-quantization strategy optimized with integer linear programming (ILP). The ILP dynamically allocates bit-widths and quantization configurations (e.g., block sizes) across matrices while respecting an overall memory budget, minimizing the total quantization error subject to the storage constraint.
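As a rough illustration of how such an assignment can be expressed, the sketch below formulates the per-matrix configuration choice as a 0-1 ILP using the PuLP library. The inputs `errors[i][c]` (reconstruction error of matrix i under configuration c) and `costs[i][c]` (its storage cost in bytes) are assumed to be precomputed; the exact objective and cost model in the paper may differ.

```python
# Illustrative 0-1 ILP: choose one quantization config per matrix to minimize
# total reconstruction error subject to an overall memory budget.
import pulp

def assign_configs(errors, costs, budget_bytes):
    n_mats, n_cfgs = len(errors), len(errors[0])
    prob = pulp.LpProblem("lq_lora_config", pulp.LpMinimize)
    # x[i][c] = 1 if matrix i uses quantization configuration c
    x = [[pulp.LpVariable(f"x_{i}_{c}", cat="Binary") for c in range(n_cfgs)]
         for i in range(n_mats)]
    # objective: total quantization error across all matrices
    prob += pulp.lpSum(errors[i][c] * x[i][c]
                       for i in range(n_mats) for c in range(n_cfgs))
    # each matrix gets exactly one configuration
    for i in range(n_mats):
        prob += pulp.lpSum(x[i]) == 1
    # total storage must fit within the target memory budget
    prob += pulp.lpSum(costs[i][c] * x[i][c]
                       for i in range(n_mats) for c in range(n_cfgs)) <= budget_bytes
    prob.solve()
    return [max(range(n_cfgs), key=lambda c: pulp.value(x[i][c]))
            for i in range(n_mats)]
```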

3. Data-Aware Matrix Decomposition

The paper extends the basic matrix decomposition algorithm by incorporating a diagonal approximation of the Fisher information matrix, estimated from calibration data, to weight the reconstruction objective. This Fisher-weighted factorization prioritizes the parameters to which the loss is most sensitive, yielding a data-aware decomposition that is more robust to quantization noise.
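A simplified sketch of one way to realize this idea is shown below: a diagonal Fisher estimate is accumulated as mean squared gradients over calibration batches, and the low-rank fit of the residual is weighted by per-row sensitivities. This is an illustrative approximation rather than the paper's exact weighting scheme, and `diagonal_fisher`, `weighted_lowrank`, and `loss_fn` are hypothetical names.

```python
# Sketch of a data-aware (Fisher-weighted) low-rank fit.
# Assumption: the Fisher diagonal is approximated by mean squared gradients
# over calibration batches, and only a per-row weighting is applied here.
import torch

def diagonal_fisher(model, calib_loader, loss_fn):
    """Accumulate mean squared gradients per parameter over calibration data."""
    fisher = {n: torch.zeros_like(p) for n, p in model.named_parameters()}
    for batch in calib_loader:
        model.zero_grad()
        loss_fn(model, batch).backward()
        for n, p in model.named_parameters():
            if p.grad is not None:
                fisher[n] += p.grad.detach() ** 2
    return {n: f / len(calib_loader) for n, f in fisher.items()}

def weighted_lowrank(residual: torch.Tensor, fisher_diag: torch.Tensor, rank: int):
    """Low-rank fit of `residual`, weighted by per-row Fisher sensitivity."""
    d = fisher_diag.mean(dim=1).sqrt().clamp_min(1e-8)   # per-row weight
    U, S, V = torch.svd_lowrank(residual * d[:, None], q=rank)
    L1 = (U * S.sqrt()) / d[:, None]                      # undo the row scaling
    L2 = (V * S.sqrt()).T
    return L1, L2
```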

Experimental Evaluation

The authors evaluated LQ-LoRA on three primary tasks: language modeling, instruction tuning, and finetuning on the GLUE benchmark, using LLaMA-2 (7B and 70B) for the first two and RoBERTa-Large for GLUE.

1. Language Modeling and Instruction Tuning

Experiments with LLaMA-2 models show that LQ-LoRA generally outperforms QLoRA and GPTQ-LoRA, especially in aggressive quantization regimes (sub-3 bits). For instance, the 2.75-bit LQ-LoRA models exhibit only minor performance degradation relative to the more memory-intensive 4-bit QLoRA while providing substantial memory savings.

2. GLUE Benchmark Finetuning

On the GLUE tasks with RoBERTa-Large, LQ-LoRA consistently outperforms QLoRA, particularly in the 2.5- to 3.5-bit quantization range, reinforcing the benefit of LQ-LoRA's flexible quantization and data-aware initialization across varied NLP applications.

Discussion and Implications

The LQ-LoRA approach offers significant practical advantages for deploying LLMs in memory- and compute-constrained environments. By enabling effective fine-tuning even under aggressive quantization, it broadens the accessibility and applicability of large language models.

Limitations and Future Work

While effective, the iterative matrix decomposition process remains heuristic, lacking strong theoretical grounding. Future work may focus on more principled optimization algorithms. Extending LQ-LoRA to integrate mixed-rank decomposition could further optimize performance, though recent experiments with hybrid initialization did not show improvement.

Additionally, further exploration in dynamically adjusting the rank and quantization configuration based on downstream task performance would be of interest. The empirical evidence suggests that incorporating more nuanced data-aware strategies could yield even more performant models.

Overall, LQ-LoRA represents a significant contribution to the toolkit for efficient LLM adaptation, emphasizing the importance of informed and dynamic model calibration for practical AI deployments.
