Abstract

Parameter-efficient fine-tuning (PEFT) techniques make it possible to efficiently adapt a language model to create "expert" models that specialize to new tasks or domains. Recent techniques in model merging and compositional generalization leverage these expert models by dynamically composing modules to improve zero/few-shot generalization. Despite the efficiency of PEFT methods, the size of expert models can make it onerous to retrieve expert models per query over high-latency networks like the Internet or serve multiple experts on a single GPU. To address these issues, we present ComPEFT, a novel method for compressing fine-tuning residuals (task vectors) of PEFT-based models. ComPEFT employs sparsification and ternary quantization to reduce the size of the PEFT module without performing any additional retraining while preserving or enhancing model performance. In extensive evaluation across T5, T0, and LLaMA-based models with 200M - 65B parameters, ComPEFT achieves compression ratios of 8x - 50x. In particular, we show that ComPEFT improves with scale - stronger models exhibit higher compressibility and better performance. For example, we show that ComPEFT applied to LLaMA outperforms QLoRA by 4.16% on MMLU with a storage size reduction of up to 26x. In addition, we show that the compressed experts produced by ComPEFT maintain few-shot compositional generalization capabilities, facilitate efficient communication and computation, and exhibit enhanced performance when merged. Lastly, we provide an analysis of different method components, compare ComPEFT with other PEFT methods, and test ComPEFT's efficacy for compressing the residual of full fine-tuning. Our code is available at https://github.com/prateeky2806/compeft.

Overview

  • ComPEFT introduces an innovative methodology utilizing sparsification and ternary quantization to compress fine-tuning residuals (task vectors) in language models, significantly reducing communication overheads.

  • The method achieves substantial compression ratios (8x to 50x) and often enhances model performance, as demonstrated across various models, including T5, T0, and LLaMA.

  • ComPEFT not only maintains few-shot compositional generalization but also enables efficient model merging and proves to be Pareto-optimal in terms of storage cost versus performance compared to other parameter-efficient fine-tuning (PEFT) methods.

ComPEFT: Compression for Communicating Parameter Efficient Updates via Sparsification and Quantization

The paper "ComPEFT: Compression for Communicating Parameter Efficient Updates via Sparsification and Quantization" introduces a novel methodology, ComPEFT, aimed at mitigating the communication overheads associated with parameter-efficient fine-tuning (PEFT) in language models. The central innovation of ComPEFT lies in its ability to compress fine-tuning residuals, known as task vectors, using sparsification and ternary quantization techniques. These compressed task vectors facilitate efficient communication and computation without additional retraining, and even enhance model performance in certain scenarios.

Key Contributions and Results

  1. Compression Efficiency: ComPEFT achieves significant compression ratios ranging from 8x to 50x across various models, including T5, T0, and LLaMA with parameters ranging from 200M to 65B. For instance, ComPEFT applied to the LLaMA-65B model achieves a storage size reduction of up to 26 times compared to the original QLoRA checkpoint, while also improving performance by 4.16% on the MMLU benchmark.
  2. Enhanced Performance: The method is shown to improve with scale; stronger models exhibit higher compressibility and better performance post-compression. Specifically, the performance of LLaMA models on the MMLU benchmark improves progressively with model size after applying ComPEFT: 7B (0.54%), 13B (1.06%), 33B (3.44%), and 65B (4.16%).
  3. Preservation of Compositional Generalization: The compressed experts produced by ComPEFT maintain few-shot compositional generalization capabilities. The method is evaluated within the LoraHub framework on the Big-Bench-Hard benchmark and shows that compressed models perform similarly to the original ones in terms of compositional generalization.
  4. Efficient Model Merging: Merging ComPEFT-compressed checkpoints outperforms strong baselines such as Task Arithmetic and TIES-Merging in the majority of settings, yielding an average improvement of 1.4% across the settings considered (a hedged merging sketch follows this list).
  5. Pareto-Optimal PEFT: When compared with other PEFT methods, ComPEFT is identified as Pareto-optimal in terms of storage cost versus performance: compressed (IA)$^3$ and LoRA modules require substantially less storage while matching or exceeding the performance of existing methods.
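
Since the merging results rely on combining several task vectors on top of one base model, a hedged sketch of simple task-arithmetic-style merging is shown below. The single scaling coefficient `lam` and the plain element-wise summation are illustrative assumptions; they are not the exact TIES-Merging or ComPEFT merging recipe.

```python
import torch

def merge_task_vectors(base_state_dict, task_vectors, lam=1.0):
    """Task-arithmetic-style merge: add a scaled sum of task vectors to the base.

    `task_vectors` is a list of dicts mapping parameter names to residual
    tensors (e.g., the ternary, rescaled residuals produced by ComPEFT).
    The single coefficient `lam` stands in for whatever scaling a given
    merging method would tune.
    """
    merged = {}
    for name, base_param in base_state_dict.items():
        update = sum(tv[name] for tv in task_vectors)
        merged[name] = base_param + lam * update
    return merged
```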

Methodological Insights

ComPEFT's approach involves two main steps:

  1. Sparsification: This step involves resetting a proportion of the task vector parameters to zero, effectively pruning the task vector. The sparsification is conducted based on the magnitude of parameter values, retaining only the most significant updates.
  2. Ternary Quantization: The remaining non-zero elements in the sparsified task vector are quantized into a ternary vector, where each element is either -1, 0, or +1. This ternary vector is then rescaled using the standard deviation of the original task vector, with the overall magnitude adjusted by a single scalar constant $\alpha$ that is tuned on a small validation set (a minimal sketch of both steps follows this list).
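
The sketch below applies both steps to a single flattened task vector. The top-k magnitude pruning, the use of the original vector's standard deviation as the base scale, and the `keep_ratio` and `alpha` parameters are assumptions drawn from the description above; details such as per-tensor versus global pruning may differ from the paper's implementation.

```python
import torch

def compeft_compress(tau: torch.Tensor, keep_ratio: float = 0.1, alpha: float = 1.0):
    """Sparsify a task vector and quantize the survivors to {-1, 0, +1}.

    Sketch under stated assumptions:
      1. Keep only the largest-magnitude `keep_ratio` fraction of entries.
      2. Replace surviving entries by their sign (ternary quantization).
      3. Rescale by alpha * std(tau); alpha would be tuned on a small
         validation set in practice.
    """
    flat = tau.flatten()
    k = max(1, int(keep_ratio * flat.numel()))

    # 1. Magnitude-based sparsification: zero out entries below the top-k threshold.
    threshold = flat.abs().topk(k).values.min()
    mask = flat.abs() >= threshold

    # 2. Ternary quantization of the surviving entries.
    ternary = torch.sign(flat) * mask

    # 3. Rescale with a single scalar derived from the original task vector.
    scale = alpha * flat.std()
    return (ternary * scale).view_as(tau)
```

Because the result is a sign pattern plus a single scalar per tensor, it can be stored far more compactly than 16-bit residuals, which is where the reported 8x to 50x reductions come from.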

Implications and Future Directions

Practical Implications: ComPEFT addresses the high communication costs associated with retrieving multiple expert models over high-latency networks. This enhancement is particularly beneficial for scenarios involving dynamic model merging and compositional generalization, where quick and efficient retrieval of model updates is crucial. The method significantly reduces the storage and communication overhead, making it feasible to handle large-scale model updates in practical deployments.

Theoretical Implications: The findings suggest that the intrinsic dimensionality of task vectors in PEFT methods is remarkably low, particularly as model size increases. This implies that even extensive updates in large models may be represented compactly without substantial loss in performance, paving the way for efficient, scalable adaptations of LLMs.

Future Directions: Future research could focus on exploring the application of ComPEFT to a broader array of language models and PEFT methods. Further investigation into the theoretical underpinnings of task vector compressibility, especially in relation to model scaling laws, could yield deeper insights. Additionally, the development of more sophisticated quantization and sparsification techniques could further enhance the performance and efficiency of ComPEFT.

Conclusion

ComPEFT presents a sophisticated methodology for compressing parameter updates in fine-tuning scenarios. By leveraging sparsification and ternary quantization, it achieves impressive compression ratios and often enhances model performance. Its practical benefits in terms of efficient communication and storage, coupled with its theoretical implications for understanding the intrinsic dimensionality of task vectors, position ComPEFT as a valuable advancement in the domain of parameter-efficient fine-tuning.
