Abstract

Fine-tuning and inference with large language models (LMs) are generally expensive. Parameter-efficient fine-tuning over pretrained LMs reduces training memory by updating a small number of LM parameters, but it does not improve inference efficiency. Structured pruning improves LM inference efficiency by removing consistent parameter blocks, yet it often increases training memory and time. To improve both training and inference efficiency, we introduce APT, which adaptively prunes and tunes parameters for LMs. In the early stage of fine-tuning, APT dynamically adds salient tuning parameters for fast and accurate convergence while discarding unimportant parameters for efficiency. Compared to baselines, our experiments show that APT maintains up to 98% of task performance when pruning RoBERTa and T5 models to 40% of their parameters, and retains 86.4% of LLaMA models' performance with 70% of parameters remaining. Furthermore, APT speeds up LM fine-tuning by up to 8x and reduces the training memory footprint of large LMs by up to 70%.

Overview

  • APT is a novel approach focusing on the efficiency of fine-tuning and inference for large language models (LMs) by adaptively pruning and tuning LM parameters.

  • The adaptive pruning in APT dynamically adjusts salient tuning parameters early in training and uses outlier-aware salience scoring to discard unimportant parameters efficiently.

  • APT includes an adaptive tuning process where model parameters are dynamically added based on layer importance, leading to fast LM convergence and restored performance post pruning.

  • Experimental results show APT achieving up to 98% task performance with significant parameter pruning in RoBERTa, T5, and LLaMA models, while offering substantial improvements in training efficiency and memory footprint.

  • APT contributes to improving efficiency in LMs and suggests a potent application in hardware-constrained environments, with future potential in PEFT architectures for better large-scale LM performance.

Overview of APT: Adaptive Pruning and Tuning

Adaptive Pruning and Tuning (APT) is a novel approach that addresses two critical challenges in fine-tuning and inference of large language models (LMs): high memory cost and low computational efficiency. The method adaptively prunes and tunes parameters within the LM, aiming to significantly improve both model performance and training efficiency.

Adaptive Pruning Strategies

APT's adaptive pruning dynamically adjusts the number of salient tuning parameters in the early stages of model training. Using an outlier-aware salience scoring function, APT identifies and discards unimportant parameter blocks, improving both training and inference efficiency without compromising accuracy. Unlike previous techniques that either tune a fixed set of parameters or require a fully trained teacher model for distillation, APT prunes proactively during fine-tuning. This leads to a drastic reduction in training and inference time, which is particularly noticeable when training large LMs such as LLaMA. A minimal sketch of this kind of scoring and masking is given below.
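
The paper defines the exact salience formulation; the PyTorch-style sketch below only illustrates the general idea of combining a first-order importance term with an activation-outlier term and then keeping the highest-scoring blocks. The function names, the specific combination of terms, and the `outlier_weight` parameter are illustrative assumptions, not APT's actual implementation.

```python
import torch

def outlier_aware_salience(weight, weight_grad, activations, outlier_weight=1.0):
    """Hypothetical salience score per parameter block (e.g. per FFN input column).

    Combines a first-order Taylor term |w * dL/dw| with an activation-outlier
    term, so blocks that feed large-magnitude activations are more likely kept.
    """
    # First-order importance: estimated loss change if the block were removed.
    taylor_term = (weight * weight_grad).abs().sum(dim=0)   # one score per column
    # Outlier term: columns whose input activations reach extreme magnitudes.
    outlier_term = activations.abs().amax(dim=0)            # max |activation| per column
    return taylor_term + outlier_weight * outlier_term

def prune_mask(scores, keep_ratio=0.4):
    """Boolean mask keeping the top `keep_ratio` fraction of blocks by salience."""
    k = max(1, int(keep_ratio * scores.numel()))
    threshold = scores.topk(k).values.min()
    return scores >= threshold
```

Scoring blocks (rather than individual weights) is what makes the pruning structured: whole columns or heads are removed, so the dense matrices shrink and inference genuinely speeds up.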

Adaptive Tuning Procedures

APT is not only about pruning: the retained model parameters also undergo adaptive tuning throughout the fine-tuning phase. The method dynamically adds tuning parameters to layers according to their importance, which is determined by the computed salience. This significantly accelerates LM convergence and recovers model performance after pruning. Unlike static tuning layers, APT's adaptive allocation ensures that only the most influential layers receive additional capacity, leading to efficient memory utilization during fine-tuning without requiring additional computation at inference. A simplified sketch of one way to realize this follows.
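
One concrete way such adaptive tuning can be realized is to attach low-rank adapters to frozen layers and grow the adapter rank of layers whose salience is high. The sketch below is a simplified, hypothetical illustration along those lines; the `GrowableLoRALinear` class, its `grow` method, and the initialization choices are assumptions for exposition, not APT's API.

```python
import torch
import torch.nn as nn

class GrowableLoRALinear(nn.Module):
    """Frozen linear layer plus a LoRA adapter whose rank can be grown on demand."""

    def __init__(self, base: nn.Linear, init_rank: int = 4):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False                      # keep pretrained weights frozen
        out_f, in_f = base.out_features, base.in_features
        self.lora_a = nn.Parameter(torch.randn(init_rank, in_f) * 0.01)
        self.lora_b = nn.Parameter(torch.zeros(out_f, init_rank))

    def grow(self, extra_rank: int):
        """Add `extra_rank` new components; new B columns are zero, so output is unchanged."""
        out_f, in_f = self.base.out_features, self.base.in_features
        new_a = torch.randn(extra_rank, in_f, device=self.lora_a.device) * 0.01
        new_b = torch.zeros(out_f, extra_rank, device=self.lora_b.device)
        self.lora_a = nn.Parameter(torch.cat([self.lora_a.data, new_a], dim=0))
        self.lora_b = nn.Parameter(torch.cat([self.lora_b.data, new_b], dim=1))

    def forward(self, x):
        return self.base(x) + x @ self.lora_a.T @ self.lora_b.T

# Hypothetical usage during early fine-tuning: grow adapters only where salience is high.
# for layer in layers_ranked_by_salience[:k]:
#     layer.grow(extra_rank=8)
```

Because new parameters appear mid-training, the optimizer would need to be re-created or have the new parameter groups registered after each `grow` call; after fine-tuning, the adapters can be merged into the base weights so inference cost is unchanged.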

Analysis and Comparison

Experiments with APT demonstrate its compelling capabilities. It maintains up to 98% of task performance when pruning RoBERTa and T5 models down to 40% of their parameters, and retains an impressive 86.4% of performance in LLaMA models with 70% of parameters remaining. Compared with baselines such as LoRA and structured pruning, APT improves training efficiency, fine-tuning smaller models up to 8 times faster and reducing the training memory footprint of large models like LLaMA by up to 70%.

Conclusion

APT represents a significant step toward improving training and inference efficiency in LMs. Its adaptive pruning and tuning framework plays a foundational role in maintaining high model performance with considerably fewer parameters. Moreover, it accelerates convergence and significantly reduces the memory and compute burden, making LMs practical even on hardware with strict limitations. Future research could extend APT's paradigm to more sophisticated PEFT architectures, potentially achieving even greater performance recovery in large-scale language models.
