Abstract

Fine-tuning and inference with large language models (LMs) are generally expensive. Parameter-efficient fine-tuning over pretrained LMs reduces training memory by updating a small number of LM parameters, but it does not improve inference efficiency. Structured pruning improves LM inference efficiency by removing consistent parameter blocks, yet it often increases training memory and time. To improve both training and inference efficiency, we introduce APT, which adaptively prunes and tunes parameters for LMs. In the early stage of fine-tuning, APT dynamically adds salient tuning parameters for fast and accurate convergence while discarding unimportant parameters for efficiency. Compared to baselines, our experiments show that APT maintains up to 98% of task performance when pruning RoBERTa and T5 models to 40% of their parameters, and retains 86.4% of LLaMA models' performance with 70% of parameters remaining. Furthermore, APT speeds up LM fine-tuning by up to 8x and reduces the training memory footprint of large LMs by up to 70%.

Overview

  • APT is a novel approach focusing on the efficiency of fine-tuning and inference for large language models (LMs) by adaptively pruning and tuning LM parameters.

  • The adaptive pruning in APT dynamically adjusts salient tuning parameters early in training and uses outlier-aware salience scoring to discard unimportant parameters efficiently.

  • APT includes an adaptive tuning process where model parameters are dynamically added based on layer importance, leading to fast LM convergence and restored performance post pruning.

  • Experimental results show APT achieving up to 98% task performance with significant parameter pruning in RoBERTa, T5, and LLaMA models, while offering substantial improvements in training efficiency and memory footprint.

  • APT contributes to improving efficiency in LMs and suggests a potent application in hardware-constrained environments, with future potential in PEFT architectures for better large-scale LM performance.

Overview of APT: Adaptive Pruning and Tuning

Adaptive Pruning and Tuning (APT) is a novel approach that addresses two critical challenges in fine-tuning and inference of large language models (LMs): high memory cost and low computational efficiency. The method adaptively prunes and tunes parameters within the LM, aiming to significantly improve both model performance and training efficiency.

Adaptive Pruning Strategies

APT's adaptive pruning dynamically adjusts the number of salient tuning parameters in the early stages of model training. Using an outlier-aware salience scoring function, APT identifies and discards unimportant parameter blocks, improving both training and inference efficiency without compromising accuracy. Unlike previous techniques that either tune a fixed set of parameters or require a fully trained teacher model for distillation, APT prunes proactively during fine-tuning. This leads to a drastic reduction in training and inference time, which is particularly noticeable when training large LMs such as LLaMA. A minimal sketch of this kind of scoring and masking is given below.
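
The paper defines the exact salience formulation; the PyTorch-style sketch below only illustrates the general idea of combining a first-order importance term with an activation-outlier term and then keeping the highest-scoring blocks. The function names, the specific combination of terms, and the `outlier_weight` parameter are illustrative assumptions, not APT's actual implementation.

```python
import torch

def outlier_aware_salience(weight, weight_grad, activations, outlier_weight=1.0):
    """Hypothetical salience score per parameter block (e.g. per FFN input column).

    Combines a first-order Taylor term |w * dL/dw| with an activation-outlier
    term, so blocks that feed large-magnitude activations are more likely kept.
    """
    # First-order importance: estimated loss change if the block were removed.
    taylor_term = (weight * weight_grad).abs().sum(dim=0)   # one score per column
    # Outlier term: columns whose input activations reach extreme magnitudes.
    outlier_term = activations.abs().amax(dim=0)            # max |activation| per column
    return taylor_term + outlier_weight * outlier_term

def prune_mask(scores, keep_ratio=0.4):
    """Boolean mask keeping the top `keep_ratio` fraction of blocks by salience."""
    k = max(1, int(keep_ratio * scores.numel()))
    threshold = scores.topk(k).values.min()
    return scores >= threshold
```

Scoring blocks (rather than individual weights) is what makes the pruning structured: whole columns or heads are removed, so the dense matrices shrink and inference genuinely speeds up.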

Adaptive Tuning Procedures

APT is not only about pruning: the retained model parameters also undergo adaptive tuning throughout the fine-tuning phase. The method dynamically adds tuning parameters to layers according to their importance, which is determined by the computed salience. This significantly accelerates LM convergence and recovers model performance after pruning. Unlike static tuning layers, APT's adaptive allocation ensures that only the most influential layers receive additional capacity, leading to efficient memory utilization during fine-tuning without requiring additional computation at inference. A simplified sketch of one way to realize this follows.
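
One concrete way such adaptive tuning can be realized is to attach low-rank adapters to frozen layers and grow the adapter rank of layers whose salience is high. The sketch below is a simplified, hypothetical illustration along those lines; the `GrowableLoRALinear` class, its `grow` method, and the initialization choices are assumptions for exposition, not APT's API.

```python
import torch
import torch.nn as nn

class GrowableLoRALinear(nn.Module):
    """Frozen linear layer plus a LoRA adapter whose rank can be grown on demand."""

    def __init__(self, base: nn.Linear, init_rank: int = 4):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False                      # keep pretrained weights frozen
        out_f, in_f = base.out_features, base.in_features
        self.lora_a = nn.Parameter(torch.randn(init_rank, in_f) * 0.01)
        self.lora_b = nn.Parameter(torch.zeros(out_f, init_rank))

    def grow(self, extra_rank: int):
        """Add `extra_rank` new components; new B columns are zero, so output is unchanged."""
        out_f, in_f = self.base.out_features, self.base.in_features
        new_a = torch.randn(extra_rank, in_f, device=self.lora_a.device) * 0.01
        new_b = torch.zeros(out_f, extra_rank, device=self.lora_b.device)
        self.lora_a = nn.Parameter(torch.cat([self.lora_a.data, new_a], dim=0))
        self.lora_b = nn.Parameter(torch.cat([self.lora_b.data, new_b], dim=1))

    def forward(self, x):
        return self.base(x) + x @ self.lora_a.T @ self.lora_b.T

# Hypothetical usage during early fine-tuning: grow adapters only where salience is high.
# for layer in layers_ranked_by_salience[:k]:
#     layer.grow(extra_rank=8)
```

Because new parameters appear mid-training, the optimizer would need to be re-created or have the new parameter groups registered after each `grow` call; after fine-tuning, the adapters can be merged into the base weights so inference cost is unchanged.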

Analysis and Comparison

Experiments with APT demonstrate its compelling capabilities. It maintains up to 98% of task performance when pruning RoBERTa and T5 models down to 40% of their parameters, and retains an impressive 86.4% of performance in LLaMA models with 70% of parameters remaining. Compared with baselines such as LoRA and structured pruning, APT improves training efficiency, fine-tuning smaller models up to 8 times faster and reducing the training memory footprint of large models like LLaMA by up to 70%.

Conclusion

APT represents a significant step toward improving training and inference efficiency in LMs. Its adaptive pruning and tuning framework plays a foundational role in maintaining high model performance with considerably fewer parameters. Moreover, it accelerates convergence and significantly reduces the memory and compute burden, making LMs practical even on hardware with strict limitations. Future research could extend APT's paradigm to more sophisticated PEFT architectures, potentially achieving even greater performance recovery in large-scale language models.
