
Parameter-Efficient Transfer Learning with Diff Pruning (2012.07463v2)

Published 14 Dec 2020 in cs.CL and cs.LG

Abstract: While task-specific finetuning of pretrained networks has led to significant empirical advances in NLP, the large size of networks makes finetuning difficult to deploy in multi-task, memory-constrained settings. We propose diff pruning as a simple approach to enable parameter-efficient transfer learning within the pretrain-finetune framework. This approach views finetuning as learning a task-specific diff vector that is applied on top of the pretrained parameter vector, which remains fixed and is shared across different tasks. The diff vector is adaptively pruned during training with a differentiable approximation to the L0-norm penalty to encourage sparsity. Diff pruning becomes parameter-efficient as the number of tasks increases, as it requires storing only the nonzero positions and weights of the diff vector for each task, while the cost of storing the shared pretrained model remains constant. It further does not require access to all tasks during training, which makes it attractive in settings where tasks arrive in stream or the set of tasks is unknown. We find that models finetuned with diff pruning can match the performance of fully finetuned baselines on the GLUE benchmark while only modifying 0.5% of the pretrained model's parameters per task.

Authors (3)
  1. Demi Guo (11 papers)
  2. Alexander M. Rush (115 papers)
  3. Yoon Kim (92 papers)
Citations (340)

Summary

  • The paper introduces diff pruning, a method that adds a sparse task-specific diff vector to pretrained models, altering only 0.5% of parameters per task.
  • It demonstrates that diff pruning matches full finetuning performance on benchmarks like GLUE and SQuAD using minimal additional parameters.
  • The approach enables efficient dynamic task adaptation for storage-constrained environments and paves the way for future advances in model sparsity and compression.

Parameter-Efficient Transfer Learning with Diff Pruning: A Detailed Analysis

This essay provides an in-depth examination of the paper "Parameter-Efficient Transfer Learning with Diff Pruning," authored by Demi Guo, Alexander M. Rush, and Yoon Kim. The paper presents a novel method named "diff pruning," which addresses the limitations of deploying large pretrained networks in storage-constrained environments by enabling parameter-efficient transfer learning across multiple tasks.

Overview and Methodology

Diff pruning emerges as a solution to the challenges posed by traditional task-specific finetuning of pretrained deep networks. This conventional approach is often infeasible in environments where memory and storage are limited, particularly when the same pretrained model must be adapted to numerous tasks simultaneously. The diff pruning strategy circumvents this problem by learning a task-specific difference vector, referred to as the "diff" vector, which is added on top of the frozen pretrained parameters rather than altering them directly.
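Concretely, the setup described above can be written as a small per-task objective. The sketch below uses notation chosen to mirror the abstract (θ for the shared pretrained parameters, δ_τ for the task-specific diff, λ for the sparsity weight); it is not the paper's own notation.

```latex
% Sketch of the diff-pruning formulation: the pretrained parameters \theta are
% frozen and shared across tasks; each task \tau learns only a sparse diff
% vector \delta_\tau that is added on top of them.
\theta_\tau = \theta + \delta_\tau,
\qquad
\min_{\delta_\tau} \; \mathcal{L}_\tau\!\left(\theta + \delta_\tau\right)
  + \lambda \, \lVert \delta_\tau \rVert_0
```

Only the nonzero entries of δ_τ need to be stored for each task, which is where the parameter savings come from as the number of tasks grows.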

The task-specific diff vector is optimized with a differentiable approximation to the L0-norm penalty to ensure parameter sparsity. This formulation is well suited to scenarios where tasks are introduced dynamically, such as on-device applications where models must be adaptable without comprehensive retraining for each task. Notably, the authors demonstrate that diff pruning achieves performance parity with fully finetuned models on established benchmarks like GLUE, while altering only 0.5% of the pretrained model's parameters per task. This is a substantial improvement in efficiency compared to existing parameter-efficient approaches such as Adapters.
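As a rough illustration of how such a relaxation can be implemented, the sketch below uses a stretched-sigmoid gate per parameter in the spirit of the hard-concrete distribution of Louizos et al. (2018), which is the style of differentiable L0 approximation described above. The class name, attribute names, hyperparameter values, and usage lines are illustrative assumptions, not the paper's code.

```python
import torch
import torch.nn as nn

class GatedDiff(nn.Module):
    """Sparse diff vector: element-wise gate times magnitude (hard-concrete-style sketch)."""

    def __init__(self, num_params, beta=2.0 / 3.0, gamma=-0.1, zeta=1.1):
        super().__init__()
        self.w = nn.Parameter(torch.zeros(num_params))          # diff magnitudes
        self.log_alpha = nn.Parameter(torch.zeros(num_params))  # gate locations
        self.beta, self.gamma, self.zeta = beta, gamma, zeta

    def sample_gate(self):
        # Stretched, hard-clipped sigmoid of a logistic sample -> gate z in [0, 1].
        u = torch.rand_like(self.log_alpha).clamp(1e-6, 1 - 1e-6)
        s = torch.sigmoid((u.log() - (1 - u).log() + self.log_alpha) / self.beta)
        s = s * (self.zeta - self.gamma) + self.gamma
        return s.clamp(0.0, 1.0)

    def forward(self):
        # The diff vector is the element-wise product of gate and magnitude.
        return self.sample_gate() * self.w

    def expected_l0(self):
        # Differentiable surrogate for the number of nonzero diff entries.
        return torch.sigmoid(
            self.log_alpha
            - self.beta * torch.log(torch.tensor(-self.gamma / self.zeta))
        ).sum()

# Usage sketch: add the sampled diff to the frozen pretrained parameters and
# add lambda * expected_l0() to the task loss to encourage sparsity.
gate = GatedDiff(num_params=10)
delta = gate()
penalty = gate.expected_l0()
```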

Experimental Results and Implications

The empirical evaluation of diff pruning spans a variety of NLP tasks within the GLUE benchmark and the SQuAD question answering dataset. The methodology was tested against several baselines, including full finetuning and adapter-based approaches, with results consistently showing that diff pruning maintains competitive accuracy with significantly fewer additional parameters per task.

The findings imply that dynamic task adaptation can be achieved without the overhead traditionally associated with deep model finetuning. Moreover, diff pruning provides a viable pathway for the practical deployment of large pretrained language models in settings where storage is constrained and tasks are dynamically introduced.
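A minimal sketch of what this per-task storage could look like in practice is shown below, assuming the model's parameters are viewed as one flat vector. The function names and data layout are hypothetical, not the paper's API.

```python
import torch

def compress_diff(diff: torch.Tensor):
    """Keep only the nonzero positions and values of a flattened diff vector."""
    idx = diff.nonzero(as_tuple=True)[0]
    return idx, diff[idx]

def apply_diff(pretrained: torch.Tensor, idx: torch.Tensor, vals: torch.Tensor):
    """Reconstruct task-specific parameters from the shared pretrained vector."""
    params = pretrained.clone()
    params[idx] += vals
    return params

# Usage sketch: with ~0.5% nonzero entries, each additional task costs only a
# small fraction of the full model, while one pretrained copy is shared by all.
pretrained = torch.randn(1_000)
diff = torch.zeros(1_000)
diff[torch.randperm(1_000)[:5]] = torch.randn(5)   # a sparse, task-specific diff
idx, vals = compress_diff(diff)
task_params = apply_diff(pretrained, idx, vals)
```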

Theoretical and Practical Implications

From a theoretical standpoint, diff pruning represents a convergence of efficient learning methodologies, borrowing principles from both model compression and transfer learning. The use of differentiable approximations to the L0-norm penalty introduces a nuanced method of enforcing sparsity, which could have broader implications for sparsity control in deep learning. Practically, this method offers a scalable, adaptable solution for deploying pretrained models across a spectrum of tasks without necessitating extensive computational resources or retraining over the entire dataset.

Future Directions

Looking forward, there are several promising avenues for future research rooted in the insights garnered from diff pruning. One potential area is the integration of diff pruning with ongoing developments in model compression and quantization techniques, which could further optimize the balance between model size, performance, and inference speed. Additionally, exploring the applicability of diff pruning across other domains, such as computer vision or speech processing, could reveal new dimensions to this approach.

Ultimately, the introduction of diff pruning marks a crucial step towards the sustainable deployment of large-scale neural architectures in heterogeneous task environments, offering a blueprint for future advancements in parameter-efficient model adaptation.