
How fine can fine-tuning be? Learning efficient language models

(2004.14129)
Published Apr 24, 2020 in cs.CL, cs.LG, and stat.ML

Abstract

State-of-the-art performance on language understanding tasks is now achieved with increasingly large networks; the current record holder has billions of parameters. Given a language model pre-trained on massive unlabeled text corpora, only very light supervised fine-tuning is needed to learn a task: the number of fine-tuning steps is typically five orders of magnitude lower than the total parameter count. Does this mean that fine-tuning only introduces small differences from the pre-trained model in the parameter space? If so, can one avoid storing and computing an entire model for each task? In this work, we address these questions by using Bidirectional Encoder Representations from Transformers (BERT) as an example. As expected, we find that the fine-tuned models are close in parameter space to the pre-trained one, with the closeness varying from layer to layer. We show that it suffices to fine-tune only the most critical layers. Further, we find that there are surprisingly many good solutions in the set of sparsified versions of the pre-trained model. As a result, fine-tuning of huge language models can be achieved by simply setting a certain number of entries in certain layers of the pre-trained parameters to zero, saving both task-specific parameter storage and computational cost.

Overview

  • The paper examines the effectiveness of fine-tuning pre-trained language models such as BERT for various tasks, focusing on the computational efficiency of this process.

  • Significant findings include the closeness of fine-tuned models to the original pre-trained versions and the identification of specific layers that undergo substantial changes during fine-tuning.

  • The paper proposes methods such as $L_0$-close fine-tuning and supermask training, demonstrating that sparsified models can maintain performance while reducing computational and storage costs.

Overview of "How fine can fine-tuning be? Learning efficient language models"

The paper, titled "How fine can fine-tuning be? Learning efficient language models," authored by Evani Radiya-Dixit and Xin Wang, investigates the efficacy and efficiency of fine-tuning pre-trained language models, specifically BERT, for various downstream tasks. The research responds to the need for computationally efficient strategies amid the ever-growing size of state-of-the-art language models.

Motivation and Problem Statement

The advent of large-scale language models such as BERT, GPT-2, and Megatron-LM has enabled remarkable performance in natural language understanding tasks. These models, pre-trained on vast corpora, achieve task-specific learning via fine-tuning. The observation that fine-tuning typically requires far fewer optimization steps than the model has parameters raises two primary questions:

  1. Does fine-tuning introduce minute changes in parameter space relative to pre-trained models?
  2. Can computational costs be reduced by leveraging these minor parameter modifications?

Main Contributions

The paper contributes the following insights and methodologies to address the outlined questions:

  1. Parameter Closeness: The authors quantify how close fine-tuned models remain in parameter space to the pre-trained model, finding small $L_1$ and angular distances (a distance-computation sketch follows this list). This closeness is consistent with the small number of fine-tuning iterations relative to the parameter count.
  2. Layer Sensitivity: Through empirical analysis, the paper highlights that only specific layers undergo significant changes during fine-tuning, suggesting that fine-tuning can be efficiently achieved by focusing on these critical layers.
  3. Sparsified Models: The study reveals that good task-specific performance can be attained by sparsifying a pre-trained model’s weights. Sparse fine-tuning involves setting a certain fraction of weights in selected layers to zero, thereby maintaining performance while reducing computational costs.
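To make the closeness claim concrete, the following is a minimal sketch of how per-layer $L_1$ and angular distances between a pre-trained and a fine-tuned checkpoint can be computed, assuming PyTorch state dicts with matching keys; the normalization conventions here are assumptions for illustration and may differ from the paper's exact bookkeeping.

```python
import math
import torch


def parameter_distances(pretrained_state, finetuned_state):
    """Per-layer L1 and angular distances between two checkpoints.

    Normalization choices (mean absolute difference, angle divided by pi)
    are assumptions for illustration, not taken from the paper.
    """
    distances = {}
    for name, w_pre in pretrained_state.items():
        if name not in finetuned_state:
            continue  # e.g. the task-specific classification head
        a = w_pre.float().flatten()
        b = finetuned_state[name].float().flatten()
        l1 = (a - b).abs().mean().item()
        cos = torch.nn.functional.cosine_similarity(a, b, dim=0)
        angular = torch.acos(cos.clamp(-1.0, 1.0)).item() / math.pi
        distances[name] = {"l1": l1, "angular": angular}
    return distances
```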

Methodology

The paper's empirical framework focuses on BERT and evaluates results on the General Language Understanding Evaluation (GLUE) benchmark. Two methods are proposed:

  1. $L_0$-Close Fine-Tuning: This approach fixes the least sensitive parameters at their pre-trained values, reducing the number of parameters that are fine-tuned and yielding models that differ from the original pre-trained model in only a small number of entries, i.e., that are close in $L_0$ distance (a layer-freezing sketch follows this list).
  2. Supermask Training: This method selectively prunes the pre-trained model’s parameters with binary masks. The masks are optimized to find sparsified configurations that still perform well on downstream tasks (a masking sketch follows the layer-freezing example below).
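The $L_0$-close idea can be pictured as selective freezing: everything outside a handful of critical layers stays at its pre-trained value. The sketch below assumes a PyTorch model with named parameters (e.g. a Hugging Face-style BERT); the name patterns are hypothetical placeholders, since which layers count as critical comes from the paper's sensitivity analysis.

```python
def freeze_insensitive_parameters(model, trainable_patterns):
    """Keep only the most critical parameters trainable; freeze the rest.

    `trainable_patterns` is a list of substrings (hypothetical placeholders)
    naming the layers found to change most during fine-tuning. All other
    parameters keep their pre-trained values, so the fine-tuned model stays
    close to the original in L0 distance.
    """
    n_trainable = 0
    for name, param in model.named_parameters():
        keep = any(pattern in name for pattern in trainable_patterns)
        param.requires_grad = keep
        if keep:
            n_trainable += param.numel()
    return n_trainable


# Hypothetical usage: train only the last encoder layer and the classifier head.
# n_trainable = freeze_insensitive_parameters(model, ["encoder.layer.11", "classifier"])
```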
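Supermask training can be sketched as learning a real-valued score per weight, binarizing the scores in the forward pass, and letting gradients reach the scores through a straight-through estimator while the pre-trained weights stay frozen. The exact mask parameterization below is an assumption for illustration, not the authors' code.

```python
import torch
import torch.nn as nn


class SupermaskLinear(nn.Module):
    """Linear layer whose frozen pre-trained weight is pruned by a learned
    binary mask (an illustrative sketch of supermask training)."""

    def __init__(self, pretrained_weight, pretrained_bias=None):
        super().__init__()
        # The pre-trained parameters are frozen; only the mask scores train.
        self.weight = nn.Parameter(pretrained_weight.clone(), requires_grad=False)
        self.bias = (nn.Parameter(pretrained_bias.clone(), requires_grad=False)
                     if pretrained_bias is not None else None)
        # One real-valued score per weight entry; its sign decides keep vs. prune.
        self.scores = nn.Parameter(0.01 * torch.randn_like(self.weight))

    def forward(self, x):
        hard_mask = (self.scores > 0).float()
        # Straight-through estimator: the forward pass uses the hard mask,
        # the backward pass treats the thresholding as the identity so
        # gradients reach `scores`.
        mask = hard_mask + self.scores - self.scores.detach()
        return nn.functional.linear(x, self.weight * mask, self.bias)
```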

Results

The study provides robust numerical evidence supporting its claims:

  1. Close Parameter Space: Fine-tuned models exhibited $L_1$ distances from the pre-trained model in the range $[0.1, 3.3]$ and angular distances in the range $[0.001, 0.027]$, far smaller than the distances to random initializations.
  2. Efficient Fine-Tuning: By excluding several layers from fine-tuning, the task-specific parameter count was reduced by up to 40% with only marginal performance degradation.
  3. Sparse Fine-Tuning Effectiveness: Sparsifying up to 40% of the weights in critical layers incurred only slight performance degradation, while more aggressive sparsification yielded models with somewhat lower but still acceptable downstream task performance.

Implications

The implications of this work span both theoretical and practical domains:

  1. Theoretical Implications: The results suggest that the parameter landscape of pre-trained models contains numerous local optima that are close to the pre-trained configuration but effective for different tasks. This challenges conventional sensitivity assumptions regarding model parameters and opens new directions in model optimization landscapes.
  2. Practical Implications: From a practical standpoint, the reduction in computational cost and per-task parameter storage makes fine-tuning more sustainable and accessible, facilitating broader deployment of large language models in resource-constrained environments (a rough storage comparison follows this list).
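A back-of-the-envelope illustration of the storage argument, assuming BERT-base's roughly 110M parameters stored as 32-bit floats; the figures are illustrative estimates, not numbers taken from the paper.

```python
# Rough per-task storage comparison (illustrative numbers only).
n_params = 110_000_000  # approximate parameter count of BERT-base

full_copy_mb = n_params * 4 / 1e6    # a separate fine-tuned copy: 32-bit floats
binary_mask_mb = n_params / 8 / 1e6  # a task-specific binary mask: 1 bit per entry

print(f"full fine-tuned copy per task: ~{full_copy_mb:.0f} MB")
print(f"binary supermask per task:     ~{binary_mask_mb:.0f} MB")
# ~440 MB vs. ~14 MB; the dense pre-trained weights are stored once and shared.
```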

Future Directions

Given the findings, future research could:

  • Investigate layer-specific interactions in larger model architectures to identify optimal fine-tuning strategies.
  • Explore the implications of sparse fine-tuning across other domains beyond natural language processing, such as computer vision and reinforcement learning.
  • Develop more sophisticated mask optimization techniques that can further improve performance or reduce sparsity requirements.

Conclusion

The paper "How fine can fine-tuning be? Learning efficient language models" provides significant insights into efficient computational strategies for fine-tuning pre-trained language models. By highlighting the feasibility of limited parameter adjustments and the efficacy of sparsification, this research offers valuable contributions to the optimization and deployment of LLMs.
