
Abstract

While LLMs often adopt finetuning to unlock their capabilities for downstream applications, our understanding of the inductive biases (especially the scaling properties) of different finetuning methods is still limited. To fill this gap, we conduct systematic experiments studying whether and how different scaling factors, including LLM model size, pretraining data size, new finetuning parameter size and finetuning data size, affect the finetuning performance. We consider two types of finetuning -- full-model tuning (FMT) and parameter efficient tuning (PET, including prompt tuning and LoRA), and explore their scaling behaviors in the data-limited regime where the LLM model size substantially outweighs the finetuning data size. Based on two sets of pretrained bilingual LLMs from 1B to 16B and experiments on bilingual machine translation and multilingual summarization benchmarks, we find that 1) LLM finetuning follows a power-based multiplicative joint scaling law between finetuning data size and each other scaling factor; 2) LLM finetuning benefits more from LLM model scaling than pretraining data scaling, and PET parameter scaling is generally ineffective; and 3) the optimal finetuning method is highly task- and finetuning data-dependent. We hope our findings could shed light on understanding, selecting and developing LLM finetuning methods.


Overview

  • The paper investigates how LLM model size, pretraining data size, and finetuning methods (Full-Model Tuning and Parameter Efficient Tuning) affect finetuning performance on tasks such as bilingual machine translation and multilingual summarization.

  • It proposes a multiplicative joint scaling law to describe the relationship between finetuning data size and other variables, revealing the superior impact of model size scaling over pretraining data scaling on finetuning performance.

  • Findings indicate that Parameter Efficient Tuning methods, despite gaining little from scaling their own parameters, provide better zero-shot generalization, especially when finetuning data is minimal.

  • The study suggests finetuning method selection is task- and data-dependent, advising against a one-size-fits-all approach, and highlights the need for future research on finetuning data quality and multi-modal LLMs.

Exploring the Dynamics of LLM Finetuning Across Scalable Parameters

Introduction to Finetuning Scaling in LLMs

In the rapidly evolving landscape of NLP, leveraging pretrained LLMs for downstream applications has become the norm, capitalizing on the in-context learning and emergent capabilities of models like GPT-4 and PaLM 2. Despite these advances, a systematic understanding of how various factors, particularly model size, pretraining data size, new finetuning parameters, and finetuning data size, influence the effectiveness of finetuning methods remains undeveloped. This gap forms the crux of the paper's investigation, which focuses on two finetuning approaches: Full-Model Tuning (FMT) and Parameter Efficient Tuning (PET), the latter comprising methods such as prompt tuning and Low-Rank Adaptation (LoRA).
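To make the PET side concrete, the sketch below shows a minimal LoRA-style linear layer in PyTorch: the pretrained weight stays frozen and only a small low-rank update is trained. This is a generic illustration rather than the paper's implementation; the class name, rank, and scaling hyperparameters are illustrative assumptions.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Frozen linear projection plus a trainable low-rank update (illustrative LoRA sketch)."""

    def __init__(self, in_features: int, out_features: int, r: int = 8, alpha: float = 16.0):
        super().__init__()
        # Pretrained weight: frozen during finetuning.
        self.base = nn.Linear(in_features, out_features, bias=False)
        self.base.weight.requires_grad = False
        # Low-rank factors A (r x in) and B (out x r): the only new finetuning parameters.
        self.lora_a = nn.Parameter(torch.randn(r, in_features) * 0.01)
        self.lora_b = nn.Parameter(torch.zeros(out_features, r))
        self.scaling = alpha / r

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Frozen projection plus the scaled low-rank correction: x W^T + s * x A^T B^T.
        return self.base(x) + self.scaling * (x @ self.lora_a.T @ self.lora_b.T)

# Only the low-rank factors are trainable, e.g. 2 * 8 * 1024 = 16,384 parameters here.
layer = LoRALinear(1024, 1024, r=8)
print(sum(p.numel() for p in layer.parameters() if p.requires_grad))
```

Prompt tuning works in the same spirit: a small set of trainable soft-prompt embeddings is prepended to the input sequence while the LLM weights remain frozen, so "new finetuning parameter size" means prompt length for prompt tuning and rank for LoRA.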

Methodology and Experimentation

The research conducts a thorough analysis across multiple dimensions, involving LLM model sizes from 1B to 16B parameters and finetuning tasks including bilingual machine translation and multilingual summarization. The essence of this exploration is captured in a proposed multiplicative joint scaling law relating finetuning data size to each of the other scaling factors under study (a sketch of its general form follows the list below), highlighting:

  • The relative impact of scaling LLM models versus pretraining data on finetuning efficiency.
  • The limited effectiveness of scaling PET parameters.
  • Task and data dependency in the selection of optimal finetuning methods.
  • Enhanced zero-shot generalization to related tasks by PET over FMT.
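For reference, one way to write such a power-based multiplicative joint law is sketched below; the symbols A, E, α, and β are generic fitted constants and exponents, and the exact parameterization used in the paper may differ.

```latex
% Sketch of a power-based multiplicative joint scaling law.
% X   : one scaling factor (LLM model size, pretraining data size, or PET parameter size)
% D_f : finetuning data size;  A, E, \alpha, \beta : fitted constants and exponents
\begin{equation}
  \hat{\mathcal{L}}(X, D_f) \;=\; A \cdot \frac{1}{X^{\alpha}} \cdot \frac{1}{D_f^{\beta}} \;+\; E
\end{equation}
```

Under this form, the fitted exponents quantify how much an increase in the factor X buys relative to additional finetuning data, which is what makes the cross-factor comparisons in the findings below possible.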

Key Observations and Findings

The analysis brings forth several intriguing findings:

  • Scaling LLM model size benefits finetuning performance substantially more than scaling pretraining data, underlining the central role of model capacity (see the fitting sketch after this list).
  • For PET parameter scaling, neither longer soft prompts nor higher LoRA ranks deliver substantial gains, though LoRA exhibits better training stability than prompt tuning.
  • The study corroborates the task- and data-dependent nature of optimal finetuning method selection, arguing against a one-size-fits-all approach.
  • PET methods, particularly when finetuning data is scarce, show stronger zero-shot generalization to closely related tasks, a key consideration where model flexibility is paramount.
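To illustrate how the first finding could be quantified, the snippet below fits the joint law sketched earlier to a handful of (scaling factor, finetuning data size, loss) observations with a plain least-squares routine. The data values, initial guesses, and use of SciPy's curve_fit are illustrative assumptions, not the paper's actual fitting procedure or results.

```python
import numpy as np
from scipy.optimize import curve_fit

def joint_scaling_law(xd, log_A, alpha, beta, E):
    """Power-based multiplicative joint law: L = A / (X**alpha * D_f**beta) + E."""
    X, D_f = xd
    return np.exp(log_A) / (X**alpha * D_f**beta) + E

# Hypothetical observations: scaling factor X (here, model size), finetuning data size, eval loss.
X = np.array([1e9, 1e9, 4e9, 4e9, 16e9, 16e9])          # LLM parameter count
D_f = np.array([1e4, 1e5, 1e4, 1e5, 1e4, 1e5])          # finetuning examples
loss = np.array([2.10, 1.95, 1.90, 1.78, 1.74, 1.65])   # made-up losses for illustration

params, _ = curve_fit(joint_scaling_law, (X, D_f), loss, p0=[0.0, 0.1, 0.1, 1.0], maxfev=20000)
log_A, alpha, beta, E = params
print(f"alpha (X exponent) = {alpha:.3f}, beta (D_f exponent) = {beta:.3f}")

# Refitting with X set to pretraining tokens (or PET parameter count) instead of model size
# lets one compare fitted exponents across factors, which is the kind of comparison behind
# the "model scaling beats pretraining data scaling" observation.
```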

Future Trajectories and Theoretical Implications

This investigation opens several avenues for future research, notably extending these findings to multi-modal LLMs and understanding the impact of finetuning data quality. The proposed data-dependent joint scaling law enriches our theoretical understanding of finetuning dynamics in LLMs, laying the groundwork for more optimized, task-specific use of these models.

Concluding Remarks

The examination underscores the nuanced interplay between model size, data size, and finetuning method in improving LLM performance on downstream tasks. By dissecting these relationships, the study offers insights for navigating the complexities of LLM finetuning and for guiding future NLP research and application strategies.
