
SPDF: Sparse Pre-training and Dense Fine-tuning for Large Language Models

(2303.10464)
Published Mar 18, 2023 in cs.LG and cs.CL

Abstract

The pre-training and fine-tuning paradigm has contributed to a number of breakthroughs in NLP. Instead of directly training on a downstream task, language models are first pre-trained on large datasets with cross-domain knowledge (e.g., Pile, MassiveText) and then fine-tuned on task-specific data (e.g., natural language generation, text summarization). Scaling the model and dataset size has helped improve the performance of LLMs, but unfortunately, this also leads to highly prohibitive computational costs. Pre-training LLMs often requires orders of magnitude more FLOPs than fine-tuning, and the model capacity often remains the same between the two phases. To achieve training efficiency w.r.t. training FLOPs, we propose to decouple the model capacity between the two phases and introduce Sparse Pre-training and Dense Fine-tuning (SPDF). In this work, we show the benefits of using unstructured weight sparsity to train only a subset of weights during pre-training (Sparse Pre-training) and then recover the representational capacity by allowing the zeroed weights to learn (Dense Fine-tuning). We demonstrate that we can induce up to 75% sparsity into a 1.3B parameter GPT-3 XL model, resulting in a 2.5x reduction in pre-training FLOPs, without a significant loss in accuracy on the downstream tasks relative to the dense baseline. By rigorously evaluating multiple downstream tasks, we also establish a relationship between sparsity, task complexity, and dataset size. Our work presents a promising direction to train large GPT models at a fraction of the training FLOPs using weight sparsity, while retaining the benefits of pre-trained textual representations for downstream tasks.

Figure: SPDF framework, showing sparse pre-training of GPT models followed by dense fine-tuning for efficiency and performance.

Overview

  • The paper introduces Sparse Pre-training and Dense Fine-tuning (SPDF), a novel approach to reduce computational costs during the training of LLMs without significantly impacting their performance on downstream tasks.

  • By employing 75% unstructured weight sparsity during the pre-training phase, the method achieves a significant reduction in computational FLOPs, followed by dense fine-tuning to regain full representational capacity.

  • Experiments on models like GPT-3 XL demonstrate the method's robustness, retaining high accuracy with minimal performance trade-offs, despite substantial computational savings.

The paper "Sparse Pre-training and Dense Fine-tuning for LLMs" presents a novel approach to optimize the training efficiency of LLMs like GPT-3. The authors introduce a method termed Sparse Pre-training and Dense Fine-tuning (SPDF), which involves using unstructured weight sparsity during the pre-training phase to reduce computational costs, followed by dense fine-tuning to recover the model's representational capacity. This strategy aims to address the prohibitive computational costs associated with pre-training large-scale LLMs without significantly compromising the downstream task performance.

Key Contributions

The major contributions of this paper are:

  1. Introduction of SPDF Framework: The paper proposes decoupling the model capacity between pre-training and fine-tuning phases. By inducing up to 75% sparsity in a 1.3B parameter GPT-3 XL model during pre-training, they achieve a 2.5x reduction in training FLOPs.
  2. Experimental Validation: The authors evaluate the method on several downstream tasks, demonstrating that SPDF retains accuracy close to that of the corresponding dense baselines.
  3. Insight into Sparsity and Task Complexity: The study relates the sparsity level that can be tolerated during pre-training to the complexity and dataset size of the downstream task, indicating how SPDF can be applied across different model sizes and task difficulties.

Methodology

Sparse Pre-training

Sparse pre-training involves initializing a dense network and then inducing unstructured sparsity to reduce the number of active parameters. The objective is to maintain enough representational capacity during the pre-training phase to capture generalizable features while significantly reducing the computational overhead.
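
To make this concrete, below is a minimal PyTorch-style sketch of one way to impose a static unstructured sparsity mask on the linear layers of a Transformer before pre-training. The random per-layer mask and the gradient-hook mechanism are illustrative assumptions, not the authors' exact implementation.

```python
import torch
import torch.nn as nn

def apply_static_sparsity(model: nn.Module, sparsity: float = 0.75):
    """Zero out a fixed fraction of every nn.Linear weight and keep those
    entries at zero throughout pre-training by masking their gradients.
    The random mask used here is purely illustrative."""
    handles = []
    for module in model.modules():
        if isinstance(module, nn.Linear):
            # Keep roughly (1 - sparsity) of the weights; prune the rest at init.
            mask = (torch.rand_like(module.weight) > sparsity).float()
            module.weight.data.mul_(mask)
            # Block gradient flow to pruned weights so they stay zero.
            handles.append(module.weight.register_hook(lambda g, m=mask: g * m))
    return handles
```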

Dense Fine-tuning

During the fine-tuning stage, the zeroed weights from the sparse pre-training phase are allowed to adapt, thus transitioning to a dense weight matrix. This step aims to recover the full representational capacity of the model, enabling it to better perform on specific downstream tasks.
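
Continuing the hypothetical helper above, the transition to dense fine-tuning amounts to removing the mask-enforcing hooks so that every weight, including the previously zeroed ones, receives gradient updates on the downstream task.

```python
def to_dense_finetuning(handles):
    """Remove the sparsity-enforcing gradient hooks so previously pruned
    weights are free to learn, restoring a fully dense parameterization."""
    for h in handles:
        h.remove()

# Hypothetical end-to-end flow:
# handles = apply_static_sparsity(gpt_model, sparsity=0.75)  # sparse pre-training
# ... pre-train on the Pile ...
# to_dense_finetuning(handles)                               # dense fine-tuning
# ... fine-tune on E2E, WebNLG, DART, or Curation Corpus ...
```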

Experimental Setup and Results

The experiments were conducted using two models: GPT-2 Small (125M parameters) and GPT-3 XL (1.3B parameters). Models were pre-trained on The Pile dataset following Chinchilla's scaling law and were fine-tuned on various downstream tasks including natural language generation (E2E, WebNLG, and DART) and text summarization (Curation Corpus).
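
As a rough illustration of what a Chinchilla-style compute-optimal budget implies (about 20 training tokens per parameter), the back-of-envelope sketch below estimates the pre-training token counts for the two model sizes; the exact budgets used in the paper may differ.

```python
def chinchilla_tokens(n_params: float, tokens_per_param: float = 20.0) -> float:
    """Chinchilla-style rule of thumb: roughly 20 training tokens per parameter."""
    return n_params * tokens_per_param

print(f"GPT-2 Small (125M): ~{chinchilla_tokens(125e6) / 1e9:.1f}B tokens")
print(f"GPT-3 XL (1.3B):    ~{chinchilla_tokens(1.3e9) / 1e9:.1f}B tokens")
# -> roughly 2.5B and 26B tokens, respectively (illustrative estimate only)
```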

Performance on Downstream Tasks

The SPDF method held up well across the evaluated tasks:

  • At 75% sparsity, GPT-3 XL showed only a minimal drop in BLEU score on E2E, WebNLG, and DART, illustrating the robustness of the method.
  • For the more complex summarization task (Curation Corpus), high sparsity led to higher perplexity, indicating a performance trade-off at extreme sparsity levels.

FLOPs Reduction

The approach yielded significant FLOP reductions:

  • GPT-3 XL at 75% sparsity achieved approximately a 2.5x reduction in training FLOPs compared to the dense model (a rough sketch of this arithmetic follows below).
  • The reduction in FLOPs was more pronounced in larger models, indicating that SPDF's benefits scale with model size.
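
The sketch below is a back-of-envelope account of why 75% weight sparsity translates to roughly 2.5x (rather than 4x) fewer training FLOPs: only the weight-dependent matrix multiplications shrink with density, while operations such as the attention score and value products remain dense. The 80% weight-FLOP share used here is an assumed, illustrative split, not a figure from the paper.

```python
def training_flops_reduction(sparsity: float, weight_flop_fraction: float = 0.8) -> float:
    """Estimate the dense/sparse training-FLOP ratio under the assumption that
    only the weight-dependent matmul FLOPs (weight_flop_fraction of the total,
    an illustrative value) scale with weight density."""
    density = 1.0 - sparsity
    relative_flops = weight_flop_fraction * density + (1.0 - weight_flop_fraction)
    return 1.0 / relative_flops

print(f"~{training_flops_reduction(0.75):.1f}x fewer training FLOPs at 75% sparsity")
# -> ~2.5x, in line with the reduction reported for GPT-3 XL
```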

Implications and Future Directions

The introduction of SPDF provides practical and theoretical insights into efficient model training:

  • Practical Implications: The method offers a feasible way to curb the rising computational cost of pre-training large LLMs, fostering more sustainable AI development.

  • Theoretical Implications: It opens avenues for further research into balancing model sparsity and performance, particularly how tolerable sparsity scales with model size and task complexity, and how dynamic sparsity methods might further improve efficiency.

Conclusion

The paper successfully demonstrates that sparse pre-training followed by dense fine-tuning can effectively reduce the computational demands of training LLMs while maintaining performance. This work not only provides a scalable solution for training large models but also lays the groundwork for future explorations in sparsity techniques and hardware optimizations for LLMs. Future investigations may delve into dynamic sparsity methods and varying fine-tuning strategies to further enhance the efficiency and scalability of language model training.
