Why Do We Need Weight Decay in Modern Deep Learning? (2310.04415v2)

Published 6 Oct 2023 in cs.LG

Abstract: Weight decay is a broadly used technique for training state-of-the-art deep networks from image classification to LLMs. Despite its widespread usage and being extensively studied in the classical literature, its role remains poorly understood for deep learning. In this work, we highlight that the role of weight decay in modern deep learning is different from its regularization effect studied in classical learning theory. For deep networks on vision tasks trained with multipass SGD, we show how weight decay modifies the optimization dynamics enhancing the ever-present implicit regularization of SGD via the loss stabilization mechanism. In contrast, for LLMs trained with nearly one-epoch training, we describe how weight decay balances the bias-variance tradeoff in stochastic optimization leading to lower training loss and improved training stability. Overall, we present a unifying perspective from ResNets on vision tasks to LLMs: weight decay is never useful as an explicit regularizer but instead changes the training dynamics in a desirable way. The code is available at https://github.com/tml-epfl/why-weight-decay

Summary

The paper demonstrates that weight decay enhances implicit regularization in overparameterized networks via loss stabilization from SGD noise.
It reveals that for large language models, weight decay balances the bias-variance tradeoff resulting in lower training loss.
The study shows that weight decay prevents sudden loss divergences in bfloat16 mixed-precision training, ensuring model stability.

Overview of Weight Decay in Modern Deep Learning

The paper "Why Do We Need Weight Decay in Modern Deep Learning?" examines the role of weight decay in training contemporary neural networks. It challenges the classical view of weight decay as an explicit regularizer and argues instead that its main effect is to change the optimization dynamics, in both vision models and LLMs.

Main Contributions

The authors examine weight decay's function through a detailed empirical study and present a theoretical framework to explain its effects. The paper's main contributions are as follows:

Enhanced Implicit Regularization: For overparameterized networks, weight decay is shown to influence the optimization dynamics by enhancing the implicit regularization effect of the stochastic gradient descent (SGD) noise. This is achieved through a process termed "loss stabilization."
Role in LLMs: In contrast to the multi-pass vision setting, for LLMs trained for nearly one epoch, weight decay does not act as a traditional regularizer. Instead, it balances the bias-variance tradeoff in stochastic optimization, leading to a lower training loss.
Prevention of Divergences: The paper also highlights an unexpected benefit of weight decay: it prevents sudden loss divergences during bfloat16 mixed-precision training, which matters for scalable LLM training.
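
For reference, the summary never spells out the update that "weight decay" refers to. The following is a minimal sketch for plain SGD, not code from the paper's repository; the function names and numeric values are illustrative assumptions. It shows that adding an L2 penalty `(wd / 2) * ||w||^2` to the loss is the same step as shrinking the weights by a factor `1 - lr * wd` before the gradient step.

```python
import math


def sgd_step_l2(w, grad, lr, wd):
    """One SGD step on loss + (wd / 2) * ||w||^2: the penalty adds wd * w to the gradient."""
    return [wi - lr * (gi + wd * wi) for wi, gi in zip(w, grad)]


def sgd_step_shrink(w, grad, lr, wd):
    """The same step written as multiplicative shrinkage followed by a gradient step."""
    return [(1.0 - lr * wd) * wi - lr * gi for wi, gi in zip(w, grad)]


# Illustrative values only (assumptions, not taken from the paper).
w = [1.0, -2.0, 0.5]
grad = [0.1, 0.2, -0.3]
step_a = sgd_step_l2(w, grad, lr=0.1, wd=0.01)
step_b = sgd_step_shrink(w, grad, lr=0.1, wd=0.01)
print(step_a)
print(step_b)
```

The two forms coincide only for plain SGD. With adaptive optimizers such as Adam they differ, and the shrinkage form is the one usually called decoupled weight decay (AdamW).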

Empirical Analysis and Insights

The paper presents compelling empirical evidence supporting the view of weight decay as a tool that modifies training dynamics favorably:

Loss Stabilization Mechanism: In overparameterized networks, weight decay alters the effective learning rate, allowing the model to capitalize on the implicit noise-driven regularization of SGD. The paper provides evidence through experiments on VGG and ResNet models across CIFAR-10/100 datasets.
Optimization and Stability in LLMs: For LLMs, the paper reproduces empirical findings showing that weight decay leads to lower training loss, especially towards the end of the training period, when paired with decaying learning rates.
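
The bias-variance tradeoff mentioned above can be illustrated with a toy model. This is a sketch under stated assumptions, not the paper's experiment: SGD on a one-dimensional quadratic with additive gradient noise, where the expected squared error follows an exact recursion. The learning rate here stands in for the effective learning rate that weight decay is said to influence; the function name, schedules, and constants are hypothetical.

```python
def expected_sq_error(schedule, curvature=1.0, noise_var=1.0, init_sq=100.0):
    """Exact E[w_t^2] for SGD on L(w) = 0.5 * curvature * w^2 with zero-mean gradient noise."""
    sq = init_sq
    for lr in schedule:
        # Bias part contracts by (1 - lr * curvature)^2; noise adds lr^2 * noise_var.
        sq = (1.0 - lr * curvature) ** 2 * sq + lr ** 2 * noise_var
    return sq


steps = 200
small_constant = [0.01] * steps               # low variance, but bias shrinks slowly
large_constant = [0.5] * steps                # bias vanishes fast, variance floor is high
large_then_decay = [0.5] * 150 + [0.01] * 50  # fast bias reduction, then variance reduction

# Expected loss is 0.5 * curvature * E[w^2] with curvature = 1.
loss_small = 0.5 * expected_sq_error(small_constant)
loss_large = 0.5 * expected_sq_error(large_constant)
loss_decay = 0.5 * expected_sq_error(large_then_decay)
print(loss_small, loss_large, loss_decay)
```

In this toy setting, the schedule that runs at a large step size and then decays ends with the lowest expected loss, which matches the qualitative claim that the benefit shows up towards the end of training once the learning rate is decayed.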

Theoretical Implications

The authors propose a unifying theory to explain these observations:

Regularization via Hessian Trace: A central conjecture posits that weight decay modifies the optimization trajectory so that the dynamics of SGD closely track a process that regularizes the trace of the Hessian. The authors link this to better generalization performance.
Effective Learning Rate: The work suggests that weight decay alters the effective learning rates through controlling parameter norms, thereby implicitly adjusting the learning rate schedule, especially in LLMs.
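
The link between parameter norms and the effective learning rate can be sketched on a toy problem. The assumption here is a scale-invariant loss, `f(c * w) = f(w)` for any `c > 0`, as arises with normalization layers; in that case the gradient is orthogonal to `w` and scales as `1 / ||w||`, so the step taken on the direction `w / ||w||` behaves like a learning rate of `lr / ||w||^2`. The loss, noise model, and all constants below are illustrative assumptions, not the paper's setup.

```python
import math
import random


def norm(w):
    return math.sqrt(sum(x * x for x in w))


def scale_invariant_grad(w, target):
    """Gradient w.r.t. w of f(w) = 0.5 * ||w / ||w|| - target||^2."""
    n = norm(w)
    v = [x / n for x in w]
    dot = sum(vi * ti for vi, ti in zip(v, target))
    # Projection of (v - target) onto the space orthogonal to v, scaled by 1 / ||w||.
    return [(dot * vi - ti) / n for vi, ti in zip(v, target)]


def run(weight_decay, steps=1000, lr=0.1, seed=0):
    """SGD with noisy targets; returns the weight norm after every step."""
    rng = random.Random(seed)
    u = [0.0, 1.0, 0.0, 0.0, 0.0]
    w = [1.0, 0.0, 0.0, 0.0, 0.0]
    norms = [norm(w)]
    for _ in range(steps):
        target = [ui + rng.gauss(0.0, 1.0) for ui in u]  # stochastic "mini-batch"
        g = scale_invariant_grad(w, target)
        w = [(1.0 - lr * weight_decay) * wi - lr * gi for wi, gi in zip(w, g)]
        norms.append(norm(w))
    return norms


norms_no_wd = run(weight_decay=0.0)
norms_wd = run(weight_decay=0.1)
eff_lr_no_wd = 0.1 / norms_no_wd[-1] ** 2
eff_lr_wd = 0.1 / norms_wd[-1] ** 2
print(norms_no_wd[-1], norms_wd[-1], eff_lr_no_wd, eff_lr_wd)
```

Without weight decay, each step adds to the squared norm because the gradient is orthogonal to the weights, so the effective learning rate keeps shrinking. With weight decay, the shrinkage counteracts that growth and the effective learning rate stays larger, which is one way to read the statement that weight decay implicitly adjusts the learning rate schedule.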

Future Perspectives and Practical Takeaways

This paper opens several avenues for future research and practical improvements in AI:

AI Model Training: Treating weight decay as a mechanism that shapes optimization dynamics, rather than as a static regularizer, could inform hyperparameter tuning, in particular how weight decay and the learning rate are chosen jointly.
Wider Applications: The insights into how weight decay prevents divergences in mixed-precision training can be crucial for developing more robust large-scale AI models.
Refinement of LLM Training: The bias-variance tradeoff analysis presents opportunities for developing more efficient training protocols for LLMs, potentially informing new adaptive learning algorithms.

In conclusion, this paper reframes the traditional understanding of weight decay within the deep learning community, offering a nuanced perspective that aligns its usage with improved training dynamics and model stability.
