
How to set AdamW's weight decay as you scale model and dataset size (2405.13698v1)

Published 22 May 2024 in cs.LG and cs.AI

Abstract: We show that weights learned by AdamW can be understood as an exponential moving average (EMA) of recent updates. This gives critical insights for how to set the weight decay in AdamW, and how the weight decay should scale with model and dataset size. In particular, the key hyperparameter for an exponential moving average is the EMA timescale. Intuitively, the EMA timescale can be understood as the number of recent iterations the EMA averages over. Given a fixed learning rate, there is a one-to-one mapping from the EMA timescale to the usual weight decay hyperparameter. Thus, choosing an EMA timescale implicitly sets the weight decay. Importantly, there are natural guidelines for sensible values for the EMA timescale: we need to average over all datapoints, so the EMA timescale should not be (much) smaller than 1 epoch, and we need to forget early updates, so the EMA timescale should not be (much) bigger than the total number of training epochs. In our experiments, we find that optimal EMA timescales are consistent with these guidelines, as are the hyperparameters chosen in recent large-scale LLM pretraining runs (e.g., Llama 1+2 and Stable LM). Critically, these guidelines suggest that the optimal EMA timescale should not change (much) as we scale the model and dataset. That implies that as the dataset size increases, the optimal weight decay should fall. Moreover, as the model size increases, the optimal weight decay should also increase (if we follow the muP recommendation for scaling the learning rate).

References (27)
  1. Theoretical analysis of auto rate-tuning by batch normalization. In ICLR, 2019.
  2. How to scale your EMA. In NeurIPS, 2023.
  3. Symbolic discovery of optimization algorithms. In NeurIPS, 2023.
  4. A downsampled variant of ImageNet as an alternative to the CIFAR datasets. arXiv preprint arXiv:1707.08819, 2017.
  5. AutoAugment: Learning augmentation strategies from data. In CVPR, 2019.
  6. Scaling vision transformers to 22 billion parameters. In ICML, 2023.
  7. An image is worth 16x16 words: Transformers for image recognition at scale. In ICLR, 2021.
  8. Deep residual learning for image recognition. In CVPR, 2016.
  9. Highly scalable deep learning training system with mixed-precision: Training ImageNet in four minutes. arXiv preprint arXiv:1807.11205, 2018.
  10. Rotational equilibrium: How weight decay balances learning across neural networks. In NeurIPS 2023 Workshop on Mathematics of Modern Machine Learning, 2023.
  11. Krizhevsky, A. Learning multiple layers of features from tiny images. Technical report, 2009.
  12. An exponential learning rate schedule for deep learning. In ICLR, 2019.
  13. Reconciling modern deep learning with traditional optimization analyses: The intrinsic learning rate. In NeurIPS, 2020.
  14. Robust training of neural networks using scale invariant architectures. In ICML, 2022.
  15. Lingle, L. A large-scale exploration of µ-transfer. 2024.
  16. Sophia: A scalable stochastic second-order optimizer for language model pre-training. In ICLR, 2024.
  17. Decoupled weight decay regularization. In ICLR, 2018.
  18. When does label smoothing help? In NeurIPS, 2019.
  19. LLaMA: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971, 2023a.
  20. LLaMA 2: Open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288, 2023b.
  21. Technical report for StableLM-3B-4E1T. Technical Report, 2023.
  22. Van Laarhoven, T. L2 regularization versus batch and weight normalization. arXiv preprint arXiv:1706.05350, 2017.
  23. Spherical motion dynamics: Learning dynamics of normalized neural network using SGD and weight decay. 2021.
  24. Small-scale proxies for large-scale transformer training instabilities. In ICLR, 2024.
  25. Tensor programs v: Tuning large neural networks via zero-shot hyperparameter transfer. arXiv preprint arXiv:2203.03466, 2022.
  26. Three mechanisms of weight decay regularization. In ICLR, 2019.
  27. OPT: Open pre-trained transformer language models. arXiv preprint arXiv:2205.01068, 2022.
Authors (2)
  1. Xi Wang (275 papers)
  2. Laurence Aitchison (66 papers)
Citations (3)

Summary

  • The paper introduces an EMA timescale guideline linking weight decay and learning rate to provide a theoretical basis for tuning AdamW.
  • Empirical results reveal that optimal weight decay decreases with larger datasets and increases with model size, following μP scaling principles.
  • The study offers a practical framework for adjusting hyperparameters in large-scale training, potentially enhancing model performance and efficiency.

Understanding AdamW's Weight Decay Scaling with Model and Dataset Size

The paper, titled "How to set AdamW's weight decay as you scale model and dataset size," examines a nuanced aspect of neural network training: how to understand and set the weight decay hyperparameter of the AdamW optimizer. The investigation is anchored in the relationship between weight decay and scale, particularly in large-scale LLM pretraining scenarios.

AdamW, an optimizer widely adopted for its decoupled weight decay, can be examined through the lens of an exponential moving average (EMA): the weight decay controls how recent updates are averaged, much as in an EMA, which opens a principled route to setting this hyperparameter. The critical insight is that the customary weight decay hyperparameter, traditionally tuned by hand or by empirical sweeps, is linked to the learning rate via the EMA timescale $\tau_{\text{iter}} = 1/(\eta \lambda)$, where $\eta$ is the learning rate and $\lambda$ is the weight decay.
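To make the EMA reading concrete, the sketch below unrolls AdamW's decoupled-weight-decay recursion and checks that it matches a discounted (EMA-style) sum of past updates with timescale roughly $1/(\eta\lambda)$. This is an illustrative sketch, not the authors' code; the learning rate, weight decay, and synthetic scalar updates are arbitrary choices for the demonstration.

```python
# Illustrative sketch (not the paper's code): AdamW's decoupled update
#   theta_{t+1} = (1 - lr * wd) * theta_t - lr * u_t
# unrolls into a discounted (EMA-style) sum of past updates u_t with
# decay factor (1 - lr * wd), i.e. timescale tau_iter ~ 1 / (lr * wd).
import numpy as np

def adamw_weight_trajectory(updates, lr, wd, theta0=0.0):
    """Apply the decoupled-weight-decay recursion to a stream of updates."""
    theta = theta0
    for u in updates:
        theta = (1.0 - lr * wd) * theta - lr * u
    return theta

def discounted_sum_of_updates(updates, lr, wd, theta0=0.0):
    """Equivalent closed form: an exponentially weighted sum of past updates."""
    T = len(updates)
    decay = 1.0 - lr * wd
    theta = decay ** T * theta0
    for t, u in enumerate(updates):
        theta -= lr * decay ** (T - 1 - t) * u
    return theta

rng = np.random.default_rng(0)
updates = rng.normal(size=1000)          # stand-in for Adam's normalized updates
lr, wd = 1e-3, 0.1                       # hypothetical hyperparameters
print(1.0 / (lr * wd))                   # EMA timescale in iterations (~1e4)
print(np.isclose(adamw_weight_trajectory(updates, lr, wd),
                 discounted_sum_of_updates(updates, lr, wd)))  # True
```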

Key Findings

  1. EMA Timescale as a Guideline: The paper posits that choosing a sensible value for the EMA timescale $\tau_{\text{iter}}$ implicitly dictates the weight decay parameter. The practical guideline is that the timescale, measured in epochs, should not be (much) smaller than one epoch, so that the EMA averages over all datapoints, and not (much) larger than the total number of training epochs, so that early updates are eventually forgotten.
  2. Scalability Across Models and Datasets: A salient implication of the EMA viewpoint is that the optimal timescale $\tau_{\text{epoch}}$ remains relatively unchanged as the scale of models and datasets varies. Therefore, as the dataset grows, the optimal weight decay should decrease; conversely, as model size grows, following the $\mu$P recommendation for scaling the learning rate implies an increase in the optimal weight decay (see the sketch after this list).
  3. Validation Through Experiments: Empirical tests across model architectures such as ResNet and ViT support the hypothesis, in particular the robustness of the optimal EMA timescale guideline across varied hyperparameter settings. Furthermore, comparisons with the configurations used in large-scale pretraining runs (e.g., the Llama series and Stable LM) underscore the applicability of the EMA-derived scaling principles.
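As a rough illustration of point 2, the helper below converts a chosen timescale in epochs into AdamW's weight decay via $\lambda = 1/(\eta\,\tau_{\text{iter}})$ with $\tau_{\text{iter}} = \tau_{\text{epoch}} \cdot N/B$ for dataset size $N$ and batch size $B$. The function name, hyperparameter values, and the specific µP width-scaling factor are hypothetical choices for the example, not values taken from the paper.

```python
# Hypothetical helper (not from the paper): fix the EMA timescale in epochs
# and derive the implied AdamW weight decay from the learning rate,
# dataset size, and batch size.
def weight_decay_from_timescale(tau_epochs, lr, dataset_size, batch_size):
    iters_per_epoch = dataset_size / batch_size
    tau_iters = tau_epochs * iters_per_epoch   # timescale in iterations
    return 1.0 / (lr * tau_iters)              # lambda = 1 / (lr * tau_iters)

# Keeping tau_epochs fixed while the dataset grows 10x shrinks the implied
# weight decay by 10x.
base = weight_decay_from_timescale(tau_epochs=5, lr=3e-4,
                                   dataset_size=1_000_000, batch_size=512)
more_data = weight_decay_from_timescale(tau_epochs=5, lr=3e-4,
                                        dataset_size=10_000_000, batch_size=512)
print(base, more_data)       # ~0.34 vs ~0.034

# Under muP the hidden-layer learning rate shrinks roughly in proportion to
# width, so at fixed tau_epochs the implied weight decay grows with width.
mup_lr = 3e-4 * (1024 / 4096)            # illustrative 4x wider model
wider = weight_decay_from_timescale(tau_epochs=5, lr=mup_lr,
                                    dataset_size=1_000_000, batch_size=512)
print(wider)                 # ~4x larger than `base`
```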

Theoretical and Practical Implications

Theoretically, the paper moves the understanding of AdamW's weight decay beyond purely empirical adjustment by embedding it in a grounded framework tied to EMA processes. This could fuel further work on other hyperparameters that admit an exponential-averaging interpretation. Practically, the empirical insights may prompt developers to adjust weight decay systematically as they scale, potentially improving the efficiency and performance of large-scale training runs.

Looking forward, further investigations may venture into leveraging this EMA perspective for other optimization algorithms that incorporate decoupled weight decay mechanisms. Additionally, the exploration and validation of the resulting insights in more diverse model architectures and across a spectrum of tasks can strengthen and refine the practical guidelines offered.

In conclusion, the paper raises important considerations on how model scaling interacts with hyperparameter optimization, particularly weight decay, and offers a structured, theoretically and empirically grounded approach to scaling this hyperparameter. Such insights could prove beneficial for refining current training practice, making better use of compute, and improving model performance.