Emergent Mind

Abstract

Recently, there has been growing evidence that if the width and depth of a neural network are scaled toward the so-called rich feature learning limit ($\mu$P and its depth extension), then some hyperparameters, such as the learning rate, exhibit transfer from small to very large models, thus reducing the cost of hyperparameter tuning. From an optimization perspective, this phenomenon is puzzling, as it implies that the loss landscape is remarkably consistent across very different model sizes. In this work, we find empirical evidence that learning rate transfer can be attributed to the fact that under $\mu$P and its depth extension, the largest eigenvalue of the training loss Hessian (i.e., the sharpness) is largely independent of the width and depth of the network for a sustained period of training time. On the other hand, we show that under the neural tangent kernel (NTK) regime, the sharpness exhibits very different dynamics at different scales, thus preventing learning rate transfer. But what causes these differences in the sharpness dynamics? Through a connection between the spectra of the Hessian and the NTK matrix, we argue that the cause lies in the presence (for $\mu$P) or progressive absence (for the NTK regime) of feature learning, which results in a different evolution of the NTK, and thus of the sharpness. We corroborate our claims with a substantial suite of experiments, covering a wide range of datasets and architectures: from ResNets and Vision Transformers trained on benchmark vision datasets to Transformer-based language models trained on WikiText.

Figure: learning rate transfer under $\mu$P, contrasted with the width-dependent sharpness dynamics of the neural tangent parameterization (NTP) during training.

Overview

  • The paper investigates the transferability of learning rates across differently scaled neural networks through the lens of sharpness dynamics: the behavior of the largest eigenvalue of the training loss Hessian.

  • It contrasts sharpness behavior under $\mu$P scaling and NTK scaling, finding that sharpness is stable under the former, which enables learning rate transfer, and width-dependent under the latter.

  • It highlights the role of feature learning in maintaining stable sharpness dynamics and thus in facilitating the transfer of learning rates.

  • It discusses the implications of these findings, stressing the practical benefit of reduced computational cost for hyperparameter tuning and highlighting future research directions in deep learning optimization and scaling.

Understanding Hyperparameter Transfer in Deep Learning through the Lens of Sharpness Dynamics

Overview of Key Findings

Recent research has illuminated an intriguing phenomenon in the domain of deep learning: the transfer of learning rates across models of varying sizes without necessitating re-tuning. This study, titled "Why do Learning Rates Transfer? Reconciling Optimization and Scaling Limits for Deep Learning," explores the empirical evidence and theoretical grounds for why and how the learning rate, as a hyperparameter, transfers when neural networks are scaled in both width and depth.

Empirical Evidence and Theoretical Analysis

The research provides a comprehensive analysis of the behavior of the largest eigenvalue of the training loss Hessian, termed the sharpness, across networks of different scales. The focal point of the study is how, under certain scaling limits, specifically $\mu$P and its depth extension, the sharpness exhibits a surprisingly stable pattern, largely unaffected by changes in network dimensions through the early and mid phases of training. Conversely, under neural tangent kernel (NTK) scaling, the sharpness dynamics deviate significantly, showing a clear width dependence that inhibits learning rate transfer.
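Concretely, the sharpness can be estimated without ever materializing the Hessian, via power iteration on Hessian-vector products. The sketch below is a minimal illustration on a toy quadratic loss, not the paper's experimental setup; in a deep learning framework the `hvp` callback would be supplied by automatic differentiation rather than an explicit matrix.

```python
import numpy as np

def sharpness(hvp, dim, iters=100, seed=0):
    """Estimate the largest Hessian eigenvalue (the sharpness) by power
    iteration, using only a Hessian-vector product callback `hvp(v)`."""
    rng = np.random.default_rng(seed)
    v = rng.standard_normal(dim)
    v /= np.linalg.norm(v)
    lam = 0.0
    for _ in range(iters):
        hv = hvp(v)
        lam = float(v @ hv)        # Rayleigh quotient estimate
        v = hv / np.linalg.norm(hv)
    return lam

# Toy check: quadratic loss L(w) = 0.5 * w @ H @ w with a known Hessian.
H = np.diag([3.0, 1.0, 0.5])
lam_max = sharpness(lambda v: H @ v, dim=3)
# lam_max converges to 3.0, the largest eigenvalue of H
```

The same loop works at any scale because each step touches only one Hessian-vector product, which costs about two backward passes for a neural network loss.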

Critical Observations

  • Width/Depth-Independent Sharpness: The study shows that under $\mu$P scaling, after an initial phase of progressive sharpening, the dynamics of the largest Hessian eigenvalue stabilize at a value that is independent of the network's width and depth. This stability correlates with the phenomenon of learning rate transfer.
  • Comparison with NTK Regime: A contrasting behavior is observed under the NTK regime, where sharpness decreases with increasing width, and consequently learning rate transfer is not observed.
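The contrast above can be made concrete in a deliberately simplified two-layer linear model (a sketch, not the paper's setup): under the NTK parameterization, the first-layer features barely move after one SGD step as width grows, while under $\mu$P, here realized by rescaling the output by $1/n$ and scaling the input-layer learning rate up by $n$ so that feature updates stay $\Theta(1)$, the feature update is width-independent.

```python
import numpy as np

def feature_shift(width, out_scale, lr_W, seed=0):
    """RMS change of the hidden features h = W * x after one SGD step
    on the first layer of f(x) = a @ h / out_scale (scalar input x)."""
    rng = np.random.default_rng(seed)
    W = rng.standard_normal(width)     # first-layer weights
    a = rng.standard_normal(width)     # second-layer weights
    x, y = 1.0, 1.0                    # a single training example
    h = W * x
    err = a @ h / out_scale - y        # dL/df for L = 0.5 * (f - y)**2
    grad_W = err * a * x / out_scale   # dL/dW
    dh = -lr_W * grad_W * x            # resulting change of the features
    return float(np.sqrt(np.mean(dh ** 2)))

for n in [1024, 16384, 262144]:
    ntp = feature_shift(n, out_scale=np.sqrt(n), lr_W=1.0)     # NTK parameterization
    mup = feature_shift(n, out_scale=float(n), lr_W=float(n))  # muP-style scaling
    print(f"width={n:>6}: NTP shift {ntp:.4f}, muP shift {mup:.4f}")
```

As width grows, the NTP column shrinks toward zero (the features freeze, approaching the kernel regime), while the $\mu$P column stays of order one: the "presence or progressive absence of feature learning" that the paper identifies as the mechanism behind the sharpness dynamics.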

Underlying Mechanisms

The paper argues that the disparity in sharpness dynamics between the $\mu$P and NTK parameterizations can be traced back to the presence or absence of feature learning. Under $\mu$P scaling, feature learning drives a progressive sharpening that reaches a width- and depth-independent threshold, facilitating learning rate transfer. The analysis draws upon the spectra of the Hessian and the NTK matrix, reinforcing the connection between feature learning's role in shaping the curvature of the loss landscape and hyperparameter transferability.
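The spectral connection between the two matrices is elementary for the Gauss-Newton part of an MSE Hessian: writing $J$ for the Jacobian of the network outputs with respect to the parameters, the Gauss-Newton term $J^\top J / N$ and the empirical NTK $J J^\top / N$ share their nonzero eigenvalues, so whenever the residual term of the Hessian is small, the sharpness tracks the top NTK eigenvalue. A minimal numerical check with a random stand-in Jacobian (an illustration, not the paper's computation):

```python
import numpy as np

rng = np.random.default_rng(1)
N, P = 8, 50                       # N data points, P parameters
J = rng.standard_normal((N, P))    # stand-in for the output-parameter Jacobian

G = J.T @ J / N                    # Gauss-Newton term of the MSE Hessian (P x P)
K = J @ J.T / N                    # empirical NTK matrix (N x N)

top_G = np.linalg.eigvalsh(G)[-1]  # largest Gauss-Newton eigenvalue
top_K = np.linalg.eigvalsh(K)[-1]  # largest NTK eigenvalue
print(top_G, top_K)                # agree up to numerical precision
```

This identity is why a different evolution of the NTK under the two parameterizations translates directly into a different evolution of the sharpness.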

Implications and Future Directions

The findings from this research have both practical and theoretical implications for the field of deep learning. Practically, understanding the conditions under which hyperparameters like the learning rate can be transferred without re-tuning can significantly reduce the computational resources required for training large-scale models. On a theoretical level, this work bridges existing gaps between optimization theory and the empirical phenomena observed in scaling neural networks.

Looking ahead, this study lays the groundwork for more detailed explorations of the relationship between scaling limits and optimization dynamics in deep learning. It points to feature learning as a potentially critical factor in the transferability not just of learning rates but of other hyperparameters as well. Further research in this direction could illuminate more aspects of deep learning theory, contributing to more efficient neural network training.

Conclusion

This research represents a significant step towards deciphering the complexities involved in scaling deep learning models. By explaining why learning rates can be transferred across models of varying sizes, rooted in the dynamics of sharpness under specific scaling regimes, it offers valuable insights that unite theoretical findings with empirical evidence. As deep learning continues to evolve, such studies will be crucial in guiding the efficient scaling and optimization of neural network models.
