Emergent Mind

Abstract

Recently, there has been growing evidence that if the width and depth of a neural network are scaled toward the so-called rich feature learning limit ($\mu$P and its depth extension), then some hyperparameters, such as the learning rate, exhibit transfer from small to very large models, thus reducing the cost of hyperparameter tuning. From an optimization perspective, this phenomenon is puzzling, as it implies that the loss landscape is remarkably consistent across very different model sizes. In this work, we find empirical evidence that learning rate transfer can be attributed to the fact that under $\mu$P and its depth extension, the largest eigenvalue of the training loss Hessian (i.e., the sharpness) is largely independent of the width and depth of the network for a sustained period of training time. On the other hand, we show that under the neural tangent kernel (NTK) regime, the sharpness exhibits very different dynamics at different scales, thus preventing learning rate transfer. But what causes these differences in the sharpness dynamics? Through a connection between the spectra of the Hessian and the NTK matrix, we argue that the cause lies in the presence (for $\mu$P) or progressive absence (for the NTK regime) of feature learning, which results in a different evolution of the NTK, and thus of the sharpness. We corroborate our claims with a substantial suite of experiments, covering a wide range of datasets and architectures: from ResNets and Vision Transformers trained on benchmark vision datasets to Transformer-based language models trained on WikiText.

Figure: learning rate transfer under $\mu$P, contrasted with the width-dependent sharpness dynamics of the neural tangent parameterization (NTP) during training.

Overview

  • The paper investigates the transferability of learning rates across differently scaled neural networks through the lens of sharpness dynamics: the behavior of the largest eigenvalue of the training loss Hessian.

  • It contrasts sharpness behavior under $\mu$P scaling and NTK scaling, finding that sharpness is stable under the former, which enables learning rate transfer, and width-dependent under the latter.

  • It highlights the role of feature learning in maintaining stable sharpness dynamics and thus in facilitating the transfer of learning rates.

  • It discusses the implications of these findings, stressing the practical benefit of reduced computational cost for hyperparameter tuning and highlighting future research directions in deep learning optimization and scaling.

Understanding Hyperparameter Transfer in Deep Learning through the Lens of Sharpness Dynamics

Overview of Key Findings

Recent research has illuminated an intriguing phenomenon in the domain of deep learning: the transfer of learning rates across models of varying sizes without necessitating re-tuning. This study, titled "Why do Learning Rates Transfer? Reconciling Optimization and Scaling Limits for Deep Learning," explores the empirical evidence and theoretical grounds for why and how the learning rate, as a hyperparameter, transfers when neural networks are scaled in both width and depth.

Empirical Evidence and Theoretical Analysis

The research provides a comprehensive analysis of the behavior of the largest eigenvalue of the training loss Hessian, termed the sharpness, across networks of different scales. The focal point of the study is how, under certain scaling limits, specifically $\mu$P and its depth extension, the sharpness exhibits a surprisingly stable pattern, largely unaffected by changes in network dimensions through the early and mid phases of training. Conversely, under neural tangent kernel (NTK) scaling, the sharpness dynamics deviate significantly, showing a clear width dependence that inhibits learning rate transfer.
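Concretely, the sharpness can be estimated without ever materializing the Hessian, via power iteration on Hessian-vector products. The sketch below is a minimal illustration on a toy quadratic loss, not the paper's experimental setup; in a deep learning framework the `hvp` callback would be supplied by automatic differentiation rather than an explicit matrix.

```python
import numpy as np

def sharpness(hvp, dim, iters=100, seed=0):
    """Estimate the largest Hessian eigenvalue (the sharpness) by power
    iteration, using only a Hessian-vector product callback `hvp(v)`."""
    rng = np.random.default_rng(seed)
    v = rng.standard_normal(dim)
    v /= np.linalg.norm(v)
    lam = 0.0
    for _ in range(iters):
        hv = hvp(v)
        lam = float(v @ hv)        # Rayleigh quotient estimate
        v = hv / np.linalg.norm(hv)
    return lam

# Toy check: quadratic loss L(w) = 0.5 * w @ H @ w with a known Hessian.
H = np.diag([3.0, 1.0, 0.5])
lam_max = sharpness(lambda v: H @ v, dim=3)
# lam_max converges to 3.0, the largest eigenvalue of H
```

The same loop works at any scale because each step touches only one Hessian-vector product, which costs about two backward passes for a neural network loss.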

Critical Observations

  • Width/Depth-Independent Sharpness: The study shows that under $\mu$P scaling, after an initial phase of progressive sharpening, the dynamics of the largest Hessian eigenvalue stabilize at a value that is independent of the network's width and depth. This stability correlates with the phenomenon of learning rate transfer.
  • Comparison with NTK Regime: A contrasting behavior is observed under the NTK regime, where sharpness decreases with increasing width, and consequently learning rate transfer is not observed.
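The contrast above can be made concrete in a deliberately simplified two-layer linear model (a sketch, not the paper's setup): under the NTK parameterization, the first-layer features barely move after one SGD step as width grows, while under $\mu$P, here realized by rescaling the output by $1/n$ and scaling the input-layer learning rate up by $n$ so that feature updates stay $\Theta(1)$, the feature update is width-independent.

```python
import numpy as np

def feature_shift(width, out_scale, lr_W, seed=0):
    """RMS change of the hidden features h = W * x after one SGD step
    on the first layer of f(x) = a @ h / out_scale (scalar input x)."""
    rng = np.random.default_rng(seed)
    W = rng.standard_normal(width)     # first-layer weights
    a = rng.standard_normal(width)     # second-layer weights
    x, y = 1.0, 1.0                    # a single training example
    h = W * x
    err = a @ h / out_scale - y        # dL/df for L = 0.5 * (f - y)**2
    grad_W = err * a * x / out_scale   # dL/dW
    dh = -lr_W * grad_W * x            # resulting change of the features
    return float(np.sqrt(np.mean(dh ** 2)))

for n in [1024, 16384, 262144]:
    ntp = feature_shift(n, out_scale=np.sqrt(n), lr_W=1.0)     # NTK parameterization
    mup = feature_shift(n, out_scale=float(n), lr_W=float(n))  # muP-style scaling
    print(f"width={n:>6}: NTP shift {ntp:.4f}, muP shift {mup:.4f}")
```

As width grows, the NTP column shrinks toward zero (the features freeze, approaching the kernel regime), while the $\mu$P column stays of order one: the "presence or progressive absence of feature learning" that the paper identifies as the mechanism behind the sharpness dynamics.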

Underlying Mechanisms

The paper argues that the disparity in sharpness dynamics between the $\mu$P and NTK parameterizations can be traced back to the presence or absence of feature learning. Under $\mu$P scaling, feature learning drives a progressive sharpening that reaches a width- and depth-independent threshold, facilitating learning rate transfer. The analysis draws upon the spectra of the Hessian and the NTK matrix, reinforcing the connection between feature learning's role in shaping the curvature of the loss landscape and hyperparameter transferability.
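The spectral connection between the two matrices is elementary for the Gauss-Newton part of an MSE Hessian: writing $J$ for the Jacobian of the network outputs with respect to the parameters, the Gauss-Newton term $J^\top J / N$ and the empirical NTK $J J^\top / N$ share their nonzero eigenvalues, so whenever the residual term of the Hessian is small, the sharpness tracks the top NTK eigenvalue. A minimal numerical check with a random stand-in Jacobian (an illustration, not the paper's computation):

```python
import numpy as np

rng = np.random.default_rng(1)
N, P = 8, 50                       # N data points, P parameters
J = rng.standard_normal((N, P))    # stand-in for the output-parameter Jacobian

G = J.T @ J / N                    # Gauss-Newton term of the MSE Hessian (P x P)
K = J @ J.T / N                    # empirical NTK matrix (N x N)

top_G = np.linalg.eigvalsh(G)[-1]  # largest Gauss-Newton eigenvalue
top_K = np.linalg.eigvalsh(K)[-1]  # largest NTK eigenvalue
print(top_G, top_K)                # agree up to numerical precision
```

This identity is why a different evolution of the NTK under the two parameterizations translates directly into a different evolution of the sharpness.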

Implications and Future Directions

The findings from this research have both practical and theoretical implications for the field of deep learning. Practically, understanding the conditions under which hyperparameters like the learning rate can be transferred without re-tuning can significantly reduce the computational resources required for training large-scale models. On a theoretical level, this work bridges existing gaps between optimization theory and the empirical phenomena observed in scaling neural networks.

Looking ahead, this study lays the groundwork for more detailed explorations of the relationship between scaling limits and optimization dynamics in deep learning. It points to feature learning as a potentially critical factor in the transferability not just of learning rates but of other hyperparameters as well. Further research in this direction could illuminate more aspects of deep learning theory, contributing to more efficient neural network training.

Conclusion

This research represents a significant step towards deciphering the complexities involved in scaling deep learning models. By explaining why learning rates can be transferred across models of varying sizes, rooted in the dynamics of sharpness under specific scaling regimes, it offers valuable insights that unite theoretical findings with empirical evidence. As deep learning continues to evolve, such studies will be crucial in guiding the efficient scaling and optimization of neural network models.
