A Dynamical Model of Neural Scaling Laws (2402.01092v4)
Abstract: On a variety of tasks, the performance of neural networks predictably improves with training time, dataset size, and model size across many orders of magnitude. This phenomenon is known as a neural scaling law. Of fundamental importance is the compute-optimal scaling law, which reports performance as a function of units of compute when model sizes are chosen optimally. We analyze a random feature model trained with gradient descent as a solvable model of network training and generalization. This reproduces many observations about neural scaling laws. First, our model predicts why the scaling of performance with training time and the scaling with model size have different power-law exponents. Consequently, the theory predicts an asymmetric compute-optimal scaling rule in which the number of training steps is increased faster than the number of model parameters, consistent with recent empirical observations. Second, it has been observed that early in training, networks converge to their infinite-width dynamics at a rate $1/\textit{width}$, but at late times exhibit a rate $\textit{width}^{-c}$, where $c$ depends on the structure of the architecture and task. We show that our model exhibits this behavior. Lastly, our theory shows how the gap between training and test loss can gradually build up over time due to repeated reuse of data.
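The setup the abstract refers to can be sketched in a few lines of NumPy: inputs with a power-law spectrum are passed through a fixed random projection, and only the linear readout is trained by full-batch gradient descent. The dimensions, spectrum exponent, and step size below are illustrative assumptions rather than the paper's exact parameterization; the point is only to make concrete the two resources whose scaling the abstract contrasts, the number of random features (model size) and the number of training steps.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative sizes (assumptions for this sketch, not the paper's exact setup):
# D = ambient input dimension, N = number of random features (model size),
# P = training-set size, T = number of full-batch gradient-descent steps.
D, N, P, T = 512, 64, 256, 3000
lr = 1.0

# Synthetic target with a power-law input spectrum, a common stand-in for
# "task structure" in scaling-law analyses.
eigs = np.arange(1, D + 1, dtype=float) ** -1.5
w_star = rng.normal(size=D) * np.sqrt(eigs)
X_train = rng.normal(size=(P, D)) * np.sqrt(eigs)
X_test = rng.normal(size=(4096, D)) * np.sqrt(eigs)
y_train, y_test = X_train @ w_star, X_test @ w_star

# Random feature model: a fixed random projection A followed by a trainable
# linear readout theta, fit by full-batch gradient descent on the MSE.
A = rng.normal(size=(D, N)) / np.sqrt(D)
phi_train, phi_test = X_train @ A, X_test @ A
theta = np.zeros(N)

for t in range(T):
    resid = phi_train @ theta - y_train        # residual on the training set
    theta -= lr * phi_train.T @ resid / P      # gradient step on the readout
    if (t + 1) % 500 == 0:
        train_loss = np.mean(resid ** 2)
        test_loss = np.mean((phi_test @ theta - y_test) ** 2)
        print(f"step {t + 1:5d}  train {train_loss:.4f}  test {test_loss:.4f}")
```

Sweeping N and T in this toy and plotting test loss against compute (proportional to N times T) is, in spirit, the kind of experiment that the compute-optimal analysis in the paper formalizes; the growing gap between the printed train and test losses likewise illustrates the data-reuse effect mentioned in the last sentence of the abstract.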