Sophia: A Scalable Stochastic Second-order Optimizer for Language Model Pre-training (2305.14342v4)

Published 23 May 2023 in cs.LG, cs.CL, and math.OC

Abstract: Given the massive cost of LLM pre-training, a non-trivial improvement of the optimization algorithm would lead to a material reduction on the time and cost of training. Adam and its variants have been state-of-the-art for years, and more sophisticated second-order (Hessian-based) optimizers often incur too much per-step overhead. In this paper, we propose Sophia, Second-order Clipped Stochastic Optimization, a simple scalable second-order optimizer that uses a light-weight estimate of the diagonal Hessian as the pre-conditioner. The update is the moving average of the gradients divided by the moving average of the estimated Hessian, followed by element-wise clipping. The clipping controls the worst-case update size and tames the negative impact of non-convexity and rapid change of Hessian along the trajectory. Sophia only estimates the diagonal Hessian every handful of iterations, which has negligible average per-step time and memory overhead. On language modeling with GPT models of sizes ranging from 125M to 1.5B, Sophia achieves a 2x speed-up compared to Adam in the number of steps, total compute, and wall-clock time, achieving the same perplexity with 50% fewer steps, less total compute, and reduced wall-clock time. Theoretically, we show that Sophia, in a much simplified setting, adapts to the heterogeneous curvatures in different parameter dimensions, and thus has a run-time bound that does not depend on the condition number of the loss.


Summary

  • The paper introduces Sophia, a scalable second-order optimizer that uses diagonal Hessian estimates to reduce computational costs in language model pre-training.
  • Sophia supports two lightweight curvature estimators, Hutchinson's estimator and the Gauss-Newton-Bartlett (GNB) estimator, either of which yields a diagonal Hessian approximation at negligible average per-step cost.
  • Experimental results demonstrate a 2x speedup over Adam, highlighting improved scalability and robustness across various model sizes.

Sophia: A Scalable Stochastic Second-order Optimizer for LLM Pre-training

This essay explores the mechanics and implications of "Sophia: A Scalable Stochastic Second-order Optimizer for LLM Pre-training" (2305.14342). The paper presents Sophia, a second-order optimization algorithm designed to enhance the efficiency of LLM pre-training. The optimizer leverages the advantages of second-order methods while maintaining computational efficiency akin to first-order methods.

Introduction

Sophia addresses the burgeoning challenge of high computational costs in LLM pre-training. Traditional optimizers like Adam have dominated the landscape due to their balance between computational demand and performance. However, the intrinsic limitations of first-order methods often lead to inefficiencies as model sizes and datasets grow. Sophia introduces a second-order approach that utilizes a lightweight estimate of the diagonal Hessian, thus offering a significant reduction in both the number of iterations and total computational resources required.

Methodology

Sophia employs an estimate of the diagonal Hessian as a pre-conditioner, refreshed only every few iterations to keep the average per-step overhead negligible. Dividing by the estimated curvature yields larger steps in flat directions and smaller steps in sharp directions, while element-wise clipping bounds the worst-case update size and guards against inaccurate or negative Hessian estimates. The update rule is:

$$\theta_{t+1} = \theta_t - \eta_t \cdot \mathrm{clip}\big(m_t / \max\{\gamma \cdot h_t,\ \epsilon\},\ 1\big)$$

where $\theta_t$ denotes the model parameters at step $t$, $m_t$ the exponential moving average of the gradients, $h_t$ the estimated Hessian diagonal, and $\gamma$, $\epsilon$ tuning parameters that prevent extreme updates. The clipping $\mathrm{clip}(z, \rho) = \max(\min(z, \rho), -\rho)$ is applied element-wise, so the magnitude of each coordinate's update is bounded by $\eta_t \rho$ (with $\rho = 1$ above).

Figure 1: Comparison of numbers of steps to reach the same validation loss. Across all model sizes, Sophia achieves significant speedup.
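As a concrete illustration, the update can be written in a few lines of PyTorch. This is a minimal sketch of the per-parameter step only, assuming the moving averages `m` and `h` are maintained elsewhere; the function name and the default values of `rho`, `gamma`, and `eps` are illustrative, not taken from the authors' implementation.

```python
import torch

def sophia_step(param, m, h, lr, rho=1.0, gamma=0.01, eps=1e-12):
    """One Sophia-style step for a single parameter tensor (sketch).

    m : exponential moving average of gradients for this parameter
    h : exponential moving average of the estimated Hessian diagonal
    The pre-conditioned update m / max(gamma * h, eps) is clipped
    element-wise to [-rho, rho] before being applied.
    """
    with torch.no_grad():
        denom = torch.clamp(gamma * h, min=eps)
        update = torch.clamp(m / denom, min=-rho, max=rho)
        param.add_(update, alpha=-lr)
```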

The algorithm supports two Hessian estimators: Hutchinson's estimator, which produces an unbiased estimate of the Hessian diagonal from a Hessian-vector product with a random sign vector, and the Gauss-Newton-Bartlett (GNB) estimator, which exploits the structure of the loss to estimate the diagonal of the Gauss-Newton matrix, a positive semi-definite approximation of the Hessian.
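Both estimators can be implemented with standard autodiff primitives. The sketch below assumes a classification-style loss with logits over a vocabulary; the function names (`hutchinson_diag`, `gnb_diag`) are illustrative, and details such as exponential averaging of the estimates are omitted.

```python
import torch
import torch.nn.functional as F

def hutchinson_diag(loss, params):
    """Hutchinson estimator: for u with i.i.d. Rademacher entries,
    E[u * (H u)] equals the Hessian diagonal. Returns one stochastic sample."""
    grads = torch.autograd.grad(loss, params, create_graph=True)
    us = [torch.randint_like(p, 2) * 2.0 - 1.0 for p in params]  # +/-1 entries
    # Hessian-vector product via the gradient of <grad, u> (Pearlmutter trick)
    hvps = torch.autograd.grad(sum((g * u).sum() for g, u in zip(grads, us)), params)
    return [u * hvp for u, hvp in zip(us, hvps)]

def gnb_diag(model, inputs, batch_size):
    """Gauss-Newton-Bartlett sketch: resample labels from the model's own
    predictive distribution and use the batch-size-scaled squared mini-batch
    gradient as an estimate of the Gauss-Newton diagonal."""
    logits = model(inputs)                                    # (B, num_classes)
    sampled = torch.distributions.Categorical(logits=logits.detach()).sample()
    loss = F.cross_entropy(logits, sampled)
    grads = torch.autograd.grad(loss, list(model.parameters()))
    return [batch_size * g * g for g in grads]
```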

Experimental Results

The paper demonstrates that Sophia achieves a 2x speed-up over Adam in the number of steps, total compute, and wall-clock time, maintaining performance across model sizes from 125M to 1.5B parameters.

Figure 2: Validation loss on OpenWebText. Compared to AdamW and Lion, Sophia achieves lower loss across all model sizes.

One key finding concerns scaling behavior: as model size increases, the efficiency gap between Sophia and first-order optimizers widens, indicating favorable scalability. Sophia is also reported to be robust to hyperparameter variations, a typical pain point in large-scale training.

Theoretical Insights

Sophia's adaptation to heterogeneous curvatures across parameter dimensions offers a path to faster training without the overhead typically associated with second-order methods. In a much simplified setting, the authors prove a runtime bound for Sophia that does not depend on the condition number of the loss, in contrast to standard bounds for gradient descent.

Practical Implications

Practically, Sophia can be integrated into existing training pipelines with minimal changes to the architecture or computational framework. It leverages auto-differentiation frameworks like PyTorch and JAX to efficiently compute Hessian-vector products and other necessary operations, making it accessible for a wide array of model configurations and training infrastructures.
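To make the periodic-refresh pattern concrete, the toy loop below reuses the `sophia_step` and `gnb_diag` sketches from above on a small linear model; the refresh interval `k`, the EMA coefficients, and all other hyperparameters are placeholder values, not the paper's configuration.

```python
import torch
import torch.nn.functional as F

# Assumes sophia_step and gnb_diag from the earlier sketches are in scope.
model = torch.nn.Linear(16, 4)               # toy stand-in for a language model
params = list(model.parameters())
m = [torch.zeros_like(p) for p in params]    # EMA of gradients
h = [torch.zeros_like(p) for p in params]    # EMA of Hessian-diagonal estimates
beta1, beta2, lr, k = 0.96, 0.99, 1e-3, 10   # illustrative values

for step in range(100):
    x = torch.randn(32, 16)
    y = torch.randint(0, 4, (32,))
    loss = F.cross_entropy(model(x), y)
    grads = torch.autograd.grad(loss, params)

    if step % k == 0:                        # infrequent curvature refresh
        h_hat = gnb_diag(model, x, batch_size=32)
        h = [beta2 * hi + (1 - beta2) * hh for hi, hh in zip(h, h_hat)]

    for p, g, mi, hi in zip(params, grads, m, h):
        mi.mul_(beta1).add_(g, alpha=1 - beta1)
        sophia_step(p, mi, hi, lr=lr)
```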

Conclusion

Sophia represents a noteworthy advancement for large-scale LLM training, marrying the precision of second-order methods with the efficiency required for practical deployment. Its design demonstrates that sophisticated optimization does not necessitate prohibitive computational costs, paving the way for more efficient training of increasingly larger models.

Sophia's contribution is significant for practitioners seeking to scale LLM training under realistic computational constraints. As the field progresses, such optimization strategies are likely to play a growing role in addressing challenges of AI scalability and efficiency.
