Sophia: A Scalable Stochastic Second-order Optimizer for Language Model Pre-training (2305.14342v4)

Published 23 May 2023 in cs.LG, cs.CL, and math.OC

Abstract: Given the massive cost of LLM pre-training, a non-trivial improvement of the optimization algorithm would lead to a material reduction on the time and cost of training. Adam and its variants have been state-of-the-art for years, and more sophisticated second-order (Hessian-based) optimizers often incur too much per-step overhead. In this paper, we propose Sophia, Second-order Clipped Stochastic Optimization, a simple scalable second-order optimizer that uses a light-weight estimate of the diagonal Hessian as the pre-conditioner. The update is the moving average of the gradients divided by the moving average of the estimated Hessian, followed by element-wise clipping. The clipping controls the worst-case update size and tames the negative impact of non-convexity and rapid change of Hessian along the trajectory. Sophia only estimates the diagonal Hessian every handful of iterations, which has negligible average per-step time and memory overhead. On language modeling with GPT models of sizes ranging from 125M to 1.5B, Sophia achieves a 2x speed-up compared to Adam in the number of steps, total compute, and wall-clock time, achieving the same perplexity with 50% fewer steps, less total compute, and reduced wall-clock time. Theoretically, we show that Sophia, in a much simplified setting, adapts to the heterogeneous curvatures in different parameter dimensions, and thus has a run-time bound that does not depend on the condition number of the loss.


Summary

  • The paper introduces Sophia, a scalable second-order optimizer that uses diagonal Hessian estimates to reduce computational costs in language model pre-training.
  • Sophia supports two lightweight curvature estimators, Hutchinson's estimator and the Gauss-Newton-Bartlett (GNB) estimator, either of which yields a diagonal Hessian approximation at negligible average per-step cost.
  • Experimental results demonstrate a 2x speedup over Adam, highlighting improved scalability and robustness across various model sizes.

Sophia: A Scalable Stochastic Second-order Optimizer for LLM Pre-training

This essay explores the mechanics and implications of "Sophia: A Scalable Stochastic Second-order Optimizer for LLM Pre-training" (2305.14342). The paper presents Sophia, a second-order optimization algorithm designed to enhance the efficiency of LLM pre-training. The optimizer leverages the advantages of second-order methods while maintaining computational efficiency akin to first-order methods.

Introduction

Sophia addresses the burgeoning challenge of high computational costs in LLM pre-training. Traditional optimizers like Adam have dominated the landscape due to their balance between computational demand and performance. However, the intrinsic limitations of first-order methods often lead to inefficiencies as model sizes and datasets grow. Sophia introduces a second-order approach that utilizes a lightweight estimate of the diagonal Hessian, thus offering a significant reduction in both the number of iterations and total computational resources required.

Methodology

Sophia employs an estimate of the diagonal Hessian as a pre-conditioner, refreshed only every few iterations to keep the average per-step overhead negligible. Dividing by the estimated curvature yields larger steps in flat directions and smaller steps in sharp directions, while element-wise clipping bounds the worst-case update size and guards against inaccurate or negative Hessian estimates. The update rule is:

$$\theta_{t+1} = \theta_t - \eta_t \cdot \mathrm{clip}\big(m_t / \max\{\gamma \cdot h_t,\ \epsilon\},\ 1\big)$$

where $\theta_t$ denotes the model parameters at step $t$, $m_t$ the exponential moving average of the gradients, $h_t$ the estimated Hessian diagonal, and $\gamma$, $\epsilon$ tuning parameters that prevent extreme updates. The clipping $\mathrm{clip}(z, \rho) = \max(\min(z, \rho), -\rho)$ is applied element-wise, so the magnitude of each coordinate's update is bounded by $\eta_t \rho$ (with $\rho = 1$ above).

Figure 1: Comparison of numbers of steps to reach the same validation loss. Across all model sizes, Sophia achieves significant speedup.
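As a concrete illustration, the update can be written in a few lines of PyTorch. This is a minimal sketch of the per-parameter step only, assuming the moving averages `m` and `h` are maintained elsewhere; the function name and the default values of `rho`, `gamma`, and `eps` are illustrative, not taken from the authors' implementation.

```python
import torch

def sophia_step(param, m, h, lr, rho=1.0, gamma=0.01, eps=1e-12):
    """One Sophia-style step for a single parameter tensor (sketch).

    m : exponential moving average of gradients for this parameter
    h : exponential moving average of the estimated Hessian diagonal
    The pre-conditioned update m / max(gamma * h, eps) is clipped
    element-wise to [-rho, rho] before being applied.
    """
    with torch.no_grad():
        denom = torch.clamp(gamma * h, min=eps)
        update = torch.clamp(m / denom, min=-rho, max=rho)
        param.add_(update, alpha=-lr)
```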

The algorithm supports two Hessian estimators: Hutchinson's estimator, which produces an unbiased estimate of the Hessian diagonal from a Hessian-vector product with a random sign vector, and the Gauss-Newton-Bartlett (GNB) estimator, which exploits the structure of the loss to estimate the diagonal of the Gauss-Newton matrix, a positive semi-definite approximation of the Hessian.
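Both estimators can be implemented with standard autodiff primitives. The sketch below assumes a classification-style loss with logits over a vocabulary; the function names (`hutchinson_diag`, `gnb_diag`) are illustrative, and details such as exponential averaging of the estimates are omitted.

```python
import torch
import torch.nn.functional as F

def hutchinson_diag(loss, params):
    """Hutchinson estimator: for u with i.i.d. Rademacher entries,
    E[u * (H u)] equals the Hessian diagonal. Returns one stochastic sample."""
    grads = torch.autograd.grad(loss, params, create_graph=True)
    us = [torch.randint_like(p, 2) * 2.0 - 1.0 for p in params]  # +/-1 entries
    # Hessian-vector product via the gradient of <grad, u> (Pearlmutter trick)
    hvps = torch.autograd.grad(sum((g * u).sum() for g, u in zip(grads, us)), params)
    return [u * hvp for u, hvp in zip(us, hvps)]

def gnb_diag(model, inputs, batch_size):
    """Gauss-Newton-Bartlett sketch: resample labels from the model's own
    predictive distribution and use the batch-size-scaled squared mini-batch
    gradient as an estimate of the Gauss-Newton diagonal."""
    logits = model(inputs)                                    # (B, num_classes)
    sampled = torch.distributions.Categorical(logits=logits.detach()).sample()
    loss = F.cross_entropy(logits, sampled)
    grads = torch.autograd.grad(loss, list(model.parameters()))
    return [batch_size * g * g for g in grads]
```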

Experimental Results

The paper demonstrates that Sophia achieves a 2x speed-up over Adam in the number of steps, total compute, and wall-clock time, maintaining performance across model sizes from 125M to 1.5B parameters.

Figure 2: Validation loss on OpenWebText. Compared to AdamW and Lion, Sophia achieves lower loss across all model sizes.

One key finding concerns scaling behavior: as model size increases, the efficiency gap between Sophia and first-order optimizers widens, indicating favorable scalability. Sophia is also reported to be robust to hyperparameter variations, a typical pain point in large-scale training.

Theoretical Insights

Sophia's adaptation to heterogeneous curvatures across parameter dimensions offers a path to faster training without the overhead typically associated with second-order methods. In a much simplified setting, the authors prove a runtime bound for Sophia that does not depend on the condition number of the loss, in contrast to standard bounds for gradient descent.

Practical Implications

Practically, Sophia can be integrated into existing training pipelines with minimal changes to the architecture or computational framework. It leverages auto-differentiation frameworks like PyTorch and JAX to efficiently compute Hessian-vector products and other necessary operations, making it accessible for a wide array of model configurations and training infrastructures.
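To make the periodic-refresh pattern concrete, the toy loop below reuses the `sophia_step` and `gnb_diag` sketches from above on a small linear model; the refresh interval `k`, the EMA coefficients, and all other hyperparameters are placeholder values, not the paper's configuration.

```python
import torch
import torch.nn.functional as F

# Assumes sophia_step and gnb_diag from the earlier sketches are in scope.
model = torch.nn.Linear(16, 4)               # toy stand-in for a language model
params = list(model.parameters())
m = [torch.zeros_like(p) for p in params]    # EMA of gradients
h = [torch.zeros_like(p) for p in params]    # EMA of Hessian-diagonal estimates
beta1, beta2, lr, k = 0.96, 0.99, 1e-3, 10   # illustrative values

for step in range(100):
    x = torch.randn(32, 16)
    y = torch.randint(0, 4, (32,))
    loss = F.cross_entropy(model(x), y)
    grads = torch.autograd.grad(loss, params)

    if step % k == 0:                        # infrequent curvature refresh
        h_hat = gnb_diag(model, x, batch_size=32)
        h = [beta2 * hi + (1 - beta2) * hh for hi, hh in zip(h, h_hat)]

    for p, g, mi, hi in zip(params, grads, m, h):
        mi.mul_(beta1).add_(g, alpha=1 - beta1)
        sophia_step(p, mi, hi, lr=lr)
```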

Conclusion

Sophia represents a noteworthy advancement for large-scale LLM training, marrying the precision of second-order methods with the efficiency required for practical deployment. Its design demonstrates that sophisticated optimization does not necessitate prohibitive computational costs, paving the way for more efficient training of increasingly larger models.

Sophia's contribution is significant for practitioners seeking to scale LLM training under realistic computational constraints. As the field progresses, such optimization strategies are likely to play a growing role in addressing challenges of AI scalability and efficiency.
