Stochastic Gradient Descent as Approximate Bayesian Inference (1704.04289v2)

Published 13 Apr 2017 in stat.ML and cs.LG

Abstract: Stochastic Gradient Descent with a constant learning rate (constant SGD) simulates a Markov chain with a stationary distribution. With this perspective, we derive several new results. (1) We show that constant SGD can be used as an approximate Bayesian posterior inference algorithm. Specifically, we show how to adjust the tuning parameters of constant SGD to best match the stationary distribution to a posterior, minimizing the Kullback-Leibler divergence between these two distributions. (2) We demonstrate that constant SGD gives rise to a new variational EM algorithm that optimizes hyperparameters in complex probabilistic models. (3) We also propose SGD with momentum for sampling and show how to adjust the damping coefficient accordingly. (4) We analyze MCMC algorithms. For Langevin Dynamics and Stochastic Gradient Fisher Scoring, we quantify the approximation errors due to finite learning rates. Finally (5), we use the stochastic process perspective to give a short proof of why Polyak averaging is optimal. Based on this idea, we propose a scalable approximate MCMC algorithm, the Averaged Stochastic Gradient Sampler.

Citations (571)

Summary

  • The paper demonstrates that constant SGD can approximate Bayesian inference by tuning parameters to align its stationary distribution with the true posterior.
  • It introduces a variational EM algorithm that optimizes hyperparameters in complex probabilistic models, and extends the sampling view to SGD with momentum by adjusting the damping coefficient.
  • The study proves the optimality of Polyak averaging and extends the analysis to stochastic-gradient MCMC methods, offering practical guidance for scalable Bayesian inference.

Stochastic Gradient Descent as Approximate Bayesian Inference

This paper investigates the connection between Stochastic Gradient Descent (SGD) with a constant learning rate and approximate Bayesian inference. By viewing the iterates of constant SGD as a Markov chain with a stationary distribution, the authors develop analytical results that position SGD as a tool for scalable approximate inference. The main contributions are summarized as follows:

Key Contributions

  1. Constant SGD as Approximate Inference:
    • The authors demonstrate that constant SGD can serve as an approximate Bayesian inference algorithm: by tuning the learning rate and preconditioner, the stationary distribution of SGD can be made to closely match the posterior, minimizing the KL divergence between the two (see the sketch after this list).
  2. Variational EM Algorithm:
    • The authors present a novel variational Expectation-Maximization (EM) algorithm derived from constant SGD. This algorithm optimizes hyperparameters efficiently within probabilistic models.
  3. SGD with Momentum:
    • SGD with momentum is analyzed as a sampler: the authors show how to adjust the damping coefficient so that, together with the learning rate and minibatch size, its stationary distribution can likewise be matched to the posterior.
  4. Analysis of Stochastic-Gradient MCMC Algorithms:
    • The paper extends its stochastic-process framework to stochastic-gradient MCMC algorithms such as Stochastic Gradient Langevin Dynamics and Stochastic Gradient Fisher Scoring, quantifying the approximation errors introduced by finite learning rates.
  5. Polyak Averaging:
    • Through the stochastic-process perspective, the authors give a concise proof of the optimality of Polyak (iterate) averaging and introduce the Averaged Stochastic Gradient Sampler for efficient approximate MCMC.
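To make contribution (1) concrete, below is a minimal sketch of constant SGD used as a posterior sampler on a toy Bayesian linear regression. Everything here is illustrative rather than the authors' reference implementation: the model, the helper `grad_minibatch`, and the burn-in length are assumptions, and the learning rate follows our reading of the paper's KL-minimizing scalar choice, eps = 2SD / (N tr(BBᵀ)), with the gradient-noise covariance estimated empirically.

```python
import numpy as np

# Toy Bayesian linear regression: N observations, D parameters,
# standard normal prior on theta.  Illustrative only.
rng = np.random.default_rng(0)
N, D, S = 10_000, 5, 100                    # data size, dim, minibatch size
X = rng.normal(size=(N, D))
y = X @ rng.normal(size=D) + rng.normal(size=N)

def grad_minibatch(theta, idx):
    """Minibatch gradient of the per-datum negative log-posterior."""
    Xb, yb = X[idx], y[idx]
    return Xb.T @ (Xb @ theta - yb) / len(idx) + theta / N

# Estimate the gradient-noise covariance near the mode, then set the
# scalar learning rate to the (recalled) KL-optimal choice
# eps = 2*S*D / (N * tr(BB^T)).
theta = np.linalg.solve(X.T @ X + np.eye(D), X.T @ y)   # approximate mode
grads = np.stack([grad_minibatch(theta, rng.choice(N, S)) for _ in range(500)])
BBt_trace = np.trace(np.cov(grads.T)) * S               # per-example noise
eps = 2 * S * D / (N * BBt_trace)

# Run constant SGD; after burn-in the iterates are treated as
# approximate posterior samples.
samples = []
for t in range(5_000):
    theta = theta - eps * grad_minibatch(theta, rng.choice(N, S))
    if t >= 1_000:
        samples.append(theta.copy())
samples = np.array(samples)
print("posterior mean ~", samples.mean(axis=0))
print("posterior var  ~", samples.var(axis=0))
```

After burn-in, the mean and covariance of the collected iterates serve as moment estimates of the posterior; shrinking the learning rate trades mixing speed for approximation accuracy.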

Theoretical Insights

  • The Ornstein-Uhlenbeck (OU) process is used to capture the behavior of SGD under a constant learning rate, yielding results applicable to a broad class of SGD variants. This approach allows direct manipulation of the stationary covariance so that it approximates the Bayesian posterior (a numerical sketch follows this list).
  • By deriving optimal conditions for preconditioners and learning rates, the work shows how constant SGD can be adapted into various forms, including dense and diagonal preconditioning. These results tie into well-known adaptive methods like AdaGrad and RMSProp.
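Under the OU approximation, the stationary covariance Σ of the iterates solves a Lyapunov equation in the Hessian A at the optimum, the gradient-noise covariance BBᵀ, the learning rate eps, and the minibatch size S. The short sketch below checks this numerically with SciPy; the matrices are synthetic, and the equation A Σ + Σ Aᵀ = (eps/S) BBᵀ is our reading of the paper's result.

```python
import numpy as np
from scipy.linalg import solve_continuous_lyapunov

# Stationary covariance of the OU approximation to constant SGD.
# A: Hessian of the loss at the optimum (SPD), BBt: gradient-noise
# covariance, eps: learning rate, S: minibatch size.  Synthetic values.
rng = np.random.default_rng(1)
D, S, eps = 4, 100, 0.01
A = rng.normal(size=(D, D)); A = A @ A.T + D * np.eye(D)
B = rng.normal(size=(D, D)); BBt = B @ B.T

# Sigma solves the Lyapunov equation  A Sigma + Sigma A^T = (eps/S) BBt.
# scipy's solve_continuous_lyapunov(a, q) solves a x + x a^H = q.
Sigma = solve_continuous_lyapunov(A, (eps / S) * BBt)
assert np.allclose(A @ Sigma + Sigma @ A.T, (eps / S) * BBt)
```

Matching Σ to the posterior covariance then pins down the optimal learning rate (scalar case) or preconditioner (dense and diagonal cases), which is where the connection to AdaGrad- and RMSProp-style preconditioning arises.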

Practical Implications

The findings bear directly on algorithm design and deployment for probabilistic models. By harnessing constant SGD and stochastic-gradient MCMC methods, practitioners can perform scalable approximate Bayesian inference at a fraction of the computational cost of traditional sampling methods.

Future Directions

  • Hyperparameter Optimization: The paper illustrates a double-SGD scheme, updating model parameters and hyperparameters in tandem within a Bayesian framework, as a practical alternative to cross-validation (a sketch follows this list).
  • Extended Analysis of MCMC Algorithms: Further exploration of the dynamic aspects of MCMC sampling under constant learning rates could elucidate deeper connections between optimization routines and inference tasks.
  • Iterate Averaging: As stochastic sampling methods evolve, incorporating insights from iterate averaging may drive innovations in SGMCMC efficiency, especially in high-dimensional spaces.
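The following is a heavily simplified sketch of the double-SGD variational-EM idea: constant SGD on the parameters theta (E-step, whose iterates approximate posterior samples) interleaved with stochastic gradient ascent on a hyperparameter lam, here a scalar prior precision (M-step). The E-step uses a single iterate in place of an expectation, and the model, names, and step sizes are all illustrative assumptions.

```python
import numpy as np

# "Double SGD": E-step via constant SGD on theta, M-step via SGD on the
# prior precision lam.  Toy ridge regression; illustrative only.
rng = np.random.default_rng(2)
N, D, S = 5_000, 3, 50
X = rng.normal(size=(N, D))
y = X @ rng.normal(size=D) + rng.normal(scale=0.5, size=N)

theta, lam = np.zeros(D), 1.0
eps_theta, eps_lam = 1e-3, 1e-3

for t in range(20_000):
    idx = rng.choice(N, S)
    # E-step: one constant-SGD step on the per-datum negative log-posterior.
    grad = X[idx].T @ (X[idx] @ theta - y[idx]) / S + lam * theta / N
    theta -= eps_theta * grad
    # M-step: stochastic gradient ascent on E_q[log p(theta | lam)],
    # using the current iterate as a one-sample estimate:
    # d/dlam log N(theta; 0, I/lam) = D/(2*lam) - ||theta||^2 / 2
    lam += eps_lam * (D / (2 * lam) - theta @ theta / 2)
    lam = max(lam, 1e-6)                    # keep the precision positive

print("learned prior precision:", lam)
```

In a fuller implementation, the theta step size would be set to the KL-matched rate from the first sketch, and Polyak averaging of the iterates would reduce the variance of both the posterior estimate and the hyperparameter gradient.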

Conclusion

This paper prompts a reevaluation of what plain SGD can do, framing it as a versatile tool for approximate Bayesian inference. The theoretical foundations laid here point toward more efficient and scalable inference techniques, marking a significant step at the intersection of stochastic optimization and Bayesian inference.