Stochastic Quasi-Newton Methods for Nonconvex Stochastic Optimization (1607.01231v4)

Published 5 Jul 2016 in math.OC, cs.LG, cs.NA, and stat.ML

Abstract: In this paper we study stochastic quasi-Newton methods for nonconvex stochastic optimization, where we assume that noisy information about the gradients of the objective function is available via a stochastic first-order oracle (SFO). We propose a general framework for such methods, for which we prove almost sure convergence to stationary points and analyze its worst-case iteration complexity. When a randomly chosen iterate is returned as the output of such an algorithm, we prove that in the worst-case, the SFO-calls complexity is $O(\epsilon^{-2})$ to ensure that the expectation of the squared norm of the gradient is smaller than the given accuracy tolerance $\epsilon$. We also propose a specific algorithm, namely a stochastic damped L-BFGS (SdLBFGS) method, that falls under the proposed framework. {Moreover, we incorporate the SVRG variance reduction technique into the proposed SdLBFGS method, and analyze its SFO-calls complexity. Numerical results on a nonconvex binary classification problem using SVM, and a multiclass classification problem using neural networks are reported.

Citations (169)

View on Semantic Scholar

Summary

The paper introduces a comprehensive framework that adapts stochastic quasi-Newton methods to nonconvex settings, proving convergence with noisy gradient estimates.
It presents a rigorous complexity analysis with worst-case SFO-call complexity of O(ε⁻²) and incorporates variance reduction techniques like SVRG to boost convergence rates.
Empirical evaluations on SVMs and neural networks on datasets such as RCV1 and MNIST demonstrate that the proposed SdLBFGS method outperforms conventional SGD in practical applications.

Analysis of Stochastic Quasi-Newton Methods for Nonconvex Stochastic Optimization

The paper by Wang et al. addresses a critical area in stochastic optimization: improving the efficiency of solving nonconvex problems by employing stochastic quasi-Newton (SQN) methods. The SQN methods, widely used in deterministic settings for their efficiency and robustness, are adapted for stochastic environments where only noisy gradient information is available. This adaptation accommodates the complexities of nonconvex problems frequently encountered in machine learning and statistical applications.

Main Contributions

General Framework for SQN Methods: The authors propose a comprehensive framework for applying SQN methods to nonconvex stochastic optimization problems. This framework is built on the assumption that noisy gradient estimates can be obtained via a stochastic first-order oracle (SFO). Notably, they ensure the convergence of the proposed methods to stationary points under diminishing step sizes.
Complexity Analysis: To quantify the efficiency of the approach, the paper presents rigorous complexity analyses. They show that the worst-case SFO-calls complexity is $O(\epsilon^{-2})$ , which is novel for SQN methods within nonconvex stochastic settings.
Stochastic Damped L-BFGS Method: The paper introduces a specific adaptation of the L-BFGS quasi-Newton method, named SdLBFGS, that incorporates damping to maintain the positive definiteness of the Hessian approximations, a critical aspect in nonconvex scenarios. The SdLBFGS method does not explicitly construct the Hessian but provides a computationally efficient means to approximate it, leveraging historical gradient information adaptively updated at each iteration.
Incorporation of Variance Reduction: To further enhance the convergence rate, variance reduction techniques such as SVRG are integrated into the SdLBFGS framework. This integration allows for constant step sizes and improves convergence speed, which stands out for empirical risk minimization problems.
Empirical Evaluation: The paper reports numerical results demonstrating the advantages of the proposed methods over standard stochastic gradient descent (SGD) methods. Tests include nonconvex classification problems such as support vector machines (SVM) with a sigmoid loss and neural networks on both synthetic and real datasets, like RCV1 and MNIST.

Implications and Future Directions

This work significantly contributes to the field of nonconvex optimization, specifically in settings where gradient information is noisy. By aligning quasi-Newton methods with stochastic needs, it opens multiple pathways for more robust machine learning algorithms, especially in deep learning and reinforcement learning, where optimization landscapes are typically nonconvex and data-driven.

Theoretical Advances

The adoption of damped updates in the SQN framework ensures stability and convergence, offering a pathway to explore further theoretical extensions that could incorporate proximal operators or handle constraints directly within SQN methods.

Practical Applications

For practitioners, the proposed methods promise more efficient training processes in scenarios with vast data or when real-time learning is imperative. The ability to simplify tuning (e.g., relying on constant step sizes when using variance reduction) can facilitate deployment in practical machine learning systems.

Conclusion

The authors present a robust extension of quasi-Newton methods to stochastic settings, suitable for nonconvex problems, supported by a solid theoretical foundation and empirical validation. Their work enhances understanding and application of optimization techniques within machine learning, proposing solutions that bridge theoretical rigor and practical efficiency. Future explorations could involve adapting these methods to distributed systems or applying them to other emerging nonconvex domains in artificial intelligence.