Minimizing Finite Sums with the Stochastic Average Gradient (1309.2388v2)

Published 10 Sep 2013 in math.OC, cs.LG, stat.CO, and stat.ML

Abstract: We propose the stochastic average gradient (SAG) method for optimizing the sum of a finite number of smooth convex functions. Like stochastic gradient (SG) methods, the SAG method's iteration cost is independent of the number of terms in the sum. However, by incorporating a memory of previous gradient values the SAG method achieves a faster convergence rate than black-box SG methods. The convergence rate is improved from O(1/k^{1/2}) to O(1/k) in general, and when the sum is strongly-convex the convergence rate is improved from the sub-linear O(1/k) to a linear convergence rate of the form O(ρ^k) for ρ < 1. Further, in many cases the convergence rate of the new method is also faster than black-box deterministic gradient methods, in terms of the number of gradient evaluations. Numerical experiments indicate that the new algorithm often dramatically outperforms existing SG and deterministic gradient methods, and that the performance may be further improved through the use of non-uniform sampling strategies.

Summary

The paper presents the SAG method that leverages memory of past gradients to significantly improve convergence rates over standard stochastic gradient techniques.
For general convex functions, SAG improves the convergence rate from O(1/√k) to O(1/k), and it achieves linear convergence for strongly convex problems.
Experimental results demonstrate SAG’s scalability and robustness, outperforming traditional stochastic and full gradient methods on various large-scale datasets.

Minimizing Finite Sums with the Stochastic Average Gradient

The paper "Minimizing Finite Sums with the Stochastic Average Gradient" by Schmidt, Le Roux, and Bach addresses the optimization problem commonly encountered in machine learning, where the objective is to minimize the sum of a finite number of smooth convex functions, such as in least-squares and logistic regression.

Overview

The primary contribution of this paper is the introduction and analysis of the Stochastic Average Gradient (SAG) algorithm, an alternative to traditional stochastic gradient (SG) methods. SG methods are favored because their iteration cost is independent of the number of terms in the sum, which makes them viable for large datasets. However, their convergence rate is slower than that of full gradient (FG) methods. The SAG algorithm incorporates a memory of previous gradient values, which leads to a significantly improved convergence rate.
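
To make the comparison concrete, the three iterations can be sketched in symbols. The notation here is chosen for illustration and is an assumption of this sketch: $g(x) = \frac{1}{n}\sum_{i=1}^{n} f_i(x)$ is the objective, $\alpha_k$ is the step size, $i_k$ is the index sampled at iteration $k$, and $y_i^k$ is the stored gradient for term $i$.

$$
\begin{aligned}
\text{FG:}\quad & x^{k+1} = x^k - \frac{\alpha_k}{n} \sum_{i=1}^{n} f_i'(x^k) \\
\text{SG:}\quad & x^{k+1} = x^k - \alpha_k \, f_{i_k}'(x^k) \\
\text{SAG:}\quad & x^{k+1} = x^k - \frac{\alpha_k}{n} \sum_{i=1}^{n} y_i^k,
\qquad
y_i^k =
\begin{cases}
f_i'(x^k) & \text{if } i = i_k, \\
y_i^{k-1} & \text{otherwise.}
\end{cases}
\end{aligned}
$$

Like SG, the SAG step evaluates only one new gradient per iteration. Like FG, it moves along an average over all $n$ terms, although most entries of that average are stale.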

Theoretical Contributions

The theoretical advances of the SAG method can be summarized as follows:

Iteration Cost: The iteration cost of the SAG method remains independent of the number of terms in the sum, similar to SG methods.
Improved Convergence Rates:
- For general convex functions, SAG improves the convergence rate from $O(1/\sqrt{k})$ (SG methods) to $O(1/k)$.
- For strongly convex functions, SAG achieves a linear convergence rate of the form $O(\rho^k)$, similar to FG methods, where $\rho < 1$ depends on the problem's conditioning.
Use of Memory: By maintaining a memory of past gradient values, the SAG algorithm achieves a faster rate compared to black-box SG methods without incurring the high iteration costs typical of FG methods.
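
The memory mechanism described above can be illustrated with a minimal sketch in pure Python, applied to a small least-squares problem. This is not the authors' implementation. The function name `sag_least_squares`, the zero initialization of the stored gradients, the toy data, and the conservative step size $1/(16L)$ (with $L$ taken as the largest squared row norm) are all assumptions made for this example.

```python
import random


def sag_least_squares(A, b, step, n_iters, seed=0):
    """Basic SAG loop for (1/n) * sum_i 0.5 * (a_i . x - b_i)^2.

    Hypothetical helper written for illustration only.
    """
    rng = random.Random(seed)
    n, d = len(A), len(A[0])
    x = [0.0] * d
    # For this linear model the gradient of f_i is residual_i * a_i,
    # so one stored scalar per example is enough memory.
    stored_residual = [0.0] * n
    grad_sum = [0.0] * d  # running sum of the n stored gradients

    for _ in range(n_iters):
        i = rng.randrange(n)  # uniform sampling of one term
        residual = sum(a * xj for a, xj in zip(A[i], x)) - b[i]
        delta = residual - stored_residual[i]
        stored_residual[i] = residual
        # Swap the old gradient of term i for the new one in the sum.
        for j in range(d):
            grad_sum[j] += delta * A[i][j]
        # Step along the average of the stored gradients.
        for j in range(d):
            x[j] -= (step / n) * grad_sum[j]
    return x


# Toy consistent system whose exact solution is [2, -1].
A = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [1.0, -1.0]]
b = [2.0, -1.0, 1.0, 3.0]
L = max(sum(a * a for a in row) for row in A)  # largest squared row norm
x_hat = sag_least_squares(A, b, step=1.0 / (16.0 * L), n_iters=20000)
```

Each iteration touches a single row of `A`, so the cost per step does not grow with `n`, while `grad_sum` keeps the update aligned with an approximation of the full gradient.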

Experimental Results

Empirical evaluations underline the robustness and efficiency of the SAG algorithm:

Comparison to SG and FG Methods: The SAG method outperforms standard SG methods after the initial iterations and continues to show significant improvements over multiple passes through the dataset. Even compared to sophisticated FG methods like L-BFGS, which show strong performance in later iterations, SAG maintains a competitive edge with fewer passes through the data.
Performance on Different Datasets: Across various datasets, SAG consistently demonstrates faster convergence compared to both SG and FG methods, particularly in scenarios where a few passes over the data are permissible but full optimization via FG methods is impractical.

Practical Implications and Developments

This work has both practical and theoretical implications. Practically, SAG's low iteration cost combined with its rapid convergence makes it suitable for large-scale optimization tasks common in machine learning and data analysis. Theoretically, the introduction of memory into SG methods opens pathways to further enhancements in optimization algorithms.

Future Developments:

Proximal and Coordinate-Wise Extensions: There is potential to extend SAG to handle non-smooth objectives via proximal gradient techniques and to optimize a subset of variables on each iteration through coordinate-wise methods.
Newton-Like Variants: Incorporating approximate second-order information for even faster convergence.
Relaxing Convexity Assumptions: Exploring the utility of SAG for non-convex optimization problems.
Non-Uniform Sampling: Empirical results suggest that SAG's performance can be further optimized through non-uniform sampling strategies, particularly for ill-conditioned problems.
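
As a rough illustration of what a non-uniform sampling strategy could look like, the sketch below draws indices with probability partly proportional to a per-term Lipschitz estimate. The helper names `sampling_probabilities` and `sample_index`, the `uniform_mix` parameter, and the example constants are hypothetical. The scheme is one plausible choice, not necessarily the one evaluated in the paper.

```python
import random


def sampling_probabilities(lipschitz, uniform_mix=0.5):
    """Mix uniform sampling with sampling proportional to Lipschitz estimates.

    Hypothetical helper: uniform_mix=1.0 gives plain uniform sampling.
    """
    n = len(lipschitz)
    total = sum(lipschitz)
    return [uniform_mix / n + (1.0 - uniform_mix) * li / total
            for li in lipschitz]


def sample_index(probs, rng):
    """Draw one index according to the given probabilities."""
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]


# Terms with larger Lipschitz estimates are visited more often.
probs = sampling_probabilities([1.0, 1.0, 2.0])
rng = random.Random(0)
draws = [sample_index(probs, rng) for _ in range(1000)]
```

The uniform component ensures every term keeps a nonzero chance of being refreshed, even when its Lipschitz estimate is small.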

In conclusion, the SAG method represents a significant advancement in the optimization of large-scale convex problems, combining the low iteration cost characteristic of SG methods with the fast convergence rates of FG methods. This combination positions SAG as a robust and efficient choice for practical large-scale optimization scenarios.
