Optimal rates for zero-order convex optimization: the power of two function evaluations

Published 7 Dec 2013 in math.OC, cs.IT, math.IT, and stat.ML | (1312.2139v2)

Abstract: We consider derivative-free algorithms for stochastic and non-stochastic convex optimization problems that use only function values rather than gradients. Focusing on non-asymptotic bounds on convergence rates, we show that if pairs of function values are available, algorithms for $d$-dimensional optimization that use gradient estimates based on random perturbations suffer a factor of at most $\sqrt{d}$ in convergence rate over traditional stochastic gradient methods. We establish such results for both smooth and non-smooth cases, sharpening previous analyses that suggested a worse dimension dependence, and extend our results to the case of multiple ($m \ge 2$) evaluations. We complement our algorithmic development with information-theoretic lower bounds on the minimax convergence rate of such problems, establishing the sharpness of our achievable results up to constant (sometimes logarithmic) factors.


Authors (4)

Citations (445)


Summary

The paper demonstrates that, when pairs of function values are available, zero-order convex optimization achieves convergence rates that are worse than those of stochastic gradient methods by a factor of at most $\sqrt{d}$.
The authors introduce algorithmic frameworks with proximal and smoothing techniques to effectively handle both smooth and non-smooth optimization problems.
The paper establishes minimax lower bounds via information-theoretic analysis, showing that the rates achieved by the two-point approach are sharp up to constant (sometimes logarithmic) factors.

An Analysis of Zero-Order Convex Optimization with Two Function Evaluations

This paper examines derivative-free (zero-order) algorithms for convex optimization, focusing on scenarios where gradient information is inaccessible. It covers both stochastic and non-stochastic settings, with algorithms that rely solely on function values. The analysis addresses smooth and non-smooth problems and gives non-asymptotic convergence rates, with particular attention to how those rates depend on the dimension $d$ of the problem.
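To make the dimension dependence concrete, the comparison can be written schematically as follows. The notation is an assumption introduced here for illustration and is not taken from the summary: $R$ is the radius of an $\ell_2$ domain, $G$ bounds the (sub)gradient norm, $k$ is the number of iterations, $\hat{x}_k$ is the algorithm's output, and $\lesssim$ hides constant (in some cases logarithmic) factors.

$$
\underbrace{\mathbb{E}[f(\hat{x}_k)] - f(x^\ast) \lesssim \frac{R\,G}{\sqrt{k}}}_{\text{stochastic gradient method}}
\qquad \text{versus} \qquad
\underbrace{\mathbb{E}[f(\hat{x}_k)] - f(x^\ast) \lesssim \frac{R\,G\,\sqrt{d}}{\sqrt{k}}}_{\text{two function values per step}}
$$

Read this way, reaching a fixed accuracy with function values alone needs roughly $d$ times as many iterations as a gradient-based method, rather than a higher power of $d$.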

Key Contributions

Improved Convergence Rates: The authors compare traditional first-order methods, where gradient or subgradient information is available, to zero-order methods that use only function evaluations. Previous analyses suggested a worse dependence on dimension when switching to zero-order methods. This paper shows that, with pairs of function values, the convergence rate deteriorates by a factor of at most $\sqrt{d}$ in both the smooth and non-smooth cases, where $d$ is the dimension.
Two-Point Estimation: A central insight is the advantage of estimating gradients from two function evaluations taken along a random perturbation. The analysis shows that two evaluations are substantially better than a single evaluation, and the gap grows with the dimension $d$. The results also extend to multiple ($m \ge 2$) evaluations.
Algorithmic Frameworks: The paper provides algorithmic frameworks using various proximal functions and smoothing distributions, tailored for both smooth and non-smooth convex optimization tasks. These frameworks help achieve the optimal rates of convergence derived by the authors.
Minimax Lower Bounds: Complementing the algorithmic results, the authors establish lower bounds on the minimax convergence rate. These bounds show that the proposed two-point methods are optimal up to constant (sometimes logarithmic) factors. The lower bounds are derived with information-theoretic techniques, so no algorithm restricted to the same function-value information can achieve a faster worst-case rate.
Convergence Rate Analysis: Detailed theoretical analyses give convergence rates for different norms and problem settings. For instance, specific corollaries address convergence over $\ell_2$ domains, showing how the multiplicative $\sqrt{d}$ factor enters the rate relative to full-information methods.
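The two-point idea described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' exact algorithm: the function names (`two_point_gradient`, `zero_order_sgd`), the Gaussian perturbation, the fixed smoothing parameter `u`, the `step_size / sqrt(k)` schedule, and the Euclidean-ball projection are all choices made here for the example.

```python
import numpy as np


def two_point_gradient(f, x, u, rng):
    """Estimate the gradient of f at x from two function values.

    Assumption: the perturbation z is standard Gaussian, so E[z z^T] = I.
    """
    z = rng.standard_normal(x.shape)
    # Finite difference along z, scaled back onto the direction z.
    return (f(x + u * z) - f(x)) / u * z


def zero_order_sgd(f, x0, steps, step_size, u, radius, seed=0):
    """Projected gradient descent driven by two-point estimates.

    Returns the running average of the iterates. All hyperparameters
    are illustrative, not the tuned values from the paper.
    """
    rng = np.random.default_rng(seed)
    x = np.array(x0, dtype=float)
    avg = np.zeros_like(x)
    for k in range(1, steps + 1):
        g = two_point_gradient(f, x, u, rng)
        x = x - step_size / np.sqrt(k) * g
        # Project back onto the Euclidean ball of the given radius.
        norm = np.linalg.norm(x)
        if norm > radius:
            x *= radius / norm
        avg += (x - avg) / k  # running average of the iterates
    return avg


if __name__ == "__main__":
    d = 10
    target = np.ones(d)

    def f(x):
        return 0.5 * np.sum((x - target) ** 2)

    x_hat = zero_order_sgd(f, np.zeros(d), steps=5000,
                           step_size=0.05, u=1e-4, radius=10.0)
    print("initial objective:", f(np.zeros(d)))
    print("final objective:  ", f(x_hat))
```

Each iteration queries `f` exactly twice, at `x` and at `x + u * z`. Because the same point `x` appears in both evaluations, the difference cancels most of the function's magnitude, which is why the estimate has much lower variance than a single-evaluation scheme.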

Implications and Future Work

The findings have broad implications for derivative-free optimization in machine learning and in operational settings where gradient computations are costly or impossible. The demonstrated benefit of paired evaluations points toward more efficient algorithm designs in high-dimensional parameter spaces. Because the achieved rates match the lower bounds up to constant (sometimes logarithmic) factors, simply adding more evaluation points cannot improve the worst-case rate by more than those factors, so further gains must come from other directions.

Practically, these insights are relevant to structured prediction, online bandit optimization, and simulation-based optimization, where derivative computations are either infeasible or impractical. Since the worst-case rates are already optimal, further improvements should come either from better-than-worst-case bounds or from hybrid methods that exploit additional problem structure.

In conclusion, this paper gives a rigorous account of zero-order optimization with two function evaluations, clarifying the trade-off between dimension dependence and the kind of information an algorithm can access. By pairing concrete algorithms with matching lower bounds, the authors invite further work on extending these methods to broader application domains and on deeper theoretical questions.
