Papers
Topics
Authors
Recent
Assistant
AI Research Assistant
Well-researched responses based on relevant abstracts and paper content.
Custom Instructions Pro
Preferences or requirements that you'd like Emergent Mind to consider when generating responses.
Gemini 2.5 Flash
Gemini 2.5 Flash 165 tok/s
Gemini 2.5 Pro 46 tok/s Pro
GPT-5 Medium 27 tok/s Pro
GPT-5 High 27 tok/s Pro
GPT-4o 64 tok/s Pro
Kimi K2 183 tok/s Pro
GPT OSS 120B 432 tok/s Pro
Claude Sonnet 4.5 36 tok/s Pro
2000 character limit reached

A general sample complexity analysis of vanilla policy gradient (2107.11433v5)

Published 23 Jul 2021 in cs.LG, math.OC, and stat.ML

Abstract: We adapt recent tools developed for the analysis of Stochastic Gradient Descent (SGD) in non-convex optimization to obtain convergence and sample complexity guarantees for the vanilla policy gradient (PG). Our only assumptions are that the expected return is smooth w.r.t. the policy parameters, that its $H$-step truncated gradient is close to the exact gradient, and a certain ABC assumption. This assumption requires the second moment of the estimated gradient to be bounded by $A\geq 0$ times the suboptimality gap, $B \geq 0$ times the norm of the full batch gradient and an additive constant $C \geq 0$, or any combination of aforementioned. We show that the ABC assumption is more general than the commonly used assumptions on the policy space to prove convergence to a stationary point. We provide a single convergence theorem that recovers the $\widetilde{\mathcal{O}}(\epsilon{-4})$ sample complexity of PG to a stationary point. Our results also affords greater flexibility in the choice of hyper parameters such as the step size and the batch size $m$, including the single trajectory case (i.e., $m=1$). When an additional relaxed weak gradient domination assumption is available, we establish a novel global optimum convergence theory of PG with $\widetilde{\mathcal{O}}(\epsilon{-3})$ sample complexity. We then instantiate our theorems in different settings, where we both recover existing results and obtain improved sample complexity, e.g., $\widetilde{\mathcal{O}}(\epsilon{-3})$ sample complexity for the convergence to the global optimum for Fisher-non-degenerated parametrized policies.

Citations (56)

Summary

We haven't generated a summary for this paper yet.

Dice Question Streamline Icon: https://streamlinehq.com

Open Problems

We haven't generated a list of open problems mentioned in this paper yet.

Lightbulb Streamline Icon: https://streamlinehq.com

Continue Learning

We haven't generated follow-up questions for this paper yet.

List To Do Tasks Checklist Streamline Icon: https://streamlinehq.com

Collections

Sign up for free to add this paper to one or more collections.