Policy Optimization for Constrained MDPs with Provable Fast Global Convergence (2111.00552v2)

Published 31 Oct 2021 in cs.LG, cs.AI, and math.OC

Abstract: We address the problem of finding the optimal policy of a constrained Markov decision process (CMDP) using a gradient descent-based algorithm. Previous results have shown that a primal-dual approach can achieve an $\mathcal{O}(1/\sqrt{T})$ global convergence rate for both the optimality gap and the constraint violation. We propose a new algorithm, called the policy mirror descent-primal dual (PMD-PD) algorithm, that can provably achieve a faster $\mathcal{O}(\log(T)/T)$ convergence rate for both the optimality gap and the constraint violation. For the primal (policy) update, the PMD-PD algorithm utilizes a modified value function and performs natural policy gradient steps, which are equivalent to mirror descent steps with appropriate regularization. For the dual update, the PMD-PD algorithm uses modified Lagrange multipliers to ensure a faster convergence rate. We also present two extensions of this approach to the settings with zero constraint violation and sample-based estimation. Experimental results demonstrate the faster convergence rate and the better performance of the PMD-PD algorithm compared with existing policy gradient-based algorithms.
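
To make the primal-dual structure described in the abstract concrete, the following is a minimal sketch of a generic natural-policy-gradient primal-dual loop on a small tabular CMDP. It is not the PMD-PD algorithm: the abstract does not specify the modified value function or the modified Lagrange multipliers, so neither is reproduced here, and no convergence rate is claimed for this code. The problem form (maximize reward value subject to a utility value of at least `b`), the function names `policy_q` and `npg_primal_dual`, and all step sizes are assumptions made for illustration.

```python
import numpy as np

def softmax(theta):
    """Row-wise softmax, shifted by the row maximum for numerical stability."""
    z = np.exp(theta - theta.max(axis=1, keepdims=True))
    return z / z.sum(axis=1, keepdims=True)

def policy_q(P, payoff, pi, gamma):
    """Exact Q-values of policy pi in a tabular MDP with the given payoff.

    P has shape (S, A, S); payoff and pi have shape (S, A).
    """
    S, _ = payoff.shape
    P_pi = np.einsum("sa,sat->st", pi, P)      # state-to-state kernel under pi
    payoff_pi = (pi * payoff).sum(axis=1)      # expected one-step payoff per state
    v = np.linalg.solve(np.eye(S) - gamma * P_pi, payoff_pi)
    return payoff + gamma * P @ v              # Q[s, a]

def npg_primal_dual(P, r, g, b, rho, gamma=0.9, eta=1.0,
                    eta_dual=0.1, lam_max=10.0, iters=200):
    """Generic primal-dual loop (illustrative assumption, not PMD-PD).

    Assumed problem: maximize V_r(rho) subject to V_g(rho) >= b.
    """
    S, A = r.shape
    theta = np.zeros((S, A))                   # softmax logits, uniform start
    lam = 0.0                                  # Lagrange multiplier
    for _ in range(iters):
        pi = softmax(theta)
        q_r = policy_q(P, r, pi, gamma)
        q_g = policy_q(P, g, pi, gamma)
        v_g = rho @ (pi * q_g).sum(axis=1)     # utility value from rho
        # Primal step: for tabular softmax policies, a natural policy
        # gradient step is a multiplicative-weights (mirror descent) update,
        # which amounts to adding the Lagrangian Q-values to the logits.
        theta += eta * (q_r + lam * q_g)
        # Dual step: projected gradient descent on the multiplier.
        lam = float(np.clip(lam - eta_dual * (v_g - b), 0.0, lam_max))
    return softmax(theta), lam
```

The primal step illustrates the equivalence the abstract mentions between natural policy gradient and mirror descent in the tabular softmax case; the dual step raises the multiplier whenever the constraint is violated and lowers it otherwise.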

Citations (18)

