On the Stochastic (Variance-Reduced) Proximal Gradient Method for Regularized Expected Reward Optimization (2401.12508v2)
Abstract: We consider a regularized expected reward optimization problem in the non-oblivious setting that covers many existing problems in reinforcement learning (RL). To solve such an optimization problem, we apply and analyze the classical stochastic proximal gradient method. In particular, the method is shown to admit an $O(\epsilon^{-4})$ sample complexity for reaching an $\epsilon$-stationary point under standard conditions. Since the variance of the classical stochastic gradient estimator is typically large, which slows down convergence, we also apply an efficient stochastic variance-reduced proximal gradient method with an importance-sampling-based ProbAbilistic Gradient Estimator (PAGE). Our analysis shows that the sample complexity can be improved from $O(\epsilon^{-4})$ to $O(\epsilon^{-3})$ under additional conditions. Our results on the stochastic (variance-reduced) proximal gradient method match the sample complexity of their most competitive counterparts for discounted Markov decision processes under similar settings. To the best of our knowledge, the proposed methods represent a novel approach in addressing the general regularized reward optimization problem.
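To make the two updates concrete, here is a minimal, self-contained sketch (not the paper's implementation) of a proximal gradient ascent step combined with a PAGE-style probabilistic gradient estimator, on a toy Gaussian-policy problem with an $\ell_1$ regularizer whose proximal map is soft thresholding. The reward `r`, the batch sizes, the switch probability `p`, and the step size `eta` are all illustrative assumptions; the importance weights reflect the non-oblivious setting, where the sampling distribution depends on the current parameter.

```python
# Minimal sketch of stochastic proximal gradient with a PAGE-style estimator.
# All problem data (reward, dimensions, hyperparameters) are toy assumptions.
import numpy as np

rng = np.random.default_rng(0)
d = 5                                 # parameter dimension (assumption)
lam, eta, p = 0.01, 0.05, 0.2         # L1 weight, step size, switch probability

def r(a):
    """Toy reward: larger when the action is close to a fixed target."""
    return -np.sum((a - np.ones(d)) ** 2, axis=-1)

def score_grad(theta, n):
    """REINFORCE-style estimate of grad_theta E[r(a)], a ~ N(theta, I).

    The score function of this Gaussian policy is grad log pi(a|theta) = a - theta.
    """
    a = theta + rng.standard_normal((n, d))
    return np.mean(r(a)[:, None] * (a - theta), axis=0), a

def prox_l1(x, t):
    """Proximal map of t * lam * ||.||_1 (soft thresholding)."""
    return np.sign(x) * np.maximum(np.abs(x) - t * lam, 0.0)

theta = np.zeros(d)
g, _ = score_grad(theta, n=1024)              # initial large-batch gradient
for _ in range(500):
    theta_old = theta.copy()
    theta = prox_l1(theta + eta * g, eta)     # gradient ascent step, then prox
    if rng.random() < p:                      # PAGE: occasionally refresh
        g, _ = score_grad(theta, n=1024)      # with a large-batch estimate
    else:                                     # otherwise, cheap recursive update
        a = theta + rng.standard_normal((64, d))
        g_new = np.mean(r(a)[:, None] * (a - theta), axis=0)
        # Importance weights pi(a|theta_old)/pi(a|theta) correct for the fact
        # that the samples were drawn under the *current* parameter.
        log_w = 0.5 * np.sum((a - theta) ** 2 - (a - theta_old) ** 2, axis=-1)
        w = np.exp(log_w)
        g_old = np.mean((w * r(a))[:, None] * (a - theta_old), axis=0)
        g = g + g_new - g_old

print("final theta:", np.round(theta, 2))
```

With probability `p` the estimator is re-anchored by a fresh large-batch gradient; otherwise it is updated recursively from a small batch, which is the mechanism behind the improvement from $O(\epsilon^{-4})$ to $O(\epsilon^{-3})$ in the paper's analysis.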
- Optimality and approximation with policy gradient methods in Markov decision processes. In Conference on Learning Theory, pp. 64–66. PMLR, 2020.
- On the theory of policy gradient methods: Optimality, approximation, and distribution shift. The Journal of Machine Learning Research, 22(1):4431–4506, 2021.
- Understanding the impact of entropy on policy optimization. In International Conference on Machine Learning, pp. 151–160. PMLR, 2019.
- Reinforcement learning with general utilities: Simpler variance reduction and large state-action space. arXiv preprint arXiv:2306.01854, 2023.
- Infinite-horizon policy-gradient estimation. Journal of Artificial Intelligence Research, 15:319–350, 2001.
- Beck, A. First-order methods in optimization. SIAM, 2017.
- Neural combinatorial optimization with reinforcement learning. arXiv preprint arXiv:1611.09940, 2016.
- The Łojasiewicz inequality for nonsmooth subanalytic functions with applications to subgradient dynamical systems. SIAM Journal on Optimization, 17(4):1205–1223, 2007.
- Fast global convergence of natural policy gradient methods with entropy regularization. Operations Research, 70(4):2563–2578, 2022.
- Monte Carlo policy gradient method for binary optimization. arXiv preprint arXiv:2307.00783, 2023.
- Approximate regions of attraction in learning with decision-dependent distributions. In International Conference on Artificial Intelligence and Statistics, pp. 11172–11184. PMLR, 2023.
- Stochastic optimization with decision-dependent distributions. Mathematics of Operations Research, 48(2):954–998, 2023.
- SPIDER: Near-optimal non-convex optimization via stochastic path-integrated differential estimator. Advances in Neural Information Processing Systems, 31, 2018.
- Stochastic policy gradient methods: Improved sample complexity for Fisher-non-degenerate policies. arXiv preprint arXiv:2302.01734, 2023.
- PAGE-PG: A simple and loopless variance-reduced policy gradient method with probabilistic gradient estimation. In International Conference on Machine Learning, pp. 7223–7240. PMLR, 2022.
- The elements of statistical learning: data mining, inference, and prediction, volume 2. Springer, 2009.
- Bregman gradient policy optimization. arXiv preprint arXiv:2106.12112, 2021.
- Regret minimization with performative feedback. In International Conference on Machine Learning, pp. 9760–9785. PMLR, 2022.
- Accelerating stochastic gradient descent using predictive variance reduction. Advances in Neural Information Processing Systems, 26, 2013.
- Kakade, S. M. A natural policy gradient. Advances in Neural Information Processing Systems, 14, 2001.
- Actor-critic algorithms. Advances in Neural Information Processing Systems, 12, 1999.
- Policy gradient for reinforcement learning with general utilities. arXiv preprint arXiv:2210.00991, 2022.
- Lan, G. First-order and stochastic optimization methods for machine learning, volume 1. Springer, 2020.
- Lan, G. Policy mirror descent for reinforcement learning: Linear convergence, new sampling complexity, and generalized problem classes. Mathematical Programming, 198(1):1059–1106, 2023.
- Softmax policy gradient methods can take exponential time to converge. In Conference on Learning Theory, pp. 3107–3110. PMLR, 2021a.
- PAGE: A simple and optimal probabilistic gradient estimator for nonconvex optimization. In International Conference on Machine Learning, pp. 6286–6295. PMLR, 2021b.
- Finite expression method for solving high-dimensional partial differential equations. arXiv preprint arXiv:2206.10121, 2022.
- Neural proximal/trust region policy optimization attains globally optimal policy. arXiv preprint arXiv:1906.10306, 2019.
- An improved analysis of (variance-reduced) policy gradient and natural policy gradient methods. Advances in Neural Information Processing Systems, 33:7624–7636, 2020.
- Reinforcement learning for combinatorial optimization: A survey. Computers & Operations Research, 134:105400, 2021.
- On the global convergence rates of softmax policy gradient methods. In International Conference on Machine Learning, pp. 6820–6829. PMLR, 2020.
- Stochastic optimization for performative prediction. Advances in Neural Information Processing Systems, 33:4929–4939, 2020.
- Playing Atari with deep reinforcement learning. arXiv preprint arXiv:1312.5602, 2013.
- SARAH: A novel method for machine learning problems using stochastic recursive gradient. In International Conference on Machine Learning, pp. 2613–2621. PMLR, 2017.
- Stochastic variance-reduced policy gradient. In International Conference on Machine Learning, pp. 4026–4035. PMLR, 2018.
- Sequential cost-sensitive decision making with reinforcement learning. In Proceedings of the Eighth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 259–268, 2002.
- Performative prediction. In International Conference on Machine Learning, pp. 7599–7609. PMLR, 2020.
- A hybrid stochastic policy gradient algorithm for reinforcement learning. In International Conference on Artificial Intelligence and Statistics, pp. 374–385. PMLR, 2020.
- Policy gradient in Lipschitz Markov decision processes. Machine Learning, 100:255–283, 2015.
- Polyak, B. T. Gradient methods for the minimisation of functionals. USSR Computational Mathematics and Mathematical Physics, 3(4):864–878, 1963.
- Robbins, H. Some aspects of the sequential design of experiments. Bulletin of the American Mathematical Society, 58(5):527–535, 1952.
- Rockafellar, R. T. Convex analysis, volume 11. Princeton University Press, 1997.
- Trust region policy optimization. In International Conference on Machine Learning, pp. 1889–1897. PMLR, 2015.
- Adaptive trust region policy optimization: Global convergence and faster rates for regularized MDPs. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 34, pp. 5668–5675, 2020.
- Lectures on stochastic programming: modeling and theory. SIAM, 2021.
- Hessian aided policy gradient. In International Conference on Machine Learning, pp. 5729–5738. PMLR, 2019.
- A finite expression method for solving high-dimensional committor problems. arXiv preprint arXiv:2306.12268, 2023.
- Reinforcement learning: An introduction. MIT Press, 2018.
- Policy gradient methods for reinforcement learning with function approximation. Advances in Neural Information Processing Systems, 12, 1999.
- Mirror descent policy optimization. arXiv preprint arXiv:2005.09814, 2020.
- Williams, R. J. Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine Learning, 8:229–256, 1992.
- Xiao, L. On the convergence rates of policy gradient methods. The Journal of Machine Learning Research, 23(1):12887–12922, 2022.
- Non-asymptotic convergence of Adam-type reinforcement learning algorithms under Markovian sampling. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 35, pp. 10460–10468, 2021.
- Sample efficient policy gradient methods with recursive variance reduction. arXiv preprint arXiv:1909.08610, 2019.
- An improved convergence analysis of stochastic variance-reduced policy gradient. In Uncertainty in Artificial Intelligence, pp. 541–551. PMLR, 2020.
- Policy optimization with stochastic mirror descent. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 36, pp. 8823–8831, 2022.
- Stochastic recursive momentum for policy gradient methods. arXiv preprint arXiv:2003.04302, 2020.
- A general sample complexity analysis of vanilla policy gradient. In International Conference on Artificial Intelligence and Statistics, pp. 3332–3380. PMLR, 2022.
- Policy mirror descent for regularized reinforcement learning: A generalized framework with linear convergence. SIAM Journal on Optimization, 33(2):1061–1091, 2023.
- Variational policy gradient method for reinforcement learning with general utilities. Advances in Neural Information Processing Systems, 33:4572–4583, 2020a.
- Sample efficient reinforcement learning with REINFORCE. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 35, pp. 10887–10895, 2021a.
- On the convergence and sample efficiency of variance-reduced policy gradient method. Advances in Neural Information Processing Systems, 34:2228–2240, 2021b.
- Global convergence of policy gradient methods to (almost) locally optimal policies. SIAM Journal on Control and Optimization, 58(6):3586–3612, 2020b.
- Zhang, T. Solving large scale linear prediction problems using stochastic gradient descent algorithms. In Proceedings of the Twenty-First International Conference on Machine Learning, pp. 116, 2004.
- A reinforcement learning approach to job-shop scheduling. In IJCAI, volume 95, pp. 1114–1120. Citeseer, 1995.