Policy Mirror Descent with Lookahead (2403.14156v3)

Published 21 Mar 2024 in cs.LG, cs.AI, and stat.ML

Abstract: Policy Mirror Descent (PMD) stands as a versatile algorithmic framework encompassing several seminal policy gradient algorithms, such as natural policy gradient, with connections to state-of-the-art reinforcement learning (RL) algorithms such as TRPO and PPO. PMD can be seen as a soft Policy Iteration algorithm implementing regularized 1-step greedy policy improvement. However, 1-step greedy policies might not be the best choice, and recent remarkable empirical successes in RL such as AlphaGo and AlphaZero have demonstrated that greedy approaches with respect to multiple steps outperform their 1-step counterparts. In this work, we propose a new class of PMD algorithms called $h$-PMD which incorporates multi-step greedy policy improvement with lookahead depth $h$ into the PMD update rule. To solve discounted infinite-horizon Markov Decision Processes with discount factor $\gamma$, we show that $h$-PMD, which generalizes the standard PMD, enjoys a faster dimension-free $\gamma^h$-linear convergence rate, contingent on the computation of multi-step greedy policies. We propose an inexact version of $h$-PMD where lookahead action values are estimated. Under a generative model, we establish a sample complexity for $h$-PMD which improves over prior work. Finally, we extend our result to linear function approximation to scale to large state spaces. Under suitable assumptions, our sample complexity only involves dependence on the dimension of the feature map space instead of the state space size.
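To make the $h$-PMD update described in the abstract concrete, here is a minimal tabular sketch assuming exact policy evaluation and a KL (negative-entropy) mirror map, so the policy update reduces to a multiplicative-weights/softmax step. The function name `h_pmd`, the hyperparameters, and the random MDP in the usage snippet are illustrative assumptions rather than the authors' reference implementation.

```python
import numpy as np


def h_pmd(P, r, gamma, h, eta, num_iters):
    """Tabular sketch of h-PMD with a KL (negative-entropy) mirror map.

    P: transition tensor, shape (S, A, S); r: reward table, shape (S, A).
    Each iteration evaluates the current policy exactly, builds depth-h
    lookahead action values by applying the Bellman optimality operator
    h-1 times followed by one action-value backup, and then performs a
    multiplicative-weights (softmax) policy update.
    """
    S, A = r.shape
    pi = np.full((S, A), 1.0 / A)  # uniform initial policy

    for _ in range(num_iters):
        # Exact policy evaluation: V^pi = (I - gamma * P^pi)^{-1} r^pi.
        P_pi = np.einsum("sa,sap->sp", pi, P)
        r_pi = np.einsum("sa,sa->s", pi, r)
        V = np.linalg.solve(np.eye(S) - gamma * P_pi, r_pi)

        # Depth-h lookahead: h-1 Bellman optimality backups on V ...
        for _ in range(h - 1):
            V = np.max(r + gamma * P @ V, axis=1)
        # ... then one more backup gives the lookahead action values
        # (h = 1 recovers Q^pi and hence the standard PMD update).
        Q_h = r + gamma * P @ V

        # KL mirror step: pi_{k+1}(a|s) ∝ pi_k(a|s) * exp(eta * Q_h(s, a)).
        logits = np.log(pi + 1e-12) + eta * Q_h
        logits -= logits.max(axis=1, keepdims=True)  # numerical stability
        pi = np.exp(logits)
        pi /= pi.sum(axis=1, keepdims=True)

    return pi


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    S, A = 5, 3
    P = rng.dirichlet(np.ones(S), size=(S, A))  # random kernel, shape (S, A, S)
    r = rng.uniform(size=(S, A))                # random rewards in [0, 1]
    print(np.round(h_pmd(P, r, gamma=0.9, h=3, eta=1.0, num_iters=50), 3))
```

With h = 1 the lookahead loop is skipped and the update reduces to a standard (natural-policy-gradient-style) PMD step; larger h trades additional lookahead computation for the faster $\gamma^h$-linear contraction highlighted in the abstract.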

Authors (2)
  1. Kimon Protopapas (1 paper)
  2. Anas Barakat (13 papers)

