
Optimistic Planning by Regularized Dynamic Programming (2302.14004v3)

Published 27 Feb 2023 in cs.LG and stat.ML

Abstract: We propose a new method for optimistic planning in infinite-horizon discounted Markov decision processes based on the idea of adding regularization to the updates of an otherwise standard approximate value iteration procedure. This technique allows us to avoid contraction and monotonicity arguments typically required by existing analyses of approximate dynamic programming methods, and in particular to use approximate transition functions estimated via least-squares procedures in MDPs with linear function approximation. We use our method to recover known guarantees in tabular MDPs and to provide a computationally efficient algorithm for learning near-optimal policies in discounted linear mixture MDPs from a single stream of experience, and show it achieves near-optimal statistical guarantees.
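As a rough, hypothetical illustration of the general idea stated in the abstract (regularizing the updates of an otherwise standard approximate value iteration procedure, combined with optimism), the sketch below runs value iteration on a small random tabular MDP, replacing the hard max over actions with a log-sum-exp ("soft") maximum and adding a fixed bonus to the reward estimates. This is not the paper's algorithm: the MDP, the bonus, the temperature eta, and the choice of regularizer are all placeholder assumptions.

```python
import numpy as np

# Hypothetical sketch: regularized, optimistic value iteration on a random
# tabular MDP. Not the paper's algorithm; all quantities are made up.

rng = np.random.default_rng(0)
S, A, gamma = 5, 3, 0.9                       # states, actions, discount
P = rng.dirichlet(np.ones(S), size=(S, A))    # transition kernel P[s, a, s']
r = rng.uniform(size=(S, A))                  # estimated rewards r[s, a]
bonus = 0.05 * np.ones((S, A))                # optimism bonus (placeholder)
eta = 10.0                                    # regularization temperature

V = np.zeros(S)
for _ in range(500):
    Q = r + bonus + gamma * P @ V             # optimistic Q-values, shape (S, A)
    # Regularized Bellman backup: numerically stable soft maximum over actions
    # instead of a hard max.
    Qmax = Q.max(axis=1)
    V_new = Qmax + np.log(np.exp(eta * (Q - Qmax[:, None])).sum(axis=1)) / eta
    if np.max(np.abs(V_new - V)) < 1e-8:      # stop once the backup has converged
        break
    V = V_new

# Softmax policy induced by the regularized optimistic Q-values.
policy = np.exp(eta * (Q - Q.max(axis=1, keepdims=True)))
policy /= policy.sum(axis=1, keepdims=True)
print("V:", np.round(V, 3))
print("policy:", np.round(policy, 3))
```

The soft maximum used here is only one possible regularizer; the point of the sketch is that the backup remains a simple fixed-point iteration while the hard, non-smooth max is replaced by a smoothed surrogate, which is the flavor of update the abstract describes.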

Authors (2)
  1. Antoine Moulin (6 papers)
  2. Gergely Neu (52 papers)
Citations (3)
