Span-Based Optimal Sample Complexity for Weakly Communicating and General Average Reward MDPs (2403.11477v2)

Published 18 Mar 2024 in cs.LG, cs.IT, math.IT, math.OC, and stat.ML

Abstract: We study the sample complexity of learning an $\varepsilon$-optimal policy in an average-reward Markov decision process (MDP) under a generative model. For weakly communicating MDPs, we establish the complexity bound $\widetilde{O}(SA\frac{H}{\varepsilon^2})$, where $H$ is the span of the bias function of the optimal policy and $SA$ is the cardinality of the state-action space. Our result is the first that is minimax optimal (up to log factors) in all parameters $S, A, H$, and $\varepsilon$, improving on existing work that either assumes uniformly bounded mixing times for all policies or has suboptimal dependence on the parameters. We also initiate the study of sample complexity in general (multichain) average-reward MDPs. We argue a new transient time parameter $B$ is necessary, establish an $\widetilde{O}(SA\frac{B + H}{\varepsilon^2})$ complexity bound, and prove a matching (up to log factors) minimax lower bound. Both results are based on reducing the average-reward MDP to a discounted MDP, which requires new ideas in the general setting. To optimally analyze this reduction, we develop improved bounds for $\gamma$-discounted MDPs, showing that $\widetilde{O}(SA\frac{H}{(1-\gamma)^2\varepsilon^2})$ and $\widetilde{O}(SA\frac{B + H}{(1-\gamma)^2\varepsilon^2})$ samples suffice to learn $\varepsilon$-optimal policies in weakly communicating and in general MDPs, respectively. Both these results circumvent the well-known minimax lower bound of $\widetilde{\Omega}(SA\frac{1}{(1-\gamma)^3\varepsilon^2})$ for $\gamma$-discounted MDPs, and establish a quadratic rather than cubic horizon dependence for a fixed MDP instance.
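
To make the reduction concrete, here is a minimal sketch (an illustration, not the authors' algorithm): draw a fixed number of generative-model samples per state-action pair, form the empirical transition model, pick a discount factor whose effective horizon $1/(1-\gamma)$ scales like $H/\varepsilon$, and run value iteration on the resulting discounted MDP. The function names, the particular choice of $\gamma$, and the use of plain value iteration are assumptions made for illustration; the paper's analysis uses a more refined reduction and sample-size choices.

```python
import numpy as np

def solve_discounted_mdp(P, R, gamma, iters=10_000, tol=1e-8):
    """Value iteration on a gamma-discounted MDP.
    P: transition tensor of shape (S, A, S); R: reward matrix of shape (S, A)."""
    S, A, _ = P.shape
    V = np.zeros(S)
    Q = np.zeros((S, A))
    for _ in range(iters):
        Q = R + gamma * (P @ V)      # Bellman backup; (S, A, S) @ (S,) -> (S, A)
        V_new = Q.max(axis=1)
        if np.max(np.abs(V_new - V)) < tol:
            V = V_new
            break
        V = V_new
    return Q.argmax(axis=1), V       # greedy policy and value estimate

def policy_via_discount_reduction(sample_next_state, R, S, A, span_H, eps, n_per_sa):
    """Illustrative average-to-discounted reduction under a generative model.
    sample_next_state(s, a) is assumed to return s' drawn from P(. | s, a)."""
    # Empirical transition model from n_per_sa generative samples per (s, a).
    P_hat = np.zeros((S, A, S))
    for s in range(S):
        for a in range(A):
            for _ in range(n_per_sa):
                P_hat[s, a, sample_next_state(s, a)] += 1.0
            P_hat[s, a] /= n_per_sa
    # Illustrative discount choice: effective horizon 1/(1 - gamma) on the order
    # of span_H / eps (the paper tunes gamma and the constants more carefully).
    gamma = 1.0 - eps / (span_H + eps)
    policy, _ = solve_discounted_mdp(P_hat, R, gamma)
    return policy
```

Roughly speaking, pairing $1-\gamma \approx \varepsilon/H$ with a discounted accuracy of order $\varepsilon/(1-\gamma)$ is what lets the quadratic-horizon discounted bound above collapse to the $\widetilde{O}(SA\frac{H}{\varepsilon^2})$ average-reward rate.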

Authors (2)
  1. Matthew Zurek (10 papers)
  2. Yudong Chen (104 papers)
