
Posterior Sampling-based Online Learning for Episodic POMDPs (2310.10107v4)

Published 16 Oct 2023 in cs.LG, cs.AI, cs.SY, eess.SY, and stat.ML

Abstract: Learning in POMDPs is known to be significantly harder than in MDPs. In this paper, we consider the online learning problem for episodic POMDPs with unknown transition and observation models. We propose a Posterior Sampling-based reinforcement learning algorithm for POMDPs (PS4POMDPs), which is much simpler and more implementable compared to state-of-the-art optimism-based online learning algorithms for POMDPs. We show that the Bayesian regret of the proposed algorithm scales as the square root of the number of episodes and is polynomial in the other parameters. In a general setting, the regret scales exponentially in the horizon length $H$, and we show that this is inevitable by providing a lower bound. However, when the POMDP is undercomplete and weakly revealing (a common assumption in the recent literature), we establish a polynomial Bayesian regret bound. We finally propose a posterior sampling algorithm for multi-agent POMDPs, and show it too has sublinear regret.
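The abstract describes a posterior-sampling (Thompson-sampling) loop: at the start of each episode, draw a POMDP model from the current posterior over the unknown transition and observation kernels, plan a policy for the sampled model, act with that policy, and update the posterior with the new data. Below is a minimal, illustrative Python sketch of that generic loop, not the paper's exact PS4POMDPs construction. The helper names (`sample_model`, `plan`, `run_episode`), the tabular problem sizes, the Dirichlet/count posterior, the random placeholder planner, and in particular the use of the latent state in the posterior update are all assumptions made only to keep the example short and runnable.

```python
import numpy as np

# Illustrative posterior-sampling loop for an episodic POMDP with finitely
# many states, actions, and observations. Placeholder details throughout;
# not the paper's exact algorithm.

rng = np.random.default_rng(0)

n_states, n_actions, n_obs, horizon, n_episodes = 3, 2, 2, 5, 100

# Dirichlet posteriors over the unknown transition and observation kernels.
trans_counts = np.ones((n_states, n_actions, n_states))   # counts for P(s' | s, a)
obs_counts = np.ones((n_states, n_obs))                    # counts for P(o | s)

def sample_model(rng):
    """Draw one POMDP model (transition + observation kernels) from the posterior."""
    T = np.array([[rng.dirichlet(trans_counts[s, a]) for a in range(n_actions)]
                  for s in range(n_states)])
    O = np.array([rng.dirichlet(obs_counts[s]) for s in range(n_states)])
    return T, O

def plan(T, O):
    """Placeholder planner: returns a random policy. A real implementation
    would compute an (approximately) optimal policy for the sampled POMDP,
    e.g. with a point-based or Monte-Carlo POMDP solver."""
    return lambda obs, t: int(rng.integers(n_actions))

def run_episode(policy, true_T, true_O, reward):
    """Interact with the true environment for one episode and update the posterior.
    For brevity the update uses the latent state directly; a faithful POMDP
    learner updates the posterior from the observable action-observation history."""
    s = int(rng.integers(n_states))
    total = 0.0
    for t in range(horizon):
        o = rng.choice(n_obs, p=true_O[s])
        a = policy(o, t)
        s_next = rng.choice(n_states, p=true_T[s, a])
        total += reward[s, a]
        trans_counts[s, a, s_next] += 1
        obs_counts[s, o] += 1
        s = s_next
    return total

# A fixed "true" environment and reward, just to make the example executable.
true_T, true_O = sample_model(rng)
reward = rng.random((n_states, n_actions))

for k in range(n_episodes):
    T_k, O_k = sample_model(rng)                    # 1. sample a model from the posterior
    policy_k = plan(T_k, O_k)                       # 2. plan for the sampled model
    run_episode(policy_k, true_T, true_O, reward)   # 3. act and update the posterior
```

In a faithful implementation the planner would (approximately) solve the sampled POMDP, and the posterior would be maintained over full transition and observation models using only observable histories; the sketch above only conveys the sample-plan-act-update structure of posterior sampling.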

Authors (5)
  1. Dengwang Tang (11 papers)
  2. Rahul Jain (152 papers)
  3. Ashutosh Nayyar (54 papers)
  4. Pierluigi Nuzzo (33 papers)
  5. Dongze Ye (2 papers)
