Online Reinforcement Learning in Markov Decision Process Using Linear Programming (2304.00155v3)

Published 31 Mar 2023 in cs.LG, cs.SY, eess.SY, and math.OC

Abstract: We consider online reinforcement learning in an episodic Markov decision process (MDP) with an unknown transition function and stochastic rewards drawn from a fixed but unknown distribution. The learner aims to learn the optimal policy and minimize its regret over a finite time horizon through interacting with the environment. We devise a simple and efficient model-based algorithm that achieves $\widetilde{O}(LX\sqrt{TA})$ regret with high probability, where $L$ is the episode length, $T$ is the number of episodes, and $X$ and $A$ are the cardinalities of the state space and the action space, respectively. The proposed algorithm, based on the principle of "optimism in the face of uncertainty", maintains confidence sets for the transition and reward functions and uses occupancy measures to connect the online MDP to linear programming. It achieves a tighter regret bound than existing works that use a similar confidence-set framework, and it requires less computational effort than works that use a different framework but achieve a slightly tighter regret bound.
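To make the occupancy-measure connection to linear programming concrete, here is a minimal Python sketch, not the paper's exact algorithm: it solves the occupancy-measure LP for an episodic MDP given point estimates of the transition and reward functions. The helper name solve_occupancy_lp and the inputs P_hat, r_hat, and mu0 are illustrative assumptions; the paper's optimistic algorithm additionally enlarges the feasible set with confidence-set constraints on transitions and rewards, which this sketch omits.

```python
# Minimal sketch (assumed setup, not the paper's algorithm): solve the
# occupancy-measure LP for an episodic MDP with estimated dynamics.
import numpy as np
from scipy.optimize import linprog

def solve_occupancy_lp(P_hat, r_hat, mu0, L):
    """P_hat: (X, A, X) estimated transition kernel, r_hat: (X, A) estimated
    mean rewards, mu0: (X,) initial-state distribution, L: episode length."""
    X, A, _ = P_hat.shape
    n = L * X * A                              # one variable q(l, x, a)
    idx = lambda l, x, a: (l * X + x) * A + a  # flat index of q(l, x, a)

    # Objective: maximize expected reward; linprog minimizes, so negate.
    c = np.zeros(n)
    for l in range(L):
        for x in range(X):
            for a in range(A):
                c[idx(l, x, a)] = -r_hat[x, a]

    A_eq = np.zeros((L * X, n))
    b_eq = np.zeros(L * X)
    # Layer 0: occupancy matches the initial-state distribution.
    for x in range(X):
        for a in range(A):
            A_eq[x, idx(0, x, a)] = 1.0
        b_eq[x] = mu0[x]
    # Layers 1..L-1: flow conservation under the estimated kernel P_hat.
    for l in range(1, L):
        for x2 in range(X):
            row = l * X + x2
            for a2 in range(A):
                A_eq[row, idx(l, x2, a2)] = 1.0
            for x in range(X):
                for a in range(A):
                    A_eq[row, idx(l - 1, x, a)] -= P_hat[x, a, x2]

    res = linprog(c, A_eq=A_eq, b_eq=b_eq, bounds=(0, None), method="highs")
    return res.x.reshape(L, X, A)  # q[l, x, a]
```

The returned occupancy measure induces a policy by normalization: at layer $l$ and state $x$, action $a$ is played with probability $q(l,x,a)/\sum_{a'} q(l,x,a')$ whenever the denominator is positive.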

