
Online Reinforcement Learning in Markov Decision Process Using Linear Programming (2304.00155v3)

Published 31 Mar 2023 in cs.LG, cs.SY, eess.SY, and math.OC

Abstract: We consider online reinforcement learning in an episodic Markov decision process (MDP) with an unknown transition function and stochastic rewards drawn from a fixed but unknown distribution. The learner aims to learn the optimal policy and minimize its regret over a finite time horizon by interacting with the environment. We devise a simple and efficient model-based algorithm that achieves $\widetilde{O}(LX\sqrt{TA})$ regret with high probability, where $L$ is the episode length, $T$ is the number of episodes, and $X$ and $A$ are the cardinalities of the state space and the action space, respectively. The proposed algorithm, based on the principle of "optimism in the face of uncertainty", maintains confidence sets for the transition and reward functions and uses occupancy measures to connect the online MDP with linear programming. It achieves a tighter regret bound than existing works that use a similar confidence-set framework, and requires less computational effort than works that use a different framework and attain a slightly tighter regret bound.
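To make the occupancy-measure connection concrete, the following is a minimal sketch of the standard episodic occupancy-measure linear program commonly used in this line of work; the paper's exact construction, including the form of its confidence sets, may differ. Here $q(x,a,l)$ denotes the probability of visiting state $x$ and taking action $a$ at step $l$ of an episode, $\hat{r}$ is an estimated reward function, and $\hat{P}$ is an estimated transition kernel; these symbols are notational assumptions layered on top of the abstract's $X$, $A$, and $L$.

$$\max_{q \ge 0} \; \sum_{l=1}^{L} \sum_{x,a} q(x,a,l)\, \hat{r}(x,a)$$

subject to the flow-conservation and normalization constraints

$$\sum_{a} q(x,a,l) = \sum_{x',a'} \hat{P}(x \mid x',a')\, q(x',a',l-1) \quad \forall x,\; l = 2,\dots,L, \qquad \sum_{x,a} q(x,a,l) = 1 \quad \forall l.$$

Under "optimism in the face of uncertainty", one would replace $\hat{r}$ with an upper confidence bound and let the transition kernel range over a confidence set around $\hat{P}$, turning the problem into an extended LP over state-action-next-state occupancy variables. The policy played in the next episode can then be recovered from the solution as $\pi(a \mid x, l) = q(x,a,l) / \sum_{a'} q(x,a',l)$ whenever the denominator is positive.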

