A safe exploration approach to constrained Markov decision processes (2312.00561v2)
Abstract: We consider discounted infinite-horizon constrained Markov decision processes (CMDPs) where the goal is to find an optimal policy that maximizes the expected cumulative reward subject to expected cumulative constraints. Motivated by the application of CMDPs in online learning of safety-critical systems, we focus on developing a model-free and simulator-free algorithm that ensures constraint satisfaction during learning. To this end, we develop an interior point approach based on the log barrier function of the CMDP. Under the commonly assumed conditions of Fisher non-degeneracy and bounded transfer error of the policy parameterization, we establish the theoretical properties of the algorithm. In particular, in contrast to existing CMDP approaches that ensure policy feasibility only upon convergence, our algorithm guarantees the feasibility of the policies during the learning process and converges to the $\varepsilon$-optimal policy with a sample complexity of $\tilde{\mathcal{O}}(\varepsilon^{-6})$. In comparison to the state-of-the-art policy gradient-based algorithm, C-NPG-PDA, our algorithm requires an additional $\mathcal{O}(\varepsilon^{-2})$ samples to ensure policy feasibility during learning with the same Fisher non-degenerate parameterization.
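To make the interior point idea concrete, the sketch below illustrates a log barrier surrogate on a tiny tabular CMDP. It is not the paper's algorithm: the paper's method is model-free, simulator-free, and sample-based, whereas here the transition kernel P, reward r, constraint utility g, threshold b, and barrier weight tau are illustrative assumptions and all values are computed exactly. The surrogate $V_r^{\pi_\theta}(\rho) + \tau \log\big(V_g^{\pi_\theta}(\rho) - b\big)$ is maximized by gradient ascent; because the barrier diverges at the constraint boundary, every iterate remains strictly feasible, which is the property the paper establishes in the sample-based setting.

```python
# Minimal sketch of a log barrier (interior point) surrogate for a CMDP.
# NOT the paper's model-free algorithm: P, r, g, b, and tau below are
# illustrative assumptions, and values are computed exactly rather than
# estimated from samples.
import numpy as np

n_states, n_actions, gamma = 3, 2, 0.9
rng = np.random.default_rng(0)
P = rng.dirichlet(np.ones(n_states), size=(n_states, n_actions))  # P[s, a, s']
r = rng.uniform(size=(n_states, n_actions))    # reward
g = rng.uniform(size=(n_states, n_actions))    # constraint utility
b = 2.0                                        # feasibility: V_g(rho) >= b
rho = np.ones(n_states) / n_states             # initial state distribution
tau = 0.1                                      # barrier weight

def policy(theta):
    """Softmax policy over actions for each state."""
    z = np.exp(theta - theta.max(axis=1, keepdims=True))
    return z / z.sum(axis=1, keepdims=True)

def value(theta, c):
    """Exact discounted value V_c(rho) of the softmax policy for payoff c."""
    pi = policy(theta)
    P_pi = np.einsum("sa,sat->st", pi, P)      # state-to-state kernel under pi
    c_pi = (pi * c).sum(axis=1)                # expected per-state payoff
    v = np.linalg.solve(np.eye(n_states) - gamma * P_pi, c_pi)
    return rho @ v

def barrier_objective(theta):
    """Log barrier surrogate: V_r(rho) + tau * log(V_g(rho) - b)."""
    slack = value(theta, g) - b
    return -np.inf if slack <= 0 else value(theta, r) + tau * np.log(slack)

# Gradient ascent on the surrogate with finite-difference gradients and a
# crude backtracking rule: the step is halved until it stays in the strict
# interior, so every iterate remains feasible throughout learning.
theta, lr, eps = np.zeros((n_states, n_actions)), 0.1, 1e-5
for _ in range(300):
    grad = np.zeros_like(theta)
    for idx in np.ndindex(*theta.shape):
        d = np.zeros_like(theta)
        d[idx] = eps
        grad[idx] = (barrier_objective(theta + d) - barrier_objective(theta - d)) / (2 * eps)
    step = lr
    while not np.isfinite(barrier_objective(theta + step * grad)):
        step *= 0.5
    theta += step * grad

print("V_r(rho):", value(theta, r), " slack V_g(rho) - b:", value(theta, g) - b)
```

In the standard interior point picture, shrinking the barrier weight moves the surrogate maximizer toward the constrained optimum while the log term keeps every intermediate policy strictly feasible.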
- Constrained policy optimization. In International Conference on Machine Learning, pages 22–31. PMLR, 2017.
- On the theory of policy gradient methods: Optimality, approximation, and distribution shift. J. Mach. Learn. Res., 22(98):1–76, 2021.
- Regret guarantees for model-based reinforcement learning with long-term average constraints. In Uncertainty in Artificial Intelligence, pages 22–31. PMLR, 2022.
- E. Altman. Constrained Markov decision processes: stochastic modeling. Routledge, 1999.
- A survey of robot learning from demonstration. Robotics and Autonomous Systems, 57(5):469–483, 2009.
- Minimax regret bounds for reinforcement learning. In International Conference on Machine Learning, pages 263–272. PMLR, 2017.
- Achieving zero constraint violation for constrained reinforcement learning via primal-dual approach. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 36, pages 3682–3689, 2022.
- Achieving zero constraint violation for constrained reinforcement learning via conservative natural policy gradient primal-dual algorithm. arXiv preprint arXiv:2206.05850, 2022.
- J. Baxter and P. L. Bartlett. Infinite-horizon policy-gradient estimation. Journal of Artificial Intelligence Research, 15:319–350, 2001.
- Safe model-based reinforcement learning with stability guarantees. Advances in Neural Information Processing Systems, 30, 2017.
- Dota 2 with large scale deep reinforcement learning. arXiv preprint arXiv:1912.06680, 2019.
- Conservative safety critics for exploration. arXiv preprint arXiv:2010.14497, 2020.
- J. V. Burke. Numerical optimization. Course notes, AMATH/MATH 516, 2012.
- End-to-end safe reinforcement learning through barrier functions for safety-critical continuous control tasks. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 33, pages 3387–3395, 2019.
- A Lyapunov-based approach to safe reinforcement learning. Advances in Neural Information Processing Systems, 31, 2018.
- Lyapunov-based safe policy optimization for continuous control. arXiv preprint arXiv:1901.10031, 2019.
- Natural policy gradient primal-dual method for constrained Markov decision processes. Advances in Neural Information Processing Systems, 33:8378–8390, 2020.
- Convergence and sample complexity of natural policy gradient primal-dual methods for constrained MDPs. arXiv preprint arXiv:2206.02346, 2022.
- On the global optimum convergence of momentum-based policy gradient. In International Conference on Artificial Intelligence and Statistics, pages 1910–1934. PMLR, 2022.
- Stochastic policy gradient methods: Improved sample complexity for Fisher-non-degenerate policies, 2023.
- Global convergence of policy gradient methods for the linear quadratic regulator. In International Conference on Machine Learning, pages 1467–1476. PMLR, 2018.
- A general safety framework for learning-based control in uncertain robotic systems. IEEE Transactions on Automatic Control, 64(7):2737–2752, 2018.
- J. García and F. Fernández. A comprehensive survey on safe reinforcement learning. Journal of Machine Learning Research, 16(1):1437–1480, 2015.
- A review of safe reinforcement learning: Methods, theory and applications. arXiv preprint arXiv:2205.10330, 2022.
- H. Günzel and H. Th. Jongen. Strong stability implies Mangasarian–Fromovitz constraint qualification. Optimization, 55(5-6):605–610, 2006.
- Risk-aware motion planning and control using CVaR-constrained optimization. IEEE Robotics and Automation Letters, 4(4):3924–3931, 2019.
- Model-based reinforcement learning for infinite-horizon discounted constrained Markov decision processes. In International Joint Conference on Artificial Intelligence (IJCAI), 2021.
- Safe reinforcement learning using probabilistic shields. 2020.
- A. K. Jayant and S. Bhatnagar. Model-based safe deep reinforcement learning via a constrained proximal policy optimization algorithm. Advances in Neural Information Processing Systems, 35:24432–24445, 2022.
- Contextual decision processes with low Bellman rank are PAC-learnable. In D. Precup and Y. W. Teh, editors, Proceedings of the 34th International Conference on Machine Learning, volume 70 of Proceedings of Machine Learning Research, pages 1704–1713. PMLR, 06–11 Aug 2017.
- Provably efficient reinforcement learning with linear function approximation. In Conference on Learning Theory, pages 2137–2143. PMLR, 2020.
- Safe posterior sampling for constrained MDPs with bounded constraint violation. arXiv preprint arXiv:2301.11547, 2023.
- Deep constrained Q-learning. arXiv preprint arXiv:2003.09398, 2020.
- Reinforcement learning in robotics: A survey. The International Journal of Robotics Research, 32(11):1238–1274, 2013.
- J. M. Kohler and A. Lucchi. Sub-sampled cubic regularization for non-convex optimization. In International Conference on Machine Learning, pages 1895–1904. PMLR, 2017.
- Learning-based model predictive control for safe exploration. In 2018 IEEE Conference on Decision and Control (CDC), pages 6059–6066. IEEE, 2018.
- R. Koppejan and S. Whiteson. Neuroevolutionary reinforcement learning for generalized control of simulated helicopters. Evolutionary Intelligence, 4:219–241, 2011.
- T. Lattimore and M. Hutter. PAC bounds for discounted MDPs. In International Conference on Algorithmic Learning Theory, pages 320–334. Springer, 2012.
- AI safety gridworlds. arXiv preprint arXiv:1711.09883, 2017.
- Accelerated primal-dual policy optimization for safe reinforcement learning. arXiv preprint arXiv:1802.06480, 2018.
- Learning policies with zero or bounded constraint violation for constrained MDPs. Advances in Neural Information Processing Systems, 34:17183–17193, 2021.
- Policy optimization for constrained MDPs with provable fast global convergence, 2022.
- IPO: Interior-point policy optimization under constraints. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 34, pages 4940–4947, 2020.
- Policy learning with constraints in model-free reinforcement learning: A survey. In IJCAI, pages 4508–4515, 2021.
- An improved analysis of (variance-reduced) policy gradient and natural policy gradient methods. Advances in Neural Information Processing Systems, 33:7624–7636, 2020.
- O. L. Mangasarian and S. Fromovitz. The Fritz John necessary optimality conditions in the presence of equality and inequality constraints. Journal of Mathematical Analysis and Applications, 17(1):37–47, 1967.
- Stochastic second-order methods improve best-known sample complexity of SGD for gradient-dominated functions. Advances in Neural Information Processing Systems, 35:10862–10875, 2022.
- High-level decision making for safe and reasonable autonomous lane changing using reinforcement learning. In 2018 21st International Conference on Intelligent Transportation Systems (ITSC), pages 2156–2162. IEEE, 2018.
- Human-level control through deep reinforcement learning. Nature, 518(7540):529–533, 2015.
- Chance-constrained dynamic programming with application to risk-aware robotic space exploration. Autonomous Robots, 39:555–571, 2015.
- M. L. Puterman. Markov decision processes: discrete stochastic dynamic programming. John Wiley & Sons, 2014.
- Mastering the game of Go with deep neural networks and tree search. Nature, 529(7587):484–489, 2016.
- A general reinforcement learning algorithm that masters chess, shogi, and Go through self-play. Science, 362(6419):1140–1144, 2018.
- Mastering the game of Go without human knowledge. Nature, 550(7676):354–359, 2017.
- Reinforcement learning: An introduction. MIT Press, 2018.
- Policy gradient methods for reinforcement learning with function approximation. Advances in Neural Information Processing Systems, 12, 1999.
- Reward constrained policy optimization. In International Conference on Learning Representations, 2019.
- Log barriers for safe black-box optimization with application to safe reinforcement learning. arXiv preprint arXiv:2207.10415, 2022.
- Safe reinforcement learning for emergency load shedding of power systems. In 2021 IEEE Power & Energy Society General Meeting (PESGM), pages 1–5, 2021.
- Safe exploration and optimization of constrained MDPs using Gaussian processes. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 32, 2018.
- Neural policy gradient methods: Global optimality and rates of convergence. In International Conference on Learning Representations, 2020.
- A provably-efficient model-free algorithm for infinite-horizon average-reward constrained Markov decision processes. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 36, pages 3868–3876, 2022.
- Triple-Q: A model-free algorithm for constrained reinforcement learning with sublinear regret and zero constraint violation. In International Conference on Artificial Intelligence and Statistics, pages 3274–3307. PMLR, 2022.
- R. J. Williams. Simple statistical gradient-following algorithms for connectionist reinforcement learning. Reinforcement learning, pages 5–32, 1992.
- Sample efficient policy gradient methods with recursive variance reduction. In International Conference on Learning Representations, 2020.
- Macro action selection with deep reinforcement learning in StarCraft. In Proceedings of the AAAI Conference on Artificial Intelligence and Interactive Digital Entertainment, volume 15, pages 94–99, 2019.
- CRPO: A new approach for safe reinforcement learning with convergence guarantee. In International Conference on Machine Learning, pages 11480–11491. PMLR, 2021.
- L. Yang and M. Wang. Sample-optimal parametric Q-learning using linearly additive features. In International Conference on Machine Learning, pages 6995–7004. PMLR, 2019.
- Projection-based constrained policy optimization. arXiv preprint arXiv:2010.03152, 2020.
- Linear convergence of natural policy gradient methods with log-linear policies. arXiv preprint arXiv:2210.01400, 2022.
- A general sample complexity analysis of vanilla policy gradient. In International Conference on Artificial Intelligence and Statistics, pages 3332–3380. PMLR, 2022.
- CMDP-based intelligent transmission for wireless body area network in remote health monitoring. Neural Computing and Applications, 32:829–837, 2020.
- Finite-time complexity of online primal-dual natural actor-critic algorithm for constrained Markov decision processes. In 2022 IEEE 61st Conference on Decision and Control (CDC), pages 4028–4033. IEEE, 2022.
- Non-cooperative inverse reinforcement learning. Advances in Neural Information Processing Systems, 32, 2019.
- L. Zheng and L. Ratliff. Constrained upper confidence reinforcement learning. In Learning for Dynamics and Control, pages 620–629. PMLR, 2020.