Long-term Safe Reinforcement Learning with Binary Feedback (2401.03786v2)
Abstract: Safety is an indispensable requirement for applying reinforcement learning (RL) to real problems. Although there has been a surge of safe RL algorithms proposed in recent years, most existing work typically 1) relies on receiving numeric safety feedback; 2) does not guarantee safety during the learning process; 3) limits the problem to a priori known, deterministic transition dynamics; and/or 4) assumes the existence of a known safe policy for any state. To address these issues, we propose Long-term Binary-feedback Safe RL (LoBiSaRL), a safe RL algorithm for constrained Markov decision processes (CMDPs) with binary safety feedback and an unknown, stochastic state transition function. LoBiSaRL optimizes a policy to maximize rewards while guaranteeing long-term safety, i.e., that the agent executes only safe state-action pairs throughout each episode with high probability. Specifically, LoBiSaRL models the binary safety function via a generalized linear model (GLM) and conservatively takes only a safe action at every time step while inferring its effect on future safety under proper assumptions. Our theoretical results show that LoBiSaRL guarantees the long-term safety constraint with high probability. Finally, our empirical results demonstrate that our algorithm is safer than existing methods without significantly compromising performance in terms of reward.
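The core mechanism described in the abstract, fitting a GLM (here, logistic) model to binary safety labels and acting only on actions whose pessimistic safety estimate is high enough, can be illustrated with a minimal sketch. This is not the paper's actual algorithm or confidence bound: the feature map `phi`, the margin `beta`, the threshold `delta`, and the fallback rule are illustrative assumptions.

```python
# Minimal sketch (assumptions flagged in comments): GLM safety model with a
# conservative action filter, loosely following the idea in the abstract.
import numpy as np
from sklearn.linear_model import LogisticRegression

def phi(state, action):
    """Hypothetical feature map for a state-action pair."""
    return np.concatenate([state, action])

class GLMSafetyModel:
    def __init__(self):
        self.clf = LogisticRegression()
        self.X, self.y = [], []

    def update(self, state, action, safe_label):
        """Record a binary safety observation (1 = safe, 0 = unsafe) and refit."""
        self.X.append(phi(state, action))
        self.y.append(int(safe_label))
        if len(set(self.y)) > 1:  # need both classes before fitting
            self.clf.fit(np.array(self.X), np.array(self.y))

    def pessimistic_safety(self, state, action, beta=0.1):
        """Lower-confidence estimate of P(safe | s, a).
        `beta` is a crude stand-in for a GLM confidence width, not the
        paper's bonus term."""
        if len(set(self.y)) < 2:
            return 0.0  # no information yet: treat as unsafe
        p = self.clf.predict_proba(phi(state, action).reshape(1, -1))[0, 1]
        return max(0.0, p - beta)

def conservative_action(model, state, candidate_actions, reward_fn, delta=0.95):
    """Pick the highest-reward action whose pessimistic safety estimate clears
    `delta`; otherwise fall back to the action judged safest (an assumption,
    since the paper instead reasons about future safety)."""
    safe = [a for a in candidate_actions
            if model.pessimistic_safety(state, a) >= delta]
    if safe:
        return max(safe, key=lambda a: reward_fn(state, a))
    return max(candidate_actions, key=lambda a: model.pessimistic_safety(state, a))
```

The sketch captures only the single-step "act conservatively on a pessimistic GLM estimate" idea; LoBiSaRL additionally reasons about how the current action affects the existence of safe actions at future time steps, which is not modeled here.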
Authors:
- Akifumi Wachi
- Wataru Hashimoto
- Kazumune Hashimoto