Long-term Safe Reinforcement Learning with Binary Feedback (2401.03786v2)

Published 8 Jan 2024 in cs.LG, cs.AI, and cs.RO

Abstract: Safety is an indispensable requirement for applying reinforcement learning (RL) to real problems. Although there has been a surge of safe RL algorithms proposed in recent years, most existing work typically 1) relies on receiving numeric safety feedback; 2) does not guarantee safety during the learning process; 3) limits the problem to a priori known, deterministic transition dynamics; and/or 4) assumes the existence of a known safe policy for all states. To address these issues, we propose Long-term Binary-feedback Safe RL (LoBiSaRL), a safe RL algorithm for constrained Markov decision processes (CMDPs) with binary safety feedback and an unknown, stochastic state transition function. LoBiSaRL optimizes a policy to maximize rewards while guaranteeing long-term safety, namely that the agent executes only safe state-action pairs throughout each episode with high probability. Specifically, LoBiSaRL models the binary safety function via a generalized linear model (GLM) and conservatively takes only safe actions at every time step while inferring their effect on future safety under proper assumptions. Our theoretical results show that LoBiSaRL satisfies the long-term safety constraint with high probability. Finally, our empirical results demonstrate that our algorithm is safer than existing methods without significantly compromising performance in terms of reward.
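The mechanism sketched in the abstract — a GLM over binary safety feedback combined with conservative, pessimism-based action selection — can be illustrated with a short sketch. This is not the authors' implementation: the logistic form of the safety model, the confidence-bonus construction, and all names (fit_safety_glm, pessimistic_safety, conservative_action, the threshold value) are assumptions made purely for illustration.

```python
# Minimal illustrative sketch (not the paper's algorithm): fit a logistic GLM to
# binary safe/unsafe labels, then act only when a pessimistic (lower-confidence)
# estimate of the safety probability clears a threshold.
import numpy as np


def fit_safety_glm(features, safe_labels, l2=1.0, iters=200, lr=0.1):
    """Fit logistic-regression weights theta for P(safe | phi(s, a)) by gradient ascent."""
    theta = np.zeros(features.shape[1])
    for _ in range(iters):
        p = 1.0 / (1.0 + np.exp(-features @ theta))            # predicted safety probabilities
        grad = features.T @ (safe_labels - p) - l2 * theta     # regularized log-likelihood gradient
        theta += lr * grad / len(safe_labels)
    return theta


def pessimistic_safety(theta, phi, cov_inv, beta):
    """Lower confidence bound on P(safe): shrink the logit by an ellipsoidal bonus."""
    bonus = beta * np.sqrt(phi @ cov_inv @ phi)                # width of the confidence set
    return 1.0 / (1.0 + np.exp(-(phi @ theta - bonus)))


def conservative_action(theta, cov_inv, candidate_phis, rewards, beta=1.0, threshold=0.95):
    """Among actions whose pessimistic safety exceeds the threshold, pick the highest reward.

    Falls back to the pessimistically safest action if no candidate clears the threshold.
    """
    lcbs = np.array([pessimistic_safety(theta, phi, cov_inv, beta) for phi in candidate_phis])
    safe_idx = np.where(lcbs >= threshold)[0]
    if len(safe_idx) == 0:
        return int(np.argmax(lcbs))                            # no provably safe action: take the safest
    return int(safe_idx[np.argmax(rewards[safe_idx])])


# Toy usage with 2-d features and three candidate actions.
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 2))
y = (X[:, 0] > 0).astype(float)                 # pretend "safe" correlates with the first feature
theta = fit_safety_glm(X, y)
cov_inv = np.linalg.inv(X.T @ X + np.eye(2))
action = conservative_action(theta, cov_inv,
                             np.array([[1.0, 0.2], [0.1, 1.0], [-1.0, 0.0]]),
                             rewards=np.array([1.0, 3.0, 5.0]))
```

The point mirrored from the abstract is the pessimism: the confidence bonus is subtracted from the logit, so an action is executed only when even a pessimistic estimate of its safety probability is high.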

Authors (3)
  1. Akifumi Wachi (20 papers)
  2. Wataru Hashimoto (9 papers)
  3. Kazumune Hashimoto (26 papers)
Citations (1)
