Risk-Sensitive RL with Optimized Certainty Equivalents via Reduction to Standard RL (2403.06323v1)
Abstract: We study Risk-Sensitive Reinforcement Learning (RSRL) with the Optimized Certainty Equivalent (OCE) risk, which generalizes Conditional Value-at-Risk (CVaR), entropic risk, and Markowitz's mean-variance. Using an augmented Markov Decision Process (MDP), we propose two general meta-algorithms via reductions to standard RL: one based on optimistic algorithms and another based on policy optimization. Our optimistic meta-algorithm generalizes almost all prior RSRL theory with entropic risk or CVaR. Under discrete rewards, our optimistic theory also certifies the first RSRL regret bounds for MDPs with bounded coverability, e.g., exogenous block MDPs. Also under discrete rewards, our policy optimization meta-algorithm enjoys both global convergence and local improvement guarantees in a novel metric that lower bounds the true OCE risk. Finally, we instantiate our framework with PPO, construct an example MDP, and show that it learns the optimal risk-sensitive policy while prior algorithms provably fail.
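For reference, the OCE risk that the abstract builds on is a standard object (Ben-Tal and Teboulle, 2007): for a concave, nondecreasing utility $u$ with $u(0) = 0$ and $1 \in \partial u(0)$,

$$\mathrm{OCE}_u(X) \;=\; \sup_{b \in \mathbb{R}} \Big\{\, b + \mathbb{E}\big[\,u(X - b)\,\big] \Big\}.$$

The special cases named in the abstract follow from standard utility choices: $u(t) = \min(t, 0)/\alpha$ recovers $\mathrm{CVaR}_\alpha$ (the Rockafellar–Uryasev formula), $u(t) = (1 - e^{-\beta t})/\beta$ recovers the entropic risk $-\beta^{-1}\log \mathbb{E}[e^{-\beta X}]$, and $u(t) = t - \tfrac{c}{2}t^2$ recovers the mean-variance criterion $\mathbb{E}[X] - \tfrac{c}{2}\operatorname{Var}(X)$.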
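Because of this variational form, a policy's OCE risk can be estimated from Monte Carlo returns by maximizing the objective over the scalar $b$. Below is a minimal sketch of such an estimator; the function name, the grid search over $b$, and the Gaussian test returns are illustrative choices, not the paper's implementation:

```python
import numpy as np

def oce_from_samples(returns, u, b_grid):
    """Estimate OCE_u(X) = sup_b { b + E[u(X - b)] } from sampled returns,
    approximating the supremum with a grid search over b."""
    returns = np.asarray(returns, dtype=float)
    return max(b + u(returns - b).mean() for b in b_grid)

# CVaR at level alpha corresponds to the utility u(t) = min(t, 0) / alpha.
alpha = 0.1
u_cvar = lambda t: np.minimum(t, 0.0) / alpha

rng = np.random.default_rng(0)
returns = rng.normal(loc=1.0, scale=1.0, size=10_000)  # hypothetical episode returns
b_grid = np.linspace(returns.min(), returns.max(), 512)
print(oce_from_samples(returns, u_cvar, b_grid))  # approx. E[X | X <= VaR_alpha(X)]
```

Per the abstract, the augmented-MDP reduction folds this outer optimization over $b$ into the decision process itself, so that standard RL algorithms (e.g., PPO) can be run unchanged; the sketch above only illustrates the risk functional being optimized, not that reduction.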
Authors: Kaiwen Wang, Dawen Liang, Nathan Kallus, Wen Sun