A Reductions Approach to Risk-Sensitive Reinforcement Learning with Optimized Certainty Equivalents (2403.06323v2)
Abstract: We study risk-sensitive RL, where the goal is to learn a history-dependent policy that optimizes some risk measure of cumulative rewards. We consider a family of risks called the optimized certainty equivalents (OCE), which captures important risk measures such as conditional value-at-risk (CVaR), entropic risk, and Markowitz's mean-variance. In this setting, we propose two meta-algorithms: one grounded in optimism and another based on policy gradients, both of which can leverage the broad suite of risk-neutral RL algorithms in an augmented Markov Decision Process (MDP). Via a reductions approach, we leverage theory for risk-neutral RL to establish novel OCE bounds in complex, rich-observation MDPs. For the optimism-based algorithm, we prove bounds that generalize prior results in CVaR RL and that provide the first risk-sensitive bounds for exogenous block MDPs. For the gradient-based algorithm, we establish both monotone improvement and global convergence guarantees under a discrete reward assumption. Finally, we empirically show that our algorithms learn the optimal history-dependent policy in a proof-of-concept MDP, where all Markovian policies provably fail.
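For orientation, the OCE family has a standard variational form (due to Ben-Tal and Teboulle): OCE_u(X) = sup_λ { λ + E[u(X − λ)] } for a concave, non-decreasing utility u with u(0) = 0, and particular choices of u recover CVaR, entropic risk, and mean-variance. The sketch below is a minimal illustration of this formula on simulated returns; the function names, grid-search maximization, and sample data are our own assumptions for exposition and are not taken from the paper's algorithms.

```python
# Illustrative sketch (not the paper's method): empirical OCE via the
# variational formula OCE_u(X) = sup_lambda { lambda + E[u(X - lambda)] }.
# With u(t) = -max(-t, 0) / alpha this recovers CVaR_alpha of the returns.
import numpy as np

def empirical_oce(returns, u, grid):
    """Maximize lambda + mean(u(returns - lambda)) over a grid of lambda values."""
    vals = [lam + np.mean(u(returns - lam)) for lam in grid]
    return max(vals)

def cvar_utility(alpha):
    # Utility whose OCE equals CVaR at level alpha (lower tail of rewards).
    return lambda t: -np.maximum(-t, 0.0) / alpha

rng = np.random.default_rng(0)
returns = rng.normal(loc=1.0, scale=0.5, size=10_000)      # simulated cumulative rewards
grid = np.linspace(returns.min(), returns.max(), 2_000)      # candidate lambda values
cvar_estimate = empirical_oce(returns, cvar_utility(0.1), grid)
print(f"Empirical CVaR_0.1 estimate: {cvar_estimate:.3f}")
```

Swapping in u(t) = (1 − exp(−βt))/β would instead estimate the entropic risk, which is the sense in which a single OCE-based algorithm covers several risk measures at once.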