Continuous-time Risk-sensitive Reinforcement Learning via Quadratic Variation Penalty (2404.12598v1)
Abstract: This paper studies continuous-time risk-sensitive reinforcement learning (RL) under the entropy-regularized, exploratory diffusion process formulation with an exponential-form objective. The risk-sensitive objective arises either from the agent's risk attitude or from a distributionally robust approach to model uncertainty. Owing to the martingale perspective of Jia and Zhou (2023), the risk-sensitive RL problem is shown to be equivalent to ensuring the martingale property of a process involving both the value function and the q-function, augmented by an additional penalty term: the quadratic variation of the value process, which captures the variability of the value-to-go along the trajectory. This characterization allows existing RL algorithms developed for non-risk-sensitive settings to be adapted straightforwardly to incorporate risk sensitivity by adding the realized variance of the value process. Moreover, I highlight that the conventional policy gradient representation is inadequate for risk-sensitive problems because of the nonlinear nature of quadratic variation, whereas q-learning offers a solution and extends to infinite-horizon settings. Finally, I prove the convergence of the proposed algorithm for Merton's investment problem and quantify the impact of the temperature parameter on the behavior of the learning procedure. I also conduct simulation experiments to demonstrate how risk-sensitive RL improves finite-sample performance in the linear-quadratic control problem.
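To make the "add the realized variance of the value process" recipe concrete, here is a minimal Python sketch under strong simplifying assumptions. It is not the paper's algorithm: the parametric value function `value_fn`, the risk-sensitivity coefficient `beta`, the squared-residual proxy loss, and the finite-difference gradient step are all illustrative placeholders. It only shows how an Euler discretization turns the quadratic-variation penalty into a per-step `(dV)^2` term added to the usual temporal-difference residual.

```python
import numpy as np

# Illustrative sketch only: discretize the martingale condition on a simulated
# path and add the realized variance of the value process, an Euler proxy for
# the quadratic-variation penalty. All names and choices below are assumptions.

def value_fn(theta, states):
    # toy parametric value function V_theta(x), quadratic in the state
    return theta[0] + theta[1] * states + theta[2] * states ** 2

def penalized_td_residuals(theta, states, rewards, beta, dt):
    """Per-step residuals dV + r*dt + (beta/2)*(dV)^2.

    With beta = 0 this is the usual (risk-neutral) temporal-difference
    residual; the extra (dV)^2 term is the realized quadratic variation of
    the value process, which encodes risk sensitivity.
    """
    V = value_fn(theta, states)
    dV = np.diff(V)                      # increments of the value process
    qv = dV ** 2                         # realized quadratic variation per step
    return dV + rewards[:-1] * dt + 0.5 * beta * qv

def gradient_step(theta, states, rewards, beta, dt, lr=1e-2, eps=1e-5):
    # simple squared-residual proxy loss, minimized by a crude
    # finite-difference gradient step (for illustration only)
    def loss(th):
        return np.mean(penalized_td_residuals(th, states, rewards, beta, dt) ** 2)
    grad = np.array([(loss(theta + eps * e) - loss(theta - eps * e)) / (2 * eps)
                     for e in np.eye(len(theta))])
    return theta - lr * grad

# Usage on a toy simulated trajectory (Brownian state, quadratic running reward).
rng = np.random.default_rng(0)
dt, n = 0.01, 400
states = np.cumsum(rng.normal(scale=np.sqrt(dt), size=n))
rewards = -states ** 2
theta = np.zeros(3)
for _ in range(1000):
    theta = gradient_step(theta, states, rewards, beta=1.0, dt=dt)
print("fitted value-function coefficients:", theta)
```

Setting `beta = 0` in this sketch recovers a risk-neutral fit, so the same loop doubles as a sanity check that the penalty, not the estimator, is what changes the learned value function.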
- Model-based reinforcement learning in continuous environments using real-time constrained optimization. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 29.
- Andradóttir, S. (1995). A stochastic approximation algorithm with varying bounds. Operations Research, 43(6):1037–1048.
- Baird, L. C. (1994). Reinforcement learning in continuous time: Advantage updating. In Proceedings of 1994 IEEE International Conference on Neural Networks (ICNN’94), volume 4, pages 2448–2453. IEEE.
- Risk-sensitive dynamic asset management. Applied Mathematics and Optimization, 39:337–360.
- Distributionally robust mean-variance portfolio selection with Wasserstein distances. Management Science, 68(9):6382–6410.
- Borkar, V. S. (2002). Q-learning for risk-sensitive control. Mathematics of Operations Research, 27(2):294–311.
- General bounds and finite-time improvement for the Kiefer-Wolfowitz stochastic approximation algorithm. Operations Research, 59(5):1211–1224.
- Risk-sensitive and robust decision-making: A CVaR optimization approach. Advances in Neural Information Processing Systems, 28.
- Learning equilibrium mean-variance strategy. Mathematical Finance, 33(4):1166–1212.
- Learning Merton’s strategies in an incomplete market: Recursive entropy regularization and biased Gaussian exploration. arXiv preprint arXiv:2312.11797.
- A dynamic mean-variance analysis for log returns. Management Science, 67(2):1093–1108.
- Risk-sensitive Investment Management, volume 19. World Scientific, Singapore.
- Asymptotic evaluation of certain Markov process expectations for large time. IV. Communications on Pure and Applied Mathematics, 36(2):183–212.
- Doya, K. (2000). Reinforcement learning in continuous time and space. Neural Computation, 12(1):219–245.
- Stochastic differential utility. Econometrica, pages 353–394.
- Robust properties of risk-sensitive control. Mathematics of Control, Signals and Systems, 13:318–332.
- Risk-sensitive soft actor-critic for robust deep reinforcement learning under distribution shifts. arXiv preprint arXiv:2402.09992.
- Substitution, risk aversion, and the temporal behavior of consumption. Econometrica, 57(4):937–969.
- Exponential Bellman equation and improved regret bounds for risk-sensitive reinforcement learning. Advances in Neural Information Processing Systems, 34:20436–20446.
- Risk-sensitive reinforcement learning: Near-optimal risk-sample tradeoff in regret. Advances in Neural Information Processing Systems, 33:22384–22395.
- Risk-sensitive control on an infinite time horizon. SIAM Journal on Control and Optimization, 33(6):1881–1915.
- On stochastic relaxed control for partially observed diffusions. Nagoya Mathematical Journal, 93:71–108.
- Risk-sensitive control and an optimal investment model II. The Annals of Applied Probability, 12(2):730–767.
- Actor-critic learning for mean-field control in continuous time. arXiv preprint arXiv:2303.06993.
- Maxmin expected utility with non-unique prior. Journal of Mathematical Economics, 18(2):141–153.
- Robust portfolio control with stochastic factor dynamics. Operations Research, 61(4):874–893.
- Entropy regularization for mean field games with learning. Mathematics of Operations Research, 47(4):3239–3260.
- Robust control and model uncertainty. American Economic Review, 91(2):60–66.
- Robustness and ambiguity in continuous time. Journal of Economic Theory, 146(3):1195–1223.
- Jacobson, D. (1973). Optimal stochastic linear systems with exponential performance criteria and their relation to deterministic differential games. IEEE Transactions on Automatic Control, 18(2):124–131.
- Jia, Y. and Zhou, X. Y. (2022). Policy evaluation and temporal-difference learning in continuous time and space: A martingale approach. Journal of Machine Learning Research, 23(1):6918–6972.
- Jia, Y. and Zhou, X. Y. (2022). Policy gradient and actor-critic learning in continuous time and space: Theory and algorithms. Journal of Machine Learning Research, 23(1):12603–12652.
- Jia, Y. and Zhou, X. Y. (2023). q-Learning in continuous time. Journal of Machine Learning Research, 24(161):1–61.
- The reinforcement learning Kelly strategy. Quantitative Finance, 22(8):1445–1464.
- Is Q-learning provably efficient? Advances in Neural Information Processing Systems, 31.
- Hamilton-Jacobi deep Q-learning for deterministic continuous-time systems with Lipschitz continuous controls. Journal of Machine Learning Research, 22(206):1–34.
- Reinforcement learning in robotics: A survey. The International Journal of Robotics Research, 32(11):1238–1274.
- Stochastic Approximation and Recursive Algorithms, volume 35. Springer-Verlag, New York, 2 edition.
- Lai, T. L. (2003). Stochastic approximation. The Annals of Statistics, 31(2):391–406.
- Policy iterations for reinforcement learning problems in continuous time and space—Fundamental theory and methods. Automatica, 126:109421.
- Knight on risk and uncertainty. Journal of Political Economy, 95(2):394–406.
- Maenhout, P. J. (2004). Robust portfolio rules and asset pricing. Review of Financial Studies, 17(4):951–983.
- Remarks on risk-sensitive control problems. Applied Mathematics and Optimization, 52:297–310.
- Merton, R. C. (1969). Lifetime portfolio selection under uncertainty: The continuous-time case. The Review of Economics and Statistics, pages 247–257.
- Nagai, H. (1996). Bellman equations of risk-sensitive control. SIAM Journal on Control and Optimization, 34(1):74–101.
- Continuous Martingales and Brownian Motion, volume 293. Springer Science & Business Media, Berlin.
- A stochastic approximation method. The Annals of Mathematical Statistics, pages 400–407.
- A convergence theorem for non negative almost supermartingales and some applications. Optimizing Methods in Statistics, pages 233–257.
- Equivalence between policy gradients and soft Q-learning. arXiv preprint arXiv:1704.06440.
- Skiadas, C. (2003). Robust control and recursive utility. Finance and Stochastics, 7:475–489.
- Sun, Y. (2006). The exact law of large numbers via Fubini extension and characterization of insurable risks. Journal of Economic Theory, 126(1):31–69.
- Optimal scheduling of entropy regulariser for continuous-time linear-quadratic reinforcement learning. SIAM Journal on Control and Optimization, 62(1):135–166.
- Making deep Q-learning methods robust to time discretization. In International Conference on Machine Learning, pages 6096–6104. PMLR.
- Exploratory HJB equations and their convergence. SIAM Journal on Control and Optimization, 60(6):3191–3216.
- Reinforcement learning for continuous-time optimal execution: actor-critic algorithm and error analysis. Available at SSRN 4378950.
- Reinforcement learning in continuous time and space: A stochastic control approach. Journal of Machine Learning Research, 21(198):1–34.
- Continuous-time mean–variance portfolio selection: A reinforcement learning framework. Mathematical Finance, 30(4):1273–1308.
- A finite sample complexity bound for distributionally robust Q-learning. In International Conference on Artificial Intelligence and Statistics, pages 3370–3398. PMLR.
- Continuous-time q-learning for McKean-Vlasov control problems. arXiv preprint arXiv:2306.16208.
- Risk-sensitive Markov decision process and learning under general utility functions. arXiv preprint arXiv:2311.13589.
- Regret bounds for Markov decision processes with recursive optimized certainty equivalents. In International Conference on Machine Learning, pages 38400–38427. PMLR.
- Yong, J. and Zhou, X. Y. (1999). Stochastic Controls: Hamiltonian Systems and HJB Equations. Springer, New York.
- Zhou, X. Y. (1992). On the existence of optimal relaxed controls of stochastic partial differential equations. SIAM Journal on Control and Optimization, 30(2):247–261.
- Yanwei Jia