Continuous-time Risk-sensitive Reinforcement Learning via Quadratic Variation Penalty (2404.12598v1)

Published 19 Apr 2024 in cs.LG, cs.SY, eess.SY, q-fin.CP, and q-fin.PM

Abstract: This paper studies continuous-time risk-sensitive reinforcement learning (RL) under the entropy-regularized, exploratory diffusion process formulation with the exponential-form objective. The risk-sensitive objective arises either from the agent's risk attitude or as a distributionally robust approach against model uncertainty. Owing to the martingale perspective in Jia and Zhou (2023), the risk-sensitive RL problem is shown to be equivalent to ensuring the martingale property of a process involving both the value function and the q-function, augmented by an additional penalty term: the quadratic variation of the value process, which captures the variability of the value-to-go along the trajectory. This characterization allows existing RL algorithms developed for non-risk-sensitive scenarios to be adapted directly to the risk-sensitive setting by adding the realized variance of the value process. Additionally, I highlight that the conventional policy gradient representation is inadequate for risk-sensitive problems due to the nonlinear nature of quadratic variation; however, q-learning offers a solution and extends to infinite horizon settings. Finally, I prove the convergence of the proposed algorithm for Merton's investment problem and quantify the impact of the temperature parameter on the behavior of the learning procedure. I also conduct simulation experiments to demonstrate how risk-sensitive RL improves the finite-sample performance in the linear-quadratic control problem.
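The abstract describes augmenting a martingale-based (TD-style) learning objective with the realized quadratic variation of the value process. The sketch below is a minimal, illustrative rendering of that idea under assumptions not stated on this page: a time-discretized trajectory, placeholder approximators value_fn and q_fn, and a hypothetical penalty weight risk_param. It is not the paper's exact algorithm.

    # Illustrative sketch only: a discretized martingale-style loss in which the
    # usual temporal-difference residual is augmented by the realized quadratic
    # variation of the value process, as the abstract describes for risk-sensitive
    # continuous-time RL. All names and the discretization are assumptions.
    import numpy as np

    def risk_sensitive_loss(states, actions, rewards, dt, value_fn, q_fn, risk_param):
        """Loss over one sampled trajectory of T steps (T+1 states).

        states:     array of shape (T+1, d), discretized state trajectory
        actions:    array of shape (T, m)
        rewards:    array of shape (T,), running rewards along the trajectory
        dt:         time step of the discretization
        risk_param: weight on the quadratic-variation penalty (risk sensitivity)
        """
        V = np.array([value_fn(x) for x in states])              # value-to-go estimates
        dV = np.diff(V)                                           # increments of the value process
        q = np.array([q_fn(x, a) for x, a in zip(states, actions)])

        # Martingale (TD-style) residuals: increments of the value process plus
        # the reward and q-function terms should average to zero along the path.
        residuals = dV + (rewards - q) * dt

        # Realized quadratic variation of the value process, penalizing
        # variability of the value-to-go along the trajectory.
        realized_qv = np.sum(dV ** 2)

        return np.mean(residuals ** 2) + 0.5 * risk_param * realized_qv

Setting risk_param to zero recovers a plain martingale-loss objective in this sketch; a positive weight penalizes trajectories whose value-to-go fluctuates strongly, which is the risk-sensitive modification the abstract refers to.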

References (63)
  1. Model-based reinforcement learning in continuous environments using real-time constrained optimization. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 29.
  2. Andradóttir, S. (1995). A stochastic approximation algorithm with varying bounds. Operations Research, 43(6):1037–1048.
  3. Baird, L. C. (1994). Reinforcement learning in continuous time: Advantage updating. In Proceedings of 1994 IEEE International Conference on Neural Networks (ICNN’94), volume 4, pages 2448–2453. IEEE.
  4. Risk-sensitive dynamic asset management. Applied Mathematics and Optimization, 39:337–360.
  5. Distributionally robust mean-variance portfolio selection with Wasserstein distances. Management Science, 68(9):6382–6410.
  6. Borkar, V. S. (2002). Q-learning for risk-sensitive control. Mathematics of Operations Research, 27(2):294–311.
  7. General bounds and finite-time improvement for the Kiefer-Wolfowitz stochastic approximation algorithm. Operations Research, 59(5):1211–1224.
  8. Risk-sensitive and robust decision-making: A CVaR optimization approach. Advances in Neural Information Processing Systems, 28.
  9. Learning equilibrium mean-variance strategy. Mathematical Finance, 33(4):1166–1212.
  10. Learning Merton’s strategies in an incomplete market: Recursive entropy regularization and biased Gaussian exploration. arXiv preprint arXiv:2312.11797.
  11. A dynamic mean-variance analysis for log returns. Management Science, 67(2):1093–1108.
  12. Risk-sensitive Investment Management, volume 19. World Scientific, Singapore.
  13. Asymptotic evaluation of certain Markov process expectations for large time. IV. Communications on Pure and Applied Mathematics, 36(2):183–212.
  14. Doya, K. (2000). Reinforcement learning in continuous time and space. Neural Computation, 12(1):219–245.
  15. Stochastic differential utility. Econometrica, pages 353–394.
  16. Robust properties of risk-sensitive control. Mathematics of Control, Signals and Systems, 13:318–332.
  17. Risk-sensitive soft actor-critic for robust deep reinforcement learning under distribution shifts. arXiv preprint arXiv:2402.09992.
  18. Substitution, risk aversion, and the temporal behavior of consumption. Econometrica, 57(4):937–969.
  19. Exponential Bellman equation and improved regret bounds for risk-sensitive reinforcement learning. Advances in Neural Information Processing Systems, 34:20436–20446.
  20. Risk-sensitive reinforcement learning: Near-optimal risk-sample tradeoff in regret. Advances in Neural Information Processing Systems, 33:22384–22395.
  21. Risk-sensitive control on an infinite time horizon. SIAM Journal on Control and Optimization, 33(6):1881–1915.
  22. On stochastic relaxed control for partially observed diffusions. Nagoya Mathematical Journal, 93:71–108.
  23. Risk-sensitive control and an optimal investment model II. The Annals of Applied Probability, 12(2):730–767.
  24. Actor-critic learning for mean-field control in continuous time. arXiv preprint arXiv:2303.06993.
  25. Maxmin expected utility with non-unique prior. Journal of Mathematical Economics, 18(2):141–153.
  26. Robust portfolio control with stochastic factor dynamics. Operations Research, 61(4):874–893.
  27. Entropy regularization for mean field games with learning. Mathematics of Operations Research, 47(4):3239–3260.
  28. Robust control and model uncertainty. American Economic Review, 91(2):60–66.
  29. Robustness and ambiguity in continuous time. Journal of Economic Theory, 146(3):1195–1223.
  30. Jacobson, D. (1973). Optimal stochastic linear systems with exponential performance criteria and their relation to deterministic differential games. IEEE Transactions on Automatic Control, 18(2):124–131.
  31. Policy evaluation and temporal-difference learning in continuous time and space: A martingale approach. Journal of Machine Learning Research, 23(1):6918–6972.
  32. Policy gradient and actor-critic learning in continuous time and space: Theory and algorithms. Journal of Machine Learning Research, 23(1):12603–12652.
  33. q-Learning in continuous time. Journal of Machine Learning Research, 24(161):1–61.
  34. The reinforcement learning Kelly strategy. Quantitative Finance, 22(8):1445–1464.
  35. Is Q-learning provably efficient? Advances in Neural Information Processing Systems, 31.
  36. Hamilton-Jacobi deep Q-learning for deterministic continuous-time systems with Lipschitz continuous controls. Journal of Machine Learning Research, 22(206):1–34.
  37. Reinforcement learning in robotics: A survey. The International Journal of Robotics Research, 32(11):1238–1274.
  38. Stochastic Approximation and Recursive Algorithms, volume 35. Springer-Verlag, New York, 2nd edition.
  39. Lai, T. L. (2003). Stochastic approximation. The Annals of Statistics, 31(2):391–406.
  40. Policy iterations for reinforcement learning problems in continuous time and space—Fundamental theory and methods. Automatica, 126:109421.
  41. Knight on risk and uncertainty. Journal of Political Economy, 95(2):394–406.
  42. Maenhout, P. J. (2004). Robust portfolio rules and asset pricing. Review of Financial Studies, 17(4):951–983.
  43. Remarks on risk-sensitive control problems. Applied Mathematics and Optimization, 52:297–310.
  44. Merton, R. C. (1969). Lifetime portfolio selection under uncertainty: The continuous-time case. The Review of Economics and Statistics, pages 247–257.
  45. Nagai, H. (1996). Bellman equations of risk-sensitive control. SIAM Journal on Control and Optimization, 34(1):74–101.
  46. Continuous Martingales and Brownian Motion, volume 293. Springer Science & Business Media, Berlin.
  47. A stochastic approximation method. The Annals of Mathematical Statistics, pages 400–407.
  48. A convergence theorem for non negative almost supermartingales and some applications. Optimizing Methods in Statistics, pages 233–257.
  49. Equivalence between policy gradients and soft Q-learning. arXiv preprint arXiv:1704.06440.
  50. Skiadas, C. (2003). Robust control and recursive utility. Finance and Stochastics, 7:475–489.
  51. Sun, Y. (2006). The exact law of large numbers via Fubini extension and characterization of insurable risks. Journal of Economic Theory, 126(1):31–69.
  52. Optimal scheduling of entropy regulariser for continuous-time linear-quadratic reinforcement learning. SIAM Journal on Control and Optimization, 62(1):135–166.
  53. Making deep Q-learning methods robust to time discretization. In International Conference on Machine Learning, pages 6096–6104. PMLR.
  54. Exploratory HJB equations and their convergence. SIAM Journal on Control and Optimization, 60(6):3191–3216.
  55. Reinforcement learning for continuous-time optimal execution: actor-critic algorithm and error analysis. Available at SSRN 4378950.
  56. Reinforcement learning in continuous time and space: A stochastic control approach. Journal of Machine Learning Research, 21(198):1–34.
  57. Continuous-time mean–variance portfolio selection: A reinforcement learning framework. Mathematical Finance, 30(4):1273–1308.
  58. A finite sample complexity bound for distributionally robust Q-learning. In International Conference on Artificial Intelligence and Statistics, pages 3370–3398. PMLR.
  59. Continuous-time q-learning for McKean-Vlasov control problems. arXiv preprint arXiv:2306.16208.
  60. Risk-sensitive Markov decision process and learning under general utility functions. arXiv preprint arXiv:2311.13589.
  61. Regret bounds for Markov decision processes with recursive optimized certainty equivalents. In International Conference on Machine Learning, pages 38400–38427. PMLR.
  62. Stochastic Controls: Hamiltonian Systems and HJB Equations. Springer, New York.
  63. Zhou, X. Y. (1992). On the existence of optimal relaxed controls of stochastic partial differential equations. SIAM Journal on Control and Optimization, 30(2):247–261.
Authors (1)
  1. Yanwei Jia (10 papers)
