
Policy Gradient Converges to the Globally Optimal Policy for Nearly Linear-Quadratic Regulators (2303.08431v4)

Published 15 Mar 2023 in cs.LG, math.OC, and stat.ML

Abstract: Nonlinear control systems with partial information available to the decision maker are prevalent in a variety of applications. As a step toward studying such nonlinear systems, this work explores reinforcement learning methods for finding the optimal policy in nearly linear-quadratic regulator systems. In particular, we consider a dynamic system that combines linear and nonlinear components and is governed by a policy with the same structure. Assuming that the nonlinear component comprises kernels with small Lipschitz coefficients, we characterize the optimization landscape of the cost function. Although the cost function is nonconvex in general, we establish local strong convexity and smoothness in the vicinity of the global optimizer. We additionally propose an initialization mechanism that leverages these properties. Building on these developments, we design a policy gradient algorithm that is guaranteed to converge to the globally optimal policy at a linear rate.
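
For intuition, here is a minimal, hedged sketch of the setting the abstract describes: discrete-time dynamics of the form x_{t+1} = A x_t + B u_t + φ(x_t) with a small-Lipschitz nonlinearity φ, a linear feedback policy u_t = -K x_t, a quadratic cost, a warm start from the exact LQR solution of the linear part, and a plain policy gradient loop driven by a zeroth-order gradient estimate. The matrices, the tanh kernel, the warm start, the step size, and the estimator are illustrative assumptions, not the authors' exact formulation or guarantees.

```python
# A minimal sketch (not the authors' algorithm): vanilla policy gradient on a
# nearly linear-quadratic system with a small-Lipschitz nonlinearity.
import numpy as np
from scipy.linalg import solve_discrete_are

rng = np.random.default_rng(0)

n, m, horizon = 4, 2, 50                      # state dim, input dim, rollout length
A = 0.9 * np.eye(n) + 0.05 * rng.standard_normal((n, n))
B = rng.standard_normal((n, m))
Q, R = np.eye(n), np.eye(m)

def phi(x, lip=0.05):
    """Illustrative nonlinear kernel with a small Lipschitz coefficient."""
    return lip * np.tanh(x)

def cost(K, n_rollouts=20):
    """Average finite-horizon quadratic cost under the linear feedback u = -K x."""
    total = 0.0
    for _ in range(n_rollouts):
        x = rng.standard_normal(n)
        for _ in range(horizon):
            u = -K @ x
            total += x @ Q @ x + u @ R @ u
            x = A @ x + B @ u + phi(x)        # nearly linear dynamics
    return total / n_rollouts

def grad_estimate(K, radius=0.1, n_samples=40):
    """Two-point zeroth-order gradient estimate (a stand-in for an exact gradient oracle)."""
    g = np.zeros_like(K)
    for _ in range(n_samples):
        U = rng.standard_normal(K.shape)
        U /= np.linalg.norm(U)                # uniform direction on the unit sphere
        g += (cost(K + radius * U) - cost(K - radius * U)) / (2 * radius) * U
    return g * (K.size / n_samples)

# Warm start from the exact LQR solution of the linear part (one plausible
# initialization near the global optimizer; the paper's mechanism may differ).
P = solve_discrete_are(A, B, Q, R)
K = np.linalg.solve(R + B.T @ P @ B, B.T @ P @ A)

step = 1e-4
for it in range(51):
    K -= step * grad_estimate(K)
    if it % 10 == 0:
        print(f"iter {it:3d}   cost {cost(K):.3f}")
```

The warm start plays the role that the paper assigns to its initialization mechanism: it places the iterate in the region around the global optimizer where the cost is locally strongly convex and smooth, which is where gradient descent can converge at a linear rate.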

Authors (3)
  1. Yinbin Han (3 papers)
  2. Meisam Razaviyayn (76 papers)
  3. Renyuan Xu (33 papers)
Citations (5)

