Simple Policy Optimization (arXiv:2401.16025v8)

Published 29 Jan 2024 in cs.LG

Abstract: Model-free reinforcement learning algorithms have seen remarkable progress, but key challenges remain. Trust Region Policy Optimization (TRPO) is known for ensuring monotonic policy improvement through conservative updates within a trust region, backed by strong theoretical guarantees. However, its reliance on complex second-order optimization limits its practical efficiency. Proximal Policy Optimization (PPO) addresses this by simplifying TRPO's approach using ratio clipping, improving efficiency but sacrificing some theoretical robustness. This raises a natural question: Can we combine the strengths of both methods? In this paper, we introduce Simple Policy Optimization (SPO), a novel unconstrained first-order algorithm. By slightly modifying the policy loss used in PPO, SPO can achieve the best of both worlds. Our new objective improves upon ratio clipping, offering stronger theoretical properties and better constraining the probability ratio within the trust region. Empirical results demonstrate that SPO outperforms PPO with a simple implementation, particularly for training large, complex network architectures end-to-end.
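
The abstract describes SPO as a slight modification of PPO's policy loss that better constrains the probability ratio, but it does not state the new objective itself. For context only, the sketch below shows the standard PPO clipped surrogate loss that SPO departs from; this is a minimal PyTorch illustration of ratio clipping, not the SPO objective, and the function and tensor names are illustrative.

```python
# Minimal sketch of the PPO clipped surrogate loss (the baseline SPO modifies).
# The SPO objective itself is defined in the paper, not reproduced here.
import torch


def ppo_clip_loss(log_prob_new: torch.Tensor,
                  log_prob_old: torch.Tensor,
                  advantage: torch.Tensor,
                  clip_eps: float = 0.2) -> torch.Tensor:
    """Negative PPO clipped surrogate objective (a loss to minimize)."""
    # Probability ratio r_t = pi_theta(a_t|s_t) / pi_theta_old(a_t|s_t)
    ratio = torch.exp(log_prob_new - log_prob_old)
    unclipped = ratio * advantage
    # Ratio clipping keeps r_t inside [1 - eps, 1 + eps] for the clipped surrogate
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantage
    # PPO takes the elementwise minimum (pessimistic bound) of the two surrogates
    return -torch.min(unclipped, clipped).mean()


if __name__ == "__main__":
    # Toy usage with random log-probabilities and advantages
    torch.manual_seed(0)
    lp_new = torch.randn(8, requires_grad=True)
    lp_old = lp_new.detach() + 0.1 * torch.randn(8)
    adv = torch.randn(8)
    loss = ppo_clip_loss(lp_new, lp_old, adv)
    loss.backward()
    print(float(loss))
```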
