Human-compatible driving partners through data-regularized self-play reinforcement learning (2403.19648v2)

Published 28 Mar 2024 in cs.RO, cs.AI, cs.LG, and cs.MA

Abstract: A central challenge for autonomous vehicles is coordinating with humans. Therefore, incorporating realistic human agents is essential for scalable training and evaluation of autonomous driving systems in simulation. Simulation agents are typically developed by imitating large-scale, high-quality datasets of human driving. However, pure imitation learning agents empirically have high collision rates when executed in a multi-agent closed-loop setting. To build agents that are realistic and effective in closed-loop settings, we propose Human-Regularized PPO (HR-PPO), a multi-agent algorithm where agents are trained through self-play with a small penalty for deviating from a human reference policy. In contrast to prior work, our approach is RL-first and only uses 30 minutes of imperfect human demonstrations. We evaluate agents in a large set of multi-agent traffic scenes. Results show our HR-PPO agents are highly effective in achieving goals, with a success rate of 93%, an off-road rate of 3.5%, and a collision rate of 3%. At the same time, the agents drive in a human-like manner, as measured by their similarity to existing human driving logs. We also find that HR-PPO agents show considerable improvements on proxy measures for coordination with human driving, particularly in highly interactive scenarios. We open-source our code and trained agents at https://github.com/Emerge-Lab/nocturne_lab and provide demonstrations of agent behaviors at https://sites.google.com/view/driving-partners.


Summary

  • The paper demonstrates that adding a human-regularization penalty to PPO markedly improves agents' ability to mimic human driving behavior.
  • HR-PPO agents achieve high goal success with low collision and off-road rates across diverse multi-agent traffic scenarios.
  • The study shows that a small amount of human driving data is enough to regularize training and produce realistic simulated driving agents.

Human-Regularized PPO for Generating Human-Compatible Driving Agents in Simulation

Introduction to Human-Regularized PPO

The development of autonomous vehicle (AV) systems that can seamlessly integrate and coordinate with human-driven vehicles is a complex challenge. A promising way to address it is to train and evaluate AV systems at scale in driving simulators. Existing methods, primarily built on imitation learning, have notable limitations, in particular high collision rates when deployed in multi-agent closed-loop settings. This paper introduces Human-Regularized Proximal Policy Optimization (HR-PPO), a multi-agent reinforcement learning algorithm designed to produce agents that are effective at reaching their goals (navigating to a destination) while driving in a human-like manner. The method augments standard self-play PPO with a regularization term that penalizes deviations from a human reference policy, making HR-PPO an RL-first strategy that aligns with human driving conventions using only 30 minutes of imperfect human driving data.
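
Concretely, the training objective can be thought of as standard self-play PPO plus a penalty for straying from a human reference policy. The form below is a minimal sketch with illustrative notation (the symbols $\lambda$, $\pi_{\mathrm{human}}$, and the KL direction are assumptions for exposition, not quoted from the paper):

$$
\max_{\theta}\;
\mathbb{E}_{\tau \sim \pi_{\theta}}\Big[\textstyle\sum_{t} \gamma^{t} r_{t}\Big]
\;-\;
\lambda\,\mathbb{E}_{s}\Big[D_{\mathrm{KL}}\big(\pi_{\theta}(\cdot \mid s)\,\big\|\,\pi_{\mathrm{human}}(\cdot \mid s)\big)\Big]
$$

Here $\pi_{\mathrm{human}}$ is a reference policy fit to the 30 minutes of human demonstrations (for example, via behavioral cloning), and $\lambda$ controls how strongly agents are pulled toward human-like actions; setting $\lambda = 0$ recovers plain self-play PPO.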

Key Contributions of HR-PPO

The paper's primary contributions are threefold:

  • It demonstrates that a human-regularization term in the PPO objective enables the training of agents that align closely with human driving behavior, as evidenced by their performance across a diverse set of multi-agent traffic scenes (a minimal code sketch of such a regularized update follows this list).
  • The HR-PPO agents are highly effective (93% goal success, with a 3% collision rate and a 3.5% off-road rate) while showing a pronounced increase in human-likeness, as indicated by several proxy measures of human driving behavior.
  • The proposed approach underscores the advantages of multi-agent training. Notably, HR-PPO agents outperform those trained directly on the test distribution of agents, suggesting that multi-agent training may confer benefits beyond those of single-agent training.
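
The sketch below illustrates how such a human-regularization penalty could be folded into a PPO policy update. It is a minimal illustration rather than the paper's implementation: the reference policy `human_ref`, the penalty weight `lam`, and the `get_distribution` interface are assumptions made for the example.

```python
import torch

def hr_ppo_policy_loss(policy, human_ref, obs, actions, advantages,
                       old_log_probs, clip_eps=0.2, lam=0.1):
    """Clipped PPO surrogate loss plus a KL penalty toward a frozen human
    reference policy. A sketch only; names and weights are illustrative."""
    # Action distribution of the current policy over the observation batch.
    dist = policy.get_distribution(obs)
    log_probs = dist.log_prob(actions)

    # Standard PPO clipped surrogate objective.
    ratio = torch.exp(log_probs - old_log_probs)
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps)
    ppo_loss = -torch.min(ratio * advantages, clipped * advantages).mean()

    # KL divergence from the frozen human reference policy; no gradient
    # flows into the reference network.
    with torch.no_grad():
        ref_dist = human_ref.get_distribution(obs)
    kl_to_human = torch.distributions.kl_divergence(dist, ref_dist).mean()

    # Small penalty for deviating from human-like behavior.
    return ppo_loss + lam * kl_to_human
```

In practice the reference policy would be trained first on the human driving logs (for example, with behavioral cloning) and then kept fixed, so the KL term only steers the self-play agents toward human-like actions without updating the reference itself.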

Theoretical and Practical Implications

The development of HR-PPO has both theoretical and practical implications for autonomous driving and general AI research. Theoretically, the approach shows that a small amount of human behavioral data can regularize reinforcement learning policies, pushing the boundaries of how closely simulated agents can mimic human behavior. Practically, HR-PPO addresses a clear industry need by producing more realistic and effective driving agents for simulation-based testing, which is crucial for the safe and efficient development of autonomous driving technology. Furthermore, the improved coordination of HR-PPO agents with logged human driving in highly interactive scenarios suggests the approach could reduce the need for extensive real-world testing and thereby shorten the development cycle of AV systems.

Future Directions

While HR-PPO marks a significant step forward, several areas for future research emerge. Scaling the approach to more extensive and varied datasets could further enhance agent generalization capabilities. Also, exploring the integration of more sophisticated imitation learning methods might improve the quality and effectiveness of the regularized policies. Moreover, extending the evaluation framework to include interactions with human drivers in simulated or controlled real-world environments could provide deeper insights into the practical efficacy of the trained agents.

Finally, the theoretical underpinnings of regularized multi-agent reinforcement learning, particularly in settings that lack a precise reward function, remain an open question. Addressing these challenges could lead to more robust, effective, and human-compatible agents, shaping how autonomous systems are developed and deployed in complex multi-agent environments such as urban traffic.

Concluding Remarks

HR-PPO exemplifies a novel way of harnessing reinforcement learning to develop autonomous driving agents that closely mimic human driving behavior. By adding a regularization term that keeps agent actions close to a human reference policy, HR-PPO enables more realistic, safe, and efficient training environments for autonomous vehicles. It thus represents a substantial contribution to the pursuit of human-compatible AI systems in autonomous driving and beyond.
