Human-compatible driving partners through data-regularized self-play reinforcement learning

(2403.19648)
Published Mar 28, 2024 in cs.RO, cs.AI, cs.LG, and cs.MA

Abstract

A central challenge for autonomous vehicles is coordinating with humans. Therefore, incorporating realistic human agents is essential for scalable training and evaluation of autonomous driving systems in simulation. Simulation agents are typically developed by imitating large-scale, high-quality datasets of human driving. However, pure imitation learning agents empirically have high collision rates when executed in a multi-agent closed-loop setting. To build agents that are realistic and effective in closed-loop settings, we propose Human-Regularized PPO (HR-PPO), a multi-agent algorithm where agents are trained through self-play with a small penalty for deviating from a human reference policy. In contrast to prior work, our approach is RL-first and only uses 30 minutes of imperfect human demonstrations. We evaluate agents in a large set of multi-agent traffic scenes. Results show our HR-PPO agents are highly effective in achieving goals, with a success rate of 93%, an off-road rate of 3.5%, and a collision rate of 3%. At the same time, the agents drive in a human-like manner, as measured by their similarity to existing human driving logs. We also find that HR-PPO agents show considerable improvements on proxy measures for coordination with human driving, particularly in highly interactive scenarios. We open-source our code and trained agents at https://github.com/Emerge-Lab/nocturne_lab and provide demonstrations of agent behaviors at https://sites.google.com/view/driving-partners.

Figure: Comparison of PPO and HR-PPO in episodic returns and KL divergence, highlighting the impact of regularization.

Overview

  • The paper introduces Human-Regularized Proximal Policy Optimization (HR-PPO), a reinforcement learning algorithm for creating autonomous vehicle agents that behave like human drivers using a regularization term based on 30 minutes of human driving data.

  • HR-PPO agents demonstrate both high effectiveness in achieving goals with low collision rates and similarity to human driving behavior across a variety of multi-agent traffic scenarios.

  • The approach shows the potential of using a small amount of human behavioral data to create realistic driving agents for simulation-based autonomous vehicle testing, aiming to reduce the need for extensive real-world trials.

  • Future research directions include scaling the approach with larger datasets, integrating advanced imitation learning methods, and further exploring the theoretical aspects of regularized multi-agent reinforcement learning.

Human-Regularized PPO for Generating Human-Compatible Driving Agents in Simulation

Introduction to Human-Regularized PPO

The development of autonomous vehicle (AV) systems that can seamlessly integrate and coordinate with human-driven vehicles presents a complex challenge. A promising approach to address this challenge involves the use of driving simulators for the scalable training and evaluation of AV systems. Existing methods, primarily built on imitation learning, have shown limitations, notably high collision rates when deployed in multi-agent environments. This paper introduces Human-Regularized Proximal Policy Optimization (HR-PPO), a multi-agent reinforcement learning algorithm designed to generate agents that are not only effective in achieving their goals (navigating to a destination) but also exhibit human-like driving behaviors. This is achieved by adding a regularization term that penalizes deviations from a human reference policy, making HR-PPO an RL-first strategy that aligns with human driving conventions while using only 30 minutes of human driving data.
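To make the regularization concrete, the sketch below (in PyTorch, not the authors' released nocturne_lab code) shows one way a KL penalty toward a frozen human reference policy can be folded into the standard PPO clipped surrogate loss. The function name, the direction of the KL term, and the value of the penalty weight are illustrative assumptions rather than the paper's exact formulation.

```python
import torch
import torch.nn.functional as F


def hr_ppo_policy_loss(
    policy_logits: torch.Tensor,   # current policy logits, shape (batch, n_actions)
    human_logits: torch.Tensor,    # frozen human reference policy logits, same shape
    old_log_probs: torch.Tensor,   # log-probs of taken actions under the rollout policy
    actions: torch.Tensor,         # taken discrete action indices, shape (batch,)
    advantages: torch.Tensor,      # advantage estimates, shape (batch,)
    clip_eps: float = 0.2,         # standard PPO clipping range
    kl_weight: float = 0.1,        # assumed strength of the human regularizer
) -> torch.Tensor:
    """PPO clipped surrogate loss plus a KL penalty toward a human reference policy."""
    log_probs = F.log_softmax(policy_logits, dim=-1)
    new_log_probs = log_probs.gather(1, actions.unsqueeze(-1)).squeeze(-1)

    # Standard PPO clipped surrogate objective.
    ratio = torch.exp(new_log_probs - old_log_probs)
    surr1 = ratio * advantages
    surr2 = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantages
    ppo_loss = -torch.min(surr1, surr2).mean()

    # Human regularizer: KL divergence from the current policy to a frozen
    # reference policy (e.g. a behavioral-cloning net trained on the ~30 minutes
    # of human demonstrations). The KL direction here is an assumption.
    human_log_probs = F.log_softmax(human_logits, dim=-1)
    kl_to_human = F.kl_div(
        human_log_probs, log_probs, log_target=True, reduction="batchmean"
    )

    return ppo_loss + kl_weight * kl_to_human
```

In practice, the reference policy would be trained once on the human driving logs and kept frozen during self-play, so the penalty only pulls the learning agents toward human-like action distributions without updating the reference itself.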

Key Contributions of HR-PPO

The paper's primary contributions are threefold:

  • It demonstrates that a regularization term in the PPO algorithm facilitates the training of agents that align closely with human driving behaviors, as evidenced by their performance in a diverse array of multi-agent traffic scenarios.
  • The HR-PPO agents exhibit a high degree of effectiveness (goal achievement with low collision and off-road rates) alongside a pronounced increase in human-likeness, as indicated by various proxy measures for human driving behavior.
  • The proposed approach underscores the advantages of multi-agent training settings. Remarkably, HR-PPO agents outperform those trained directly on the test distribution of agents, suggesting that multi-agent training confers benefits beyond those of single-agent training.

Theoretical and Practical Implications

The development of HR-PPO has both theoretical and practical implications for the field of autonomous driving and general AI research. Theoretically, the approach highlights the potential of leveraging small amounts of human behavioral data to regularize reinforcement learning policies, pushing the boundaries of how closely simulated agents can mimic human behavior. Practically, HR-PPO addresses a critical industry need by generating more realistic and effective driving agents for simulation-based testing, which is crucial for the safe and efficient development of autonomous driving technologies. Furthermore, the demonstrated ability of HR-PPO agents to better coordinate with human drivers in highly interactive scenarios points towards its potential in reducing the need for extensive real-world testing, thus accelerating the development cycle of AV systems.

Future Directions

While HR-PPO marks a significant step forward, several areas for future research emerge. Scaling the approach to more extensive and varied datasets could further enhance agent generalization capabilities. Also, exploring the integration of more sophisticated imitation learning methods might improve the quality and effectiveness of the regularized policies. Moreover, extending the evaluation framework to include interactions with human drivers in simulated or controlled real-world environments could provide deeper insights into the practical efficacy of the trained agents.

Finally, understanding the theoretical underpinnings of regularized multi-agent reinforcement learning, particularly in contexts lacking a precise reward function, remains an open question. Addressing these challenges could lead to more robust, effective, and human-compatible autonomous systems, significantly impacting how autonomous systems are developed and deployed in complex, multi-agent environments, such as urban traffic systems.

Concluding Remarks

HR-PPO exemplifies a novel approach to harnessing reinforcement learning for developing autonomous driving agents that effectively mimic human driving behavior. By incorporating a regularization term that keeps agent actions close to a human reference policy, HR-PPO paves the way for more realistic, safe, and efficient training environments for autonomous vehicles, and it represents a substantial contribution to the pursuit of human-compatible AI systems in autonomous driving and beyond.
