
Multi-Agent Constrained Policy Optimisation

(2110.02793)
Published Oct 6, 2021 in cs.AI and cs.MA

Abstract

Developing reinforcement learning algorithms that satisfy safety constraints is becoming increasingly important in real-world applications. In multi-agent reinforcement learning (MARL) settings, policy optimisation with safety awareness is particularly challenging because each individual agent has to not only meet its own safety constraints, but also consider those of others so that their joint behaviour can be guaranteed safe. Despite its importance, the problem of safe multi-agent learning has not been rigorously studied; very few solutions have been proposed, and no shared testing environment or benchmarks exist. To fill these gaps, in this work, we formulate the safe MARL problem as a constrained Markov game and solve it with policy optimisation methods. Our solutions -- Multi-Agent Constrained Policy Optimisation (MACPO) and MAPPO-Lagrangian -- leverage theories from both constrained policy optimisation and multi-agent trust region learning. Crucially, our methods enjoy theoretical guarantees of both monotonic improvement in reward and satisfaction of safety constraints at every iteration. To examine the effectiveness of our methods, we develop the Safe Multi-Agent MuJoCo benchmark suite and evaluate against a variety of MARL baselines. Experimental results show that MACPO and MAPPO-Lagrangian consistently satisfy safety constraints while achieving performance comparable to strong baselines.

Key Points

  • Introduces a new approach to multi-agent reinforcement learning (MARL) focused on safety by formulating it as a constrained Markov game.

  • Develops two algorithms, MACPO and MAPPO-Lagrangian, which integrate constraints into policy optimization to ensure safety.

  • Provides theoretical backing for the algorithms, ensuring monotonic reward improvement and adherence to safety constraints.

  • Empirically evaluates the algorithms on new simulation environments (SMAMuJoCo and SMARobosuite), demonstrating that safety constraints can be satisfied without compromising reward performance.

  • Highlights the importance of safe MARL in fields like robotics and healthcare, with plans for future application in physical environments.

Overview

The paper introduces a novel approach to multi-agent reinforcement learning (MARL) that integrates the concept of safety, a critical aspect in many real-world scenarios. In such applications, agents must not only act to maximize their reward but also adhere to safety constraints to avoid harmful outcomes for themselves and their surrounding environment. The researchers address the lack of rigorous study and benchmarks in safe multi-agent learning by formulating the problem as a constrained Markov game and developing two algorithms: Multi-Agent Constrained Policy Optimization (MACPO) and MAPPO-Lagrangian.
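
Concretely, a constrained Markov game augments the usual Markov game with per-agent cost functions and cost budgets. The sketch below uses standard constrained-MDP notation rather than the paper's exact symbols: R is the joint reward, C^i_j the j-th cost function of agent i, c^i_j its budget, and gamma the discount factor.

    \max_{\pi} \; J(\pi) = \mathbb{E}_{\pi}\Big[\sum_{t=0}^{\infty} \gamma^t R(s_t, \mathbf{a}_t)\Big]
    \quad \text{s.t.} \quad
    J^i_j(\pi) = \mathbb{E}_{\pi}\Big[\sum_{t=0}^{\infty} \gamma^t C^i_j(s_t, \mathbf{a}_t)\Big] \le c^i_j
    \quad \text{for every agent } i \text{ and cost index } j.

In words: the joint policy maximises the expected discounted reward subject to every agent keeping each of its expected discounted costs below the corresponding budget.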

Theoretical Foundations

The foundation of these algorithms lies in combining constrained policy optimization with multi-agent trust region learning. These approaches extend policy gradient methods by incorporating cost constraints into the trust-region policy update, allowing the agents' policies to satisfy predefined safety requirements. The proposed algorithms rest on solid theoretical grounding, with guarantees of both monotonic improvement in reward and satisfaction of safety constraints at each training iteration.
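
As a rough schematic of where those guarantees come from (this is a simplified sketch, not the paper's exact statement), each agent's update can be viewed as a KL-constrained surrogate maximisation with an added cost constraint, where A and A^C denote reward and cost advantage estimates under the current joint policy pi_k, c is the cost budget, and delta the trust-region radius:

    \pi^i_{k+1} = \arg\max_{\pi^i} \; \mathbb{E}\big[ A_{\pi_k}(s, a) \big]
    \quad \text{s.t.} \quad
    J^C(\pi_k) + \frac{1}{1-\gamma}\,\mathbb{E}\big[ A^C_{\pi_k}(s, a) \big] \le c,
    \qquad \bar{D}_{\mathrm{KL}}\big(\pi^i_k, \pi^i\big) \le \delta.

Bounding both the surrogate error and the KL divergence at every agent-wise update is what yields the joint guarantees of reward improvement and constraint satisfaction.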

Algorithmic Contributions

Two key contributions are MACPO and MAPPO-Lagrangian. MACPO utilizes hard constraints and a backtracking line search to satisfy safety constraints during policy updates. In contrast, MAPPO-Lagrangian uses a soft-constraint approach with Lagrangian multipliers to adjust policies while maintaining safety. These methods have been rigorously tested through a range of continuous control tasks in newly developed simulation environments tailored for safe MARL: Safe Multi-Agent MuJoCo (SMAMuJoCo) and Safe Multi-Agent Robosuite (SMARobosuite).
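
To make the soft-constraint idea concrete, the snippet below sketches a generic PPO-style loss with a Lagrangian cost penalty and dual ascent on the multiplier. It illustrates the general technique only; it is not the authors' implementation, and all function and variable names are invented for the example.

    # Generic Lagrangian soft-constraint sketch (illustrative only, not the
    # authors' code). `ratio` is pi_new(a|s) / pi_old(a|s) for sampled actions.
    import numpy as np

    def penalized_policy_loss(ratio, reward_adv, cost_adv, lam, clip_eps=0.2):
        """PPO-style clipped reward surrogate minus a lambda-weighted cost surrogate."""
        clipped = np.clip(ratio, 1.0 - clip_eps, 1.0 + clip_eps)
        reward_surr = np.minimum(ratio * reward_adv, clipped * reward_adv)
        cost_surr = ratio * cost_adv
        # Maximise reward, pay lambda per unit of expected cost; return a loss to minimise.
        return -np.mean(reward_surr - lam * cost_surr) / (1.0 + lam)

    def update_multiplier(lam, episode_cost, cost_budget, lam_lr=0.01):
        """Dual ascent: grow lambda when the cost budget is violated, shrink it otherwise."""
        return max(0.0, lam + lam_lr * (episode_cost - cost_budget))

In this scheme the multiplier acts as an adaptive penalty weight: when measured costs exceed the budget, lambda rises and the next policy update trades reward for constraint satisfaction, which matches the qualitative behaviour described for MAPPO-Lagrangian.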

Experimental Results

Empirical evaluations on the SMAMuJoCo and SMARobosuite benchmarks show that both MACPO and MAPPO-Lagrangian consistently satisfy safety constraints during training while achieving reward performance comparable to traditional MARL baselines. Both algorithms effectively reduce the risk associated with unsafe actions while still steering towards high-reward outcomes.

Implications and Future Work

This research represents a significant step towards addressing safety concerns in MARL. The development of safe policy learning is particularly relevant for deploying AI-based systems in fields such as robotics, transportation, and healthcare, where violating safety constraints can have detrimental effects. Looking ahead, the researchers plan to apply these algorithms in physical environments to explore their potential in practical applications where safe interaction is imperative.
