- The paper introduces a novel safety-focused framework for MARL by formulating the problem as a constrained Markov game.
- It proposes two algorithms—MACPO with hard constraints and MAPPO-Lagrangian with soft constraints using Lagrangian multipliers—for enforcing safety during policy updates.
- Empirical tests on SMAMuJoCo and SMARobosuite benchmarks show that the methods maintain safety while achieving competitive rewards compared to traditional MARL techniques.
Overview
The paper introduces a novel approach to multi-agent reinforcement learning (MARL) that integrates safety, a critical requirement in many real-world scenarios. In such applications, agents must not only act to maximize their reward but also adhere to safety constraints to avoid harmful outcomes for themselves and their surrounding environment. The researchers address the lack of rigorous studies and benchmarks in safe multi-agent learning by formulating the problem as a constrained Markov game and developing two algorithms: Multi-Agent Constrained Policy Optimization (MACPO) and MAPPO-Lagrangian.
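As a rough sketch of the constrained Markov game formulation (the notation below is illustrative and not taken verbatim from the paper), each agent maximizes the joint discounted reward while keeping each of its expected discounted costs under a safety budget:

```latex
% Constrained Markov game objective (illustrative notation, assumed for this summary):
% agent i maximises the joint expected return subject to each expected cost
% J_i^j staying below its safety budget c_i^j.
\max_{\pi^{i}} \; J(\boldsymbol{\pi})
  = \mathbb{E}_{\boldsymbol{\pi}}\!\Big[\sum_{t=0}^{\infty} \gamma^{t} R(s_t, \mathbf{a}_t)\Big]
\quad \text{s.t.} \quad
J_i^{j}(\boldsymbol{\pi})
  = \mathbb{E}_{\boldsymbol{\pi}}\!\Big[\sum_{t=0}^{\infty} \gamma^{t} C_i^{j}(s_t, \mathbf{a}_t)\Big]
  \le c_i^{j} \quad \forall j.
```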
Theoretical Foundations
These algorithms build on a combination of constrained policy optimization and multi-agent trust region learning. The constraint component extends standard policy gradient updates so that the agents' policies satisfy predefined safety requirements. Both methods rest on solid theoretical grounding, with guarantees of monotonic improvement in reward and satisfaction of safety constraints at every training iteration.
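A minimal sketch of the kind of per-agent trust-region update behind these guarantees, with assumed notation (reward advantage $A$, cost advantage $A^{\mathrm{cost}}$, KL radius $\delta$, cost budget $c$; constant discount factors omitted):

```latex
% Per-agent constrained trust-region update (sketch, notation assumed):
% maximise the reward surrogate within a KL trust region while bounding the
% cost surrogate, which yields monotonic reward improvement and constraint
% satisfaction at each iteration.
\pi^{i}_{k+1} = \arg\max_{\pi^{i}} \;
  \mathbb{E}_{s \sim d_{\boldsymbol{\pi}_k},\, \mathbf{a} \sim \boldsymbol{\pi}}
  \big[A_{\boldsymbol{\pi}_k}(s, \mathbf{a})\big]
\quad \text{s.t.} \quad
\bar{D}_{\mathrm{KL}}\big(\pi^{i} \,\|\, \pi^{i}_{k}\big) \le \delta,
\qquad
J_i^{j}(\boldsymbol{\pi}_k)
  + \mathbb{E}_{s \sim d_{\boldsymbol{\pi}_k},\, \mathbf{a} \sim \boldsymbol{\pi}}
  \big[A^{j,\mathrm{cost}}_{i}(s, \mathbf{a})\big] \le c_i^{j}.
```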
Algorithmic Contributions
The two key contributions are MACPO and MAPPO-Lagrangian. MACPO enforces hard constraints, using a backtracking line search to keep each policy update within the safety constraints. In contrast, MAPPO-Lagrangian takes a soft-constraint approach, using Lagrangian multipliers to trade off reward against constraint violation during policy updates. Both methods are evaluated on a range of continuous control tasks in newly developed simulation environments tailored for safe MARL: Safe Multi-Agent MuJoCo (SMAMuJoCo) and Safe Multi-Agent Robosuite (SMARobosuite).
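The soft-constraint idea behind MAPPO-Lagrangian can be illustrated with a short PyTorch-style sketch. All names here (`lagrangian_policy_loss`, `reward_adv`, `cost_adv`, `cost_limit`) and the hyperparameters are assumptions for illustration, not the authors' implementation:

```python
# Sketch of a Lagrangian soft-constraint policy update (illustrative, not the paper's code).
import torch

def lagrangian_policy_loss(log_prob_new, log_prob_old,
                           reward_adv, cost_adv,
                           lagrange_multiplier, clip_eps=0.2):
    """Clipped surrogate on reward minus a multiplier-weighted cost term."""
    ratio = torch.exp(log_prob_new - log_prob_old)
    # Standard PPO-style clipped reward objective.
    reward_obj = torch.min(ratio * reward_adv,
                           torch.clamp(ratio, 1 - clip_eps, 1 + clip_eps) * reward_adv)
    # Penalise expected cost; the multiplier controls how strongly the constraint is enforced.
    cost_obj = ratio * cost_adv
    return -(reward_obj - lagrange_multiplier * cost_obj).mean()

def update_multiplier(lagrange_multiplier, episode_cost, cost_limit, lr=0.01):
    """Dual ascent: grow the multiplier when observed cost exceeds the budget, never below zero."""
    return max(0.0, lagrange_multiplier + lr * (episode_cost - cost_limit))
```

MACPO, by contrast, keeps the constraint hard: the update direction comes from the constrained optimization problem itself, and the backtracking line search shrinks the step until both the trust-region and safety constraints are satisfied.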
Experimental Results
Empirical evaluations on the SMAMuJoCo and SMARobosuite benchmarks show that both MACPO and MAPPO-Lagrangian consistently satisfy safety constraints during training while achieving rewards comparable to traditional MARL baselines. The algorithms effectively reduce the risk associated with unsafe actions while still steering toward high-reward outcomes.
Implications and Future Work
This research marks a significant step toward addressing safety concerns in MARL. Safe policy learning is particularly relevant for deploying AI-based systems in fields like robotics, transportation, and healthcare, where violating safety constraints can have detrimental effects. Looking ahead, the researchers plan to apply these algorithms in physical environments to explore their potential in practical applications where safe interaction is imperative.