Reward Machines for Cooperative Multi-Agent Reinforcement Learning

Published 3 Jul 2020 in cs.MA and cs.AI | (2007.01962v2)

Abstract: In cooperative multi-agent reinforcement learning, a collection of agents learns to interact in a shared environment to achieve a common goal. We propose the use of reward machines (RM) -- Mealy machines used as structured representations of reward functions -- to encode the team's task. The proposed novel interpretation of RMs in the multi-agent setting explicitly encodes required teammate interdependencies, allowing the team-level task to be decomposed into sub-tasks for individual agents. We define such a notion of RM decomposition and present algorithmically verifiable conditions guaranteeing that distributed completion of the sub-tasks leads to team behavior accomplishing the original task. This framework for task decomposition provides a natural approach to decentralized learning: agents may learn to accomplish their sub-tasks while observing only their local state and abstracted representations of their teammates. We accordingly propose a decentralized q-learning algorithm. Furthermore, in the case of undiscounted rewards, we use local value functions to derive lower and upper bounds for the global value function corresponding to the team task. Experimental results in three discrete settings exemplify the effectiveness of the proposed RM decomposition approach, which converges to a successful team policy an order of magnitude faster than a centralized learner and significantly outperforms hierarchical and independent q-learning approaches.

Abstract PDF Upgrade to Chat

Citations (41)

View on Semantic Scholar

Summary

The paper introduces using reward machines (RMs) to structure cooperative MARL tasks, allowing decomposition into sub-tasks for decentralized agent learning and improving convergence.
It proposes decomposing team tasks using RM projections onto local agent events and a decentralized Q-learning algorithm that learns independently with shared event synchronization.
Experimental results show order-of-magnitude improvements in convergence speed and scalability, enabling successful learning in configurations with up to ten agents.

Reward Machines for Cooperative Multi-Agent Reinforcement Learning

The paper "Reward Machines for Cooperative Multi-Agent Reinforcement Learning" by Cyrus Neary, Zhe Xu, Bo Wu, and Ufuk Topcu introduces a structured representation approach using reward machines (RMs) for cooperative multi-agent reinforcement learning (MARL) tasks. RMs serve as Mealy machines that facilitate the modeling of multi-agent tasks through encoded reward functions, enabling the decomposition of team-level tasks into sub-tasks for individual agents. This structured framework addresses the non-stationarity and coordination challenges inherent in MARL, allowing agents to pursue decentralized learning with locally available state information and abstracted teammate interactions.

Key Contributions

The primary contribution of this work is the innovative application of reward machines to decentralized MARL, highlighting the capability to explicitly encode inter-agent dependencies and facilitate local task decomposition. This approach enables each agent to focus on sub-tasks while decentralizing learning processes, thereby reducing computational complexity and improving learning convergence rates.

Reward Machine (RM) Representation: The reward machine acts as a formalism that captures task structures in a multi-agent environment using events to trigger transitions, offering a temporal abstraction that simplifies the derivation and execution of agent policies.
Decomposition Through RM Projections: The concept of RM projection onto local event sets is introduced, where the team task represented by an RM is decomposed into separate RMs for each agent.
- Bisimilarity Condition: The paper defines conditions under which the behavior captured by the original RM and the parallel composition of agent-specific RMs are bisimilar. This ensures that individual agent policies collectively lead to achieving the team task.
Decentralized Learning Algorithm: A decentralized Q-learning algorithm is proposed, where each agent learns independently using its projected RM and a local labeling function. This allows agents to learn optimal sub-task policies without the need for full observability of the global environment state, mitigating non-stationarity.
- Synchronization of Shared Events: The algorithm anticipates and synchronizes events shared among agents during training, ensuring coordinated task progression.

Experimental Evaluation

The experimental evaluations are conducted in grid-based domains with various tasks such as button-pressing and rendezvous tasks with different agent configurations. The results demonstrate the efficacy of the decentralized approach, showing order-of-magnitude improvements in convergence speed and scalability compared to centralized learning (e.g., centralized Q-learning with reward machines) and existing decentralized MARL algorithms like hierarchical independent learners and independent Q-learning. Notably, the proposed decentralized algorithm achieves success in ten-agent settings where baseline methods struggle to converge.

Implications and Future Directions

From a theoretical standpoint, the framework advances the understanding of task decomposition and distributed task execution in MARL environments. Practically, it offers a pathway towards enhancing the scalability and efficiency of MARL systems in real-world applications such as cooperative robotics and autonomous systems. The structured abstract task representation provided by RMs allows for clearer delineation of agent roles and interactions necessary to simplify cooperative strategies.

Future developments could focus on:

Extension to Continuous Domains: Adapting RM-based frameworks and the decentralized learning paradigm to continuous and high-dimensional spaces, leveraging deep learning.
Automatic RM Construction: Developing methods for the automatic construction and evolution of reward machines from environment interactions in online settings to eliminate a priori task knowledge requirements.

The work opens a novel avenue for leveraging automata-theory constructs in the design of decentralized, scalable cooperative MARL systems, thus providing a robust methodology for tackling complex multi-agent coordination challenges.

Markdown Report Issue