- The paper introduces an update-equivalence framework that designs decision-time planning algorithms to replicate the updates of last-iterate algorithms, enhancing strategic decision-making in games with imperfect information.
- It employs mirror descent for sound policy improvement in cooperative settings and magnetic mirror descent to damp cyclic behavior in adversarial ones.
- Empirical results in environments such as Hanabi and Phantom Tic-Tac-Toe show performance matching or surpassing public-belief-state (PBS) methods at a fraction of the computational cost.
An Overview of the Update-Equivalence Framework for Decision-Time Planning
The paper "The Update-Equivalence Framework for Decision-Time Planning" introduces an innovative method of decision-time planning (DTP) that enhances the scalability and efficacy of strategic decision-making in games, particularly those with imperfect information. This framework offers a noteworthy alternative to traditional methods dependent on public belief states (PBS) by leveraging a concept termed "update equivalence." The primary objective of this approach is to design DTP algorithms that replicate the updates of last-iterate algorithms, thus allowing for scalability in complex environments with extensive non-public information.
Core Concepts and Methodology
The update-equivalence framework posits that DTP algorithms can be designed to mimic the update steps of last-iterate algorithms, that is, algorithms whose current policy converges toward equilibrium or improves in expected return with each update. Key to this framework is the insight that such updates need not rely on extensive public information, thereby sidestepping a central limitation of PBS-based planning. This is particularly advantageous in settings with a substantial amount of non-public information, where PBS methods struggle because the belief-state representation grows exponentially.
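To make the idea concrete, below is a minimal Python sketch of an update-equivalent planning step at a single information state. The names (`estimate_q`, `update_rule`) and the single-simplex setting are illustrative assumptions rather than the paper's implementation; the point is only that search supplies the quantities a global last-iterate algorithm would use, and that algorithm's own update is then applied locally.

```python
def decision_time_update(pi_t, estimate_q, update_rule, num_samples=1000):
    """Illustrative update-equivalent planning step at one information state.

    Rather than maintaining a public belief state, decision-time search
    estimates the quantities a global last-iterate algorithm would use
    (here, action values under the current policy) and then applies that
    algorithm's update rule locally.

    pi_t        : blueprint policy at this information state, shape (A,)
    estimate_q  : search routine returning estimated action values, shape (A,)
    update_rule : the last-iterate update being replicated, (pi, q) -> pi'
    """
    q_hat = estimate_q(pi_t, num_samples)  # e.g., Monte Carlo rollouts
    pi_next = update_rule(pi_t, q_hat)     # the same update a global learner would take
    return pi_next                          # act according to the updated policy
```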
The paper develops algorithms within this framework using mirror descent and magnetic mirror descent as the underlying last-iterate updates. Mirror descent yields sound policy improvement in fully cooperative games, while magnetic mirror descent, which regularizes each update toward a reference ("magnet") policy, is used in adversarial settings to damp the cyclic behavior typically observed in two-player zero-sum games.
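As a hedged illustration of these two updates, here are their closed forms over a single action simplex under an entropic (negative-entropy) mirror map, the standard choice in this literature; the step size `eta`, magnet strength `alpha`, and function names are illustrative assumptions.

```python
import numpy as np

def mirror_descent_step(pi_t, q_t, eta):
    """Entropic mirror descent (multiplicative weights):
    pi_{t+1}(a) ∝ pi_t(a) * exp(eta * q_t(a))."""
    logits = np.log(pi_t) + eta * q_t
    w = np.exp(logits - logits.max())  # subtract max for numerical stability
    return w / w.sum()

def magnetic_mirror_descent_step(pi_t, q_t, rho, eta, alpha):
    """Magnetic mirror descent with a KL magnet toward a reference policy rho:
    pi_{t+1}(a) ∝ (pi_t(a) * rho(a)**(eta*alpha) * exp(eta*q_t(a)))**(1/(1+eta*alpha)).
    The magnet damps the cycling that unregularized updates exhibit in
    two-player zero-sum games."""
    logits = (np.log(pi_t) + eta * alpha * np.log(rho) + eta * q_t) / (1.0 + eta * alpha)
    w = np.exp(logits - logits.max())
    return w / w.sum()

# Example: one update on a single decision with three actions.
pi = np.full(3, 1.0 / 3)            # current policy
q = np.array([1.0, 0.0, -1.0])      # estimated action values
print(mirror_descent_step(pi, q, eta=0.1))
print(magnetic_mirror_descent_step(pi, q, rho=pi, eta=0.1, alpha=0.5))
```

Iterating these updates on a small zero-sum matrix game illustrates the contrast: plain mirror descent's iterates tend to orbit the equilibrium, whereas the magnet pulls magnetic mirror descent's iterates toward it.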
Empirical Evaluation and Results
The authors validate the framework by implementing mirror descent search (MDS) and magnetic mirror descent search (MMDS) and testing them in established benchmarks such as Hanabi, Abrupt Dark Hex, and Phantom Tic-Tac-Toe. In Hanabi, MDS equals or surpasses the performance of state-of-the-art PBS-based methods while using two orders of magnitude less search time, highlighting the practical benefits of the update-equivalence framework.
In adversarial settings such as Abrupt Dark Hex and Phantom Tic-Tac-Toe, MMDS effectively reduces approximate exploitability, reflecting the framework's scalability in games where little of the state is public. Together, these results provide substantial evidence for the update-equivalence framework as a viable alternative for decision-time planning across a range of game-theoretic settings.
Practical and Theoretical Implications
Practically, the update-equivalence framework enables decision-time planning in settings with large amounts of non-public information, a regime where PBS methods are less effective. It also offers a straightforward recipe for proving the soundness of DTP algorithms: inherit the improvement or convergence guarantees of the underlying last-iterate algorithm, marrying theoretical robustness with empirical efficacy.
Theoretically, the framework provides a new lens through which DTP can be viewed, moving away from complex PBS structures to more straightforward update mechanisms. This potentially opens doors to new research avenues in strategic decision-making and expert systems, with implications for other fields where planning under uncertainty is crucial.
Speculative Future Developments
Future research might explore update-equivalence-inspired algorithms in applications beyond games, including multi-agent systems and real-world decision-making. There is also potential for this approach to synergize with reinforcement learning techniques, offering robust decision-making in both static and dynamic settings within complex environments.
In conclusion, the update-equivalence framework presents a compelling direction for decision-time planning, providing actionable strategies for efficiently tackling games with substantial private information. The combination of theoretical rigor and demonstrated practical performance positions this approach as a promising advance in the field, with broad implications for both AI research and applied strategic problem-solving.