Strategy iteration is strongly polynomial for 2-player turn-based stochastic games with a constant discount factor (1008.0530v1)

Published 3 Aug 2010 in cs.GT

Abstract: Ye showed recently that the simplex method with Dantzig pivoting rule, as well as Howard's policy iteration algorithm, solve discounted Markov decision processes (MDPs), with a constant discount factor, in strongly polynomial time. More precisely, Ye showed that both algorithms terminate after at most $O(\frac{mn}{1-\gamma}\log(\frac{n}{1-\gamma}))$ iterations, where $n$ is the number of states, $m$ is the total number of actions in the MDP, and $0<\gamma<1$ is the discount factor. We improve Ye's analysis in two respects. First, we improve the bound given by Ye and show that Howard's policy iteration algorithm actually terminates after at most $O(\frac{m}{1-\gamma}\log(\frac{n}{1-\gamma}))$ iterations. Second, and more importantly, we show that the same bound applies to the number of iterations performed by the strategy iteration (or strategy improvement) algorithm, a generalization of Howard's policy iteration algorithm used for solving 2-player turn-based stochastic games with discounted zero-sum rewards. This provides the first strongly polynomial algorithm for solving these games, resolving a long standing open problem.

Citations (165)


Summary

The paper demonstrates that strategy iteration for 2-player turn-based stochastic games terminates in strongly polynomial time.
It extends Howard’s policy iteration from MDPs to more complex bi-player settings using refined combinatorial and algebraic techniques.
The results offer significant insights into algorithmic game theory and pave the way for advanced methods in decision-making under uncertainty.

Strategy Iteration and Its Polynomial Complexity in 2-Player Turn-Based Stochastic Games

The paper "Strategy iteration is strongly polynomial for 2-player turn-based stochastic games with a constant discount factor" gives a rigorous analysis of the strategy iteration algorithm on a specific class of stochastic games. The authors, Hansen, Miltersen, and Zwick, tighten the known bound on the number of iterations of Howard's policy iteration for discounted MDPs and show that the same bound holds for strategy iteration on two-player turn-based stochastic games with a constant discount factor. This resolves a long-standing open problem in algorithmic game theory.

Background and Context

The research concerns Markov Decision Processes (MDPs) and their generalization to stochastic games, specifically 2-player Turn-Based Stochastic Games (2TBSGs). These games model decision-making under uncertainty with an adversary: each state is controlled by one of two players with opposing objectives, one aiming to minimize the expected cost and the other to maximize it. The paper considers the infinite-horizon setting with discounted zero-sum rewards.
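As a rough illustration of what "solving" such a game means, the optimal values can be characterized by a system of min/max equations. The notation here is an assumption made for this sketch and is not taken from the paper: $S_1$ and $S_2$ are the states controlled by the minimizer and the maximizer, $A_i$ is the set of actions available at state $i$, $c_a$ is the cost of action $a$, and $p_{a,j}$ is the probability that action $a$ leads to state $j$.

$$
v_i = \min_{a \in A_i} \Bigl( c_a + \gamma \sum_{j} p_{a,j}\, v_j \Bigr) \ \text{ for } i \in S_1,
\qquad
v_i = \max_{a \in A_i} \Bigl( c_a + \gamma \sum_{j} p_{a,j}\, v_j \Bigr) \ \text{ for } i \in S_2.
$$

A discounted MDP is the special case in which $S_2$ is empty, so only the minimization remains.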

Ye's then-recent work on MDPs showed that Howard's policy iteration algorithm and the simplex method with Dantzig's pivoting rule solve discounted MDPs with a constant discount factor in strongly polynomial time. This paper refines that analysis and extends the improved bound to strategy iteration for 2TBSGs, a harder setting because two players with opposing objectives interact.

Main Contributions

Improved Iteration Bounds: The paper sharpens Ye's analysis, showing that Howard's policy iteration algorithm terminates after at most $O\bigl(\frac{m}{1-\gamma}\log\bigl(\frac{n}{1-\gamma}\bigr)\bigr)$ iterations on MDPs, where $m$ is the total number of actions, $n$ is the number of states, and $0<\gamma<1$ is the discount factor. This improves Ye's bound of $O\bigl(\frac{mn}{1-\gamma}\log\bigl(\frac{n}{1-\gamma}\bigr)\bigr)$ by a factor of $n$.
Extension to Strategy Iteration: The paper shows that the same bound applies to strategy iteration on 2TBSGs, so these games can be solved in strongly polynomial time when the discount factor is constant. This is the first strongly polynomial algorithm for this class of games.
Game-Theoretic Insight: The proof establishes relations similar to those used by Ye, but without relying on linear programming (LP), since no succinct LP formulation of 2TBSGs is known. Instead, it works with game-theoretic analogues of LP duality and complementary slackness, which may also be useful for analyzing other problems that lack an LP formulation.
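To get a feel for the two bounds quoted in the abstract, the following sketch evaluates both expressions for a sample instance. The function names and the sample values of $n$, $m$, and $\gamma$ are illustrative assumptions, and the constants hidden by the $O$-notation are ignored, so only the ratio between the two numbers is meaningful.

```python
import math


def ye_bound(n, m, gamma):
    """Ye's bound, (m*n / (1-gamma)) * log(n / (1-gamma)), without the hidden constant."""
    return m * n / (1.0 - gamma) * math.log(n / (1.0 - gamma))


def improved_bound(n, m, gamma):
    """The improved bound, (m / (1-gamma)) * log(n / (1-gamma)), without the hidden constant."""
    return m / (1.0 - gamma) * math.log(n / (1.0 - gamma))


# Hypothetical instance: 100 states, 300 actions in total, discount factor 0.9.
n, m, gamma = 100, 300, 0.9
ratio = ye_bound(n, m, gamma) / improved_bound(n, m, gamma)
print(f"improved bound is smaller by a factor of about {ratio:.1f}")
```

The two expressions differ only in the factor $n$, so the ratio equals the number of states regardless of $m$ and $\gamma$.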

Methodological Approach

The authors proceed in steps, combining combinatorial and algebraic arguments. Each iteration of the algorithm evaluates the current strategies and switches to improving actions, and the analysis tracks how quickly the resulting values approach the optimal ones, using the contraction property of the discounted value operators carried over from the theory of MDPs. Following Ye's approach, flux vectors are then used to turn this progress in value into progress on actions: after a bounded number of iterations, some action is ruled out and never used again, which yields the overall bound on the number of iterations.
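The general shape of strategy iteration can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the paper's formulation: the game representation (`actions`, `owner`), all function names, and the tiny two-state example are hypothetical. Strategies are evaluated by fixed-point iteration rather than by solving a linear system, and the minimizer is the player whose strategy is improved while the maximizer best-responds.

```python
def q_value(action, v, gamma):
    """One-step value of an action: cost plus discounted expected value of the next state."""
    cost, probs = action
    return cost + gamma * sum(p * v[t] for t, p in probs.items())


def evaluate(actions, sigma, gamma, tol=1e-12):
    """Approximate the values of a fixed strategy profile by fixed-point iteration."""
    v = [0.0] * len(actions)
    while True:
        new_v = [q_value(actions[s][sigma[s]], v, gamma) for s in range(len(actions))]
        if max(abs(a - b) for a, b in zip(new_v, v)) < tol:
            return new_v
        v = new_v


def improve(actions, owner, sigma, v, gamma, player, eps=1e-9):
    """Switch every state of `player` to its best action w.r.t. v (Howard-style switching)."""
    sign = -1.0 if player == "min" else 1.0  # the minimizer prefers smaller values
    new_sigma, changed = list(sigma), False
    for s in range(len(actions)):
        if owner[s] != player:
            continue
        current = sign * q_value(actions[s][sigma[s]], v, gamma)
        best = max(range(len(actions[s])),
                   key=lambda a: sign * q_value(actions[s][a], v, gamma))
        if sign * q_value(actions[s][best], v, gamma) > current + eps:
            new_sigma[s], changed = best, True
    return new_sigma, changed


def best_response(actions, owner, sigma, gamma):
    """Policy iteration for the maximizer while the minimizer's choices stay fixed."""
    while True:
        v = evaluate(actions, sigma, gamma)
        sigma, changed = improve(actions, owner, sigma, v, gamma, "max")
        if not changed:
            return sigma, v


def strategy_iteration(actions, owner, gamma):
    """Alternate maximizer best responses with improving switches for the minimizer."""
    sigma, rounds = [0] * len(actions), 0
    while True:
        sigma, v = best_response(actions, owner, sigma, gamma)
        sigma, changed = improve(actions, owner, sigma, v, gamma, "min")
        rounds += 1
        if not changed:
            return sigma, v, rounds


# Hypothetical game: state 0 belongs to the minimizer, state 1 to the maximizer.
# Each action is (cost, {next_state: probability}).
actions = [
    [(2.0, {0: 1.0}), (0.0, {1: 1.0})],  # state 0: stay and pay 2, or move to state 1
    [(0.0, {1: 1.0}), (3.0, {0: 1.0})],  # state 1: stay for free, or collect 3 and move to 0
]
owner = ["min", "max"]
sigma, v, rounds = strategy_iteration(actions, owner, gamma=0.5)
print(sigma, [round(x, 6) for x in v], rounds)
```

On this example both players end up choosing their second action, with values of approximately 2 and 4 for the two states. If no state is owned by the maximizer, `best_response` reduces to a single evaluation and the loop behaves like Howard's policy iteration on an MDP.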

Implications and Future Directions

The results resolve a question that had been open for decades in computational game theory and suggest further directions, such as games under less restrictive assumptions (for example, non-constant discount factors). They also highlight that game-theoretic quantities can stand in for LP-based measures when analyzing algorithms such as the simplex or interior point methods on two-player stochastic games.

The findings are also relevant to computational economics, AI game-solving frameworks, and automated decision-making systems that must compute optimal strategies in uncertain environments. Future work might extend these techniques to games with more than two players or to concurrent (simultaneous-move) games, or apply similar constructs in other fields, such as network theory or mechanism design, where stochastic effects pose challenges similar to those in 2TBSGs.