I. Introduction
Stochastic games, introduced in [1], find wide application in economics [2], evolutionary biology [3], [4], political science [5], and operations research [6], among other fields. In a stochastic game, the players’ actions determine not only the payoff they receive at the current stage but also the transition probability to the next stage of the game. Each player’s goal is to find a strategy, i.e., a map from states to actions, that maximizes its payoff. A common solution concept is the celebrated Nash equilibrium, at which no player has an incentive to unilaterally change its strategy given that the other players play their equilibrium strategies. Finding a Nash equilibrium of a stochastic game requires solving a fixed-point Bellman equation involving a minimax problem [1]. One method for solving such an equation is policy iteration (PI), which iteratively updates the players’ policies by alternating policy evaluation and policy improvement steps [7]–[11]. PI methods play a fundamental role in the recent development of data-driven learning algorithms, such as decentralized Q-learning [12], [13], for solving stochastic games.
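To make the preceding discussion concrete, the fixed-point equation can be sketched for the two-player zero-sum discounted setting of [1]; the notation below (state $s$, actions $a,b$, stage payoff $r$, discount factor $\gamma \in [0,1)$, transition kernel $p$, and strategies $\mu,\nu$) is illustrative rather than taken from this section:

\[
v^{\ast}(s) \;=\; \operatorname{val}\Big[\, r(s,a,b) \;+\; \gamma \sum_{s'} p(s' \mid s,a,b)\, v^{\ast}(s') \,\Big]_{a,b},
\]

where $\operatorname{val}[\cdot]_{a,b}$ denotes the minimax value of the matrix game whose rows and columns are indexed by the two players’ actions. Under these assumptions, one common PI sketch alternates (i) policy evaluation, solving the linear system

\[
v^{\mu,\nu}(s) \;=\; \mathbb{E}_{a \sim \mu(\cdot \mid s),\; b \sim \nu(\cdot \mid s)}\Big[\, r(s,a,b) + \gamma \sum_{s'} p(s' \mid s,a,b)\, v^{\mu,\nu}(s') \,\Big]
\]

for the current strategy pair $(\mu,\nu)$, and (ii) policy improvement, replacing $(\mu,\nu)$ with the minimax optimal (in general mixed) strategies of the matrix game in brackets above evaluated at $v^{\mu,\nu}$; PI variants differ in how these two steps are interleaved.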