Solving infinite-horizon POMDPs with memoryless stochastic policies in state-action space

Published 27 May 2022 in cs.LG, cs.SY, eess.SY, and math.OC (arXiv:2205.14098v1)

Abstract: Reward optimization in fully observable Markov decision processes is equivalent to a linear program over the polytope of state-action frequencies. Taking a similar perspective in the case of partially observable Markov decision processes with memoryless stochastic policies, the problem was recently formulated as the optimization of a linear objective subject to polynomial constraints. Building on this formulation, we present an approach for Reward Optimization in State-Action space (ROSA). We test this approach experimentally on maze navigation tasks. We find that ROSA is computationally efficient and can yield stability improvements over existing methods.
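
To make the state-action perspective concrete, the following is a minimal sketch and not the ROSA method itself. It assumes the discounted setting with normalized frequencies, and every number, name, and kernel in it (`GAMMA`, `MU`, `P`, `R`, `BETA`, `PI_OBS`) is a hypothetical toy choice rather than anything from the paper. It computes the state-action frequencies induced by a memoryless stochastic policy in a two-state POMDP and evaluates the reward as a linear function of those frequencies.

```python
# Toy POMDP: 2 states, 2 actions, 2 observations. All values are illustrative
# assumptions, not taken from the paper.
GAMMA = 0.9                          # discount factor (assumed discounted setting)
MU = [0.5, 0.5]                      # initial state distribution mu(s)
# P[s][a][s2]: probability of moving from state s to state s2 under action a
P = [[[0.9, 0.1], [0.2, 0.8]],
     [[0.7, 0.3], [0.1, 0.9]]]
R = [[1.0, 0.0], [0.0, 2.0]]         # reward r(s, a)
BETA = [[0.8, 0.2], [0.3, 0.7]]      # observation kernel beta(o | s)
PI_OBS = [[0.6, 0.4], [0.1, 0.9]]    # memoryless stochastic policy pi(a | o)


def state_policy(beta, pi_obs):
    """Effective state policy tau(a | s) = sum_o beta(o | s) * pi(a | o)."""
    n_o, n_a = len(pi_obs), len(pi_obs[0])
    return [[sum(b_s[o] * pi_obs[o][a] for o in range(n_o)) for a in range(n_a)]
            for b_s in beta]


def next_state_marginal(eta, s2):
    """Right-hand side of the flow constraint for target state s2."""
    inflow = sum(P[s][a][s2] * eta[s][a]
                 for s in range(len(eta)) for a in range(len(eta[0])))
    return (1 - GAMMA) * MU[s2] + GAMMA * inflow


def state_action_frequencies(tau, iters=2000):
    """Fixed-point iteration for the normalized discounted frequencies eta(s, a).

    The update is a contraction with factor GAMMA, so it converges for GAMMA < 1.
    """
    n_s, n_a = len(tau), len(tau[0])
    eta = [[MU[s] * tau[s][a] for a in range(n_a)] for s in range(n_s)]
    for _ in range(iters):
        rho = [next_state_marginal(eta, s2) for s2 in range(n_s)]
        eta = [[rho[s] * tau[s][a] for a in range(n_a)] for s in range(n_s)]
    return eta


def linear_reward(eta):
    """The objective is linear in eta: sum over (s, a) of r(s, a) * eta(s, a)."""
    return sum(R[s][a] * eta[s][a]
               for s in range(len(eta)) for a in range(len(eta[0])))


TAU = state_policy(BETA, PI_OBS)
ETA = state_action_frequencies(TAU)
REWARD = linear_reward(ETA)
print("state-action frequencies:", ETA)
print("normalized discounted reward:", REWARD)
```

In the fully observable case any state policy is admissible, so the frequencies range over a polytope cut out by the linear flow constraints checked in `next_state_marginal`. Under partial observability the state policy must factor through the observation kernel, as in `state_policy`, and it is this restriction that gives rise to the polynomial constraints mentioned in the abstract.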