Order-Optimal Regret with Novel Policy Gradient Approaches in Infinite-Horizon Average Reward MDPs (2404.02108v2)
Abstract: We present two Policy Gradient-based algorithms with general parametrization in the context of infinite-horizon average reward Markov Decision Processes (MDPs). The first employs Implicit Gradient Transport for variance reduction, ensuring an expected regret of the order $\tilde{\mathcal{O}}(T^{2/3})$. The second, rooted in Hessian-based techniques, ensures an expected regret of the order $\tilde{\mathcal{O}}(\sqrt{T})$. These results significantly improve the state-of-the-art $\tilde{\mathcal{O}}(T^{3/4})$ regret and achieve the theoretical lower bound. We also show that the average-reward function is approximately $L$-smooth, a property that earlier works assumed rather than proved.
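For reference, the regret and smoothness notions cited in the abstract follow the standard conventions for this setting; the sketch below uses the usual definitions (with $J^*$ the optimal average reward, $J(\theta)$ the average reward of the parametrized policy $\pi_\theta$, and $r(s_t, a_t)$ the reward collected at step $t$), stated here as assumptions rather than quotations from the paper:

$$
\mathrm{Reg}(T) \;=\; \mathbb{E}\!\left[\sum_{t=0}^{T-1}\bigl(J^* - r(s_t, a_t)\bigr)\right],
\qquad
\bigl\|\nabla J(\theta) - \nabla J(\theta')\bigr\| \;\le\; L\,\bigl\|\theta - \theta'\bigr\| \quad \forall\, \theta, \theta'.
$$

Under these definitions, the claimed bounds say the cumulative shortfall against the optimal average reward grows no faster than $\tilde{\mathcal{O}}(T^{2/3})$ and $\tilde{\mathcal{O}}(\sqrt{T})$ respectively, the latter matching the known lower bound up to logarithmic factors.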