Towards Tight Bounds on the Sample Complexity of Average-reward MDPs (2106.07046v1)

Published 13 Jun 2021 in cs.LG, cs.DS, and math.OC

Abstract: We prove new upper and lower bounds on the sample complexity of finding an $\epsilon$-optimal policy of an infinite-horizon average-reward Markov decision process (MDP) given access to a generative model. When the mixing time of the probability transition matrix of every policy is at most $t_\mathrm{mix}$, we provide an algorithm that solves the problem using $\widetilde{O}(t_\mathrm{mix} \epsilon^{-3})$ (oblivious) samples per state-action pair. Further, we provide a lower bound showing that a linear dependence on $t_\mathrm{mix}$ is necessary in the worst case for any algorithm that computes oblivious samples. We obtain our results by establishing connections between infinite-horizon average-reward MDPs and discounted MDPs, which may be of further utility.
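The abstract does not spell out the algorithm, so the following is only a rough sketch of the general idea it names: draw a fixed, non-adaptive ("oblivious") number of samples per state-action pair from a generative model, then solve a discounted MDP on the empirical model as a proxy for the average-reward problem. All function names, the plain value-iteration solver, and the discount choice `gamma = 1 - eps / t_mix` are illustrative assumptions, not the paper's method or constants.

```python
import random


def sample_empirical_mdp(P, n_samples, rng):
    """Build an empirical transition model from a generative model.

    P[s][a] is a list of next-state probabilities. The same number of
    samples is drawn for every (s, a) pair, regardless of what is observed
    (an "oblivious" sampling scheme in the abstract's terminology).
    """
    n_states = len(P)
    P_hat = []
    for s in range(n_states):
        row = []
        for a in range(len(P[s])):
            draws = rng.choices(range(n_states), weights=P[s][a], k=n_samples)
            row.append([draws.count(t) / n_samples for t in range(n_states)])
        P_hat.append(row)
    return P_hat


def q_values(P, r, gamma, v, s):
    """One-step lookahead values for every action in state s."""
    return [
        r[s][a] + gamma * sum(p * x for p, x in zip(P[s][a], v))
        for a in range(len(P[s]))
    ]


def discounted_value_iteration(P, r, gamma, iters=1000):
    """Plain value iteration on a discounted MDP (an assumed stand-in solver).

    Returns the greedy policy as a list of action indices, one per state.
    """
    n_states = len(P)
    v = [0.0] * n_states
    for _ in range(iters):
        v = [max(q_values(P, r, gamma, v, s)) for s in range(n_states)]
    policy = []
    for s in range(n_states):
        q = q_values(P, r, gamma, v, s)
        policy.append(max(range(len(q)), key=q.__getitem__))
    return policy


def average_reward(P, r, policy, iters=1000):
    """Estimate the long-run average reward of a policy by power iteration
    on the state distribution (assumes the induced chain mixes)."""
    n = len(P)
    mu = [1.0 / n] * n
    for _ in range(iters):
        mu = [sum(mu[s] * P[s][policy[s]][t] for s in range(n)) for t in range(n)]
    return sum(mu[s] * r[s][policy[s]] for s in range(n))
```

As a usage sketch, one might set `gamma = 1 - eps / t_mix` for a target accuracy `eps` and an assumed mixing-time bound `t_mix`, call `sample_empirical_mdp` with a seeded `random.Random`, run `discounted_value_iteration` on the result, and evaluate the returned policy on the true model with `average_reward`. The per-pair sample count needed for a guarantee is the subject of the paper's bounds and is not reproduced here.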

Authors (2)

Yujia Jin (24 papers)
Aaron Sidford (122 papers)

Citations (25)

