Papers
Topics
Authors
Recent
Gemini 2.5 Flash
Gemini 2.5 Flash
97 tokens/sec
GPT-4o
53 tokens/sec
Gemini 2.5 Pro Pro
44 tokens/sec
o3 Pro
5 tokens/sec
GPT-4.1 Pro
47 tokens/sec
DeepSeek R1 via Azure Pro
28 tokens/sec
2000 character limit reached

Posterior sampling for reinforcement learning: worst-case regret bounds (1705.07041v3)

Published 19 May 2017 in cs.LG

Abstract: We present an algorithm based on posterior sampling (aka Thompson sampling) that achieves near-optimal worst-case regret bounds when the underlying Markov Decision Process (MDP) is communicating with a finite, though unknown, diameter. Our main result is a high probability regret upper bound of $\tilde{O}(DS\sqrt{AT})$ for any communicating MDP with $S$ states, $A$ actions and diameter $D$. Here, regret compares the total reward achieved by the algorithm to the total expected reward of an optimal infinite-horizon undiscounted average reward policy, in time horizon $T$. This result closely matches the known lower bound of $\Omega(\sqrt{DSAT})$. Our techniques involve proving some novel results about the anti-concentration of Dirichlet distribution, which may be of independent interest.

User Edit Pencil Streamline Icon: https://streamlinehq.com
Authors (2)
  1. Shipra Agrawal (33 papers)
  2. Randy Jia (5 papers)
Citations (35)

Summary

We haven't generated a summary for this paper yet.