Near-optimal Bayesian Solution For Unknown Discrete Markov Decision Process (1906.09114v2)

Published 20 Jun 2019 in cs.LG, cs.AI, cs.GT, and stat.ML

Abstract: We tackle the problem of acting in an unknown finite and discrete Markov Decision Process (MDP) for which the expected shortest path from any state to any other state is bounded by a finite number $D$. An MDP consists of $S$ states and $A$ possible actions per state. Upon choosing an action $a_t$ at state $s_t$, one receives a real-valued reward $r_t$ and transitions to a next state $s_{t+1}$. The reward $r_t$ is generated from a fixed reward distribution depending only on $(s_t, a_t)$ and, similarly, the next state $s_{t+1}$ is generated from a fixed transition distribution depending only on $(s_t, a_t)$. The objective is to maximize the accumulated rewards after $T$ interactions. In this paper, we consider the case where the reward distributions, the transitions, $T$, and $D$ are all unknown. We derive the first polynomial-time Bayesian algorithm, BUCRL, which achieves, up to logarithmic factors, a regret (i.e., the difference between the accumulated rewards of the optimal policy and our algorithm) of the optimal order $\tilde{\mathcal{O}}(\sqrt{DSAT})$. Importantly, our result holds with high probability for the worst-case (frequentist) regret and not the weaker notion of Bayesian regret. We perform experiments in a variety of environments that demonstrate the superiority of our algorithm over previous techniques. Our work also yields several results of independent interest. In particular, we derive a sharper upper bound for the KL-divergence of Bernoulli random variables, as well as sharper upper and lower bounds for Beta and Binomial quantiles. All the bounds are very simple and use only elementary functions.
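
The abstract describes the standard online reinforcement-learning protocol: at each step the agent observes $s_t$, picks $a_t$, draws $r_t$ and $s_{t+1}$ from fixed but unknown distributions, and regret measures the shortfall against the optimal policy over $T$ steps. Below is a minimal sketch of that interaction loop on a hypothetical toy instance (Bernoulli rewards, random transition kernel); the names `P`, `R`, `interact`, and `random_policy` are illustrative assumptions, and BUCRL itself, which maintains Bayesian posteriors over the unknown rewards and transitions, is not reproduced here.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy instance: S states, A actions per state, horizon T.
S, A, T = 5, 2, 10_000

# Fixed but (to the agent) unknown dynamics:
#   P[s, a] is a distribution over next states, R[s, a] a mean reward.
P = rng.dirichlet(np.ones(S), size=(S, A))   # shape (S, A, S)
R = rng.uniform(0.0, 1.0, size=(S, A))       # Bernoulli reward means

def interact(policy, T):
    """Run `policy(s, t)` for T steps and return the accumulated reward."""
    s, total = 0, 0.0
    for t in range(T):
        a = policy(s, t)
        r = rng.binomial(1, R[s, a])   # r_t ~ reward distribution of (s_t, a_t)
        s = rng.choice(S, p=P[s, a])   # s_{t+1} ~ transition distribution
        total += r
    return total

# A uniformly random policy stands in for a learning algorithm here.  The
# regret after T steps is the optimal policy's accumulated reward minus this
# total; the paper's claim is that BUCRL keeps it at the optimal order
# O~(sqrt(D*S*A*T)) with high probability.
random_policy = lambda s, t: int(rng.integers(A))
print(interact(random_policy, T))
```

For reference, the quantity behind the Bernoulli result mentioned at the end is the standard KL-divergence $\mathrm{kl}(p, q) = p \log\frac{p}{q} + (1-p) \log\frac{1-p}{1-q}$; the specific sharpened bound on it is given in the paper itself.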
