The Extended UCB Policies for Frequentist Multi-armed Bandit Problems

(1112.1768)
Published Dec 8, 2011 in cs.LG, math.PR, math.ST, and stat.TH

Abstract

The multi-armed bandit (MAB) problem is a widely studied model in the field of reinforcement learning. This paper considers two cases of the classical MAB model: light-tailed and heavy-tailed reward distributions. For the light-tailed (i.e., sub-Gaussian) case, we propose the UCB1-LT policy, which achieves the optimal regret growth order of $O(\log T)$. For the heavy-tailed case, we introduce the extended robust UCB policy, an extension of the UCB policies proposed by Bubeck et al. (2013) and Lattimore (2017). Those earlier UCB policies require knowledge of an upper bound on specific moments of the reward distributions, which can be hard to obtain in practice. Our extended robust UCB policy eliminates this requirement while still achieving the optimal regret growth order $O(\log T)$, thereby broadening the applicability of UCB policies to heavy-tailed reward distributions.
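
For readers unfamiliar with UCB-style index policies, the following is a minimal sketch of the classical UCB1 rule on a simulated Bernoulli bandit. It is not the UCB1-LT or extended robust UCB policy from the paper, whose exact indices are not given in the abstract. The function names (`ucb1`, `make_bernoulli_bandit`), the arm probabilities, the horizon, and the seed are all illustrative assumptions.

```python
import math
import random


def ucb1(pull, n_arms, horizon):
    """Run the classical UCB1 rule for `horizon` rounds; return per-arm pull counts.

    Illustrative only: this is the textbook UCB1 index, not the paper's UCB1-LT.
    """
    counts = [0] * n_arms     # number of times each arm has been played
    means = [0.0] * n_arms    # running empirical mean reward of each arm

    for t in range(1, horizon + 1):
        if t <= n_arms:
            arm = t - 1       # play every arm once to initialise the estimates
        else:
            # index = empirical mean + exploration bonus that shrinks with plays
            arm = max(
                range(n_arms),
                key=lambda i: means[i] + math.sqrt(2.0 * math.log(t) / counts[i]),
            )
        reward = pull(arm)
        counts[arm] += 1
        means[arm] += (reward - means[arm]) / counts[arm]  # incremental mean
    return counts


def make_bernoulli_bandit(probs, seed=0):
    """Hypothetical test environment: arm i pays 1 with probability probs[i]."""
    rng = random.Random(seed)
    return lambda arm: 1.0 if rng.random() < probs[arm] else 0.0


probs = [0.2, 0.5, 0.8]       # assumed arm means (bounded, hence light-tailed)
counts = ucb1(make_bernoulli_bandit(probs), n_arms=len(probs), horizon=5000)

# Pseudo-regret: expected reward lost by not always playing the best arm.
regret = sum((max(probs) - p) * c for p, c in zip(probs, counts))
print(counts, regret)
```

In this sketch, suboptimal arms are played less and less often as their confidence bonus shrinks, which is the mechanism behind logarithmic regret growth in the bounded-reward setting. Heavy-tailed rewards would typically call for a robust mean estimator in place of the empirical mean, which this example does not attempt.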
