Extended UCB Policies for Frequentist Multi-armed Bandit Problems (1112.1768v4)
Abstract: The multi-armed bandit (MAB) problem is a widely studied model for sequential decision making in operations research and reinforcement learning. This paper considers the classical MAB model with heavy-tailed reward distributions. We introduce the extended robust UCB policy, an extension of the pioneering UCB policies proposed by Bubeck et al. [5] and Lattimore [22]. Those earlier UCB policies require strict conditions on the reward distributions that can be hard to guarantee in practical scenarios. Our extended robust UCB generalizes Lattimore's seminal work (for moments of orders $p=4$ and $q=2$) to arbitrarily chosen $p>q>1$, provided the two moments satisfy a known controlled relationship, while still achieving the optimal regret growth order $O(\log T)$, thereby broadening the applicability of UCB policies to heavy-tailed reward distributions. Furthermore, we achieve a near-optimal regret order without any knowledge of the reward distributions as long as their $p$-th moments exist for some $p>1$.
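For intuition about the family of policies the abstract refers to, below is a minimal Python sketch of a robust UCB with truncated empirical means in the spirit of Bubeck et al. [5]; it is not the paper's extended policy. The function names (`robust_ucb`, `truncated_mean`), the moment bound `v` on $\mathbb{E}|X|^{1+\epsilon}$, and the `bandit` sampling interface are illustrative assumptions, not the authors' notation.

```python
import numpy as np


def truncated_mean(rewards, u):
    """Empirical mean with observations exceeding u in absolute value zeroed out."""
    r = np.asarray(rewards, dtype=float)
    if r.size == 0:
        return 0.0
    return float(np.mean(np.where(np.abs(r) <= u, r, 0.0)))


def robust_ucb(bandit, K, T, epsilon=1.0, v=1.0):
    """Truncated-mean robust UCB sketch (illustrative, Bubeck et al. style).

    Assumes E|X|^(1+epsilon) <= v for every arm; bandit(k) returns one reward of arm k.
    """
    pulls = [[] for _ in range(K)]
    for k in range(K):                       # pull each arm once to initialize
        pulls[k].append(bandit(k))
    for t in range(K, T):
        scores = []
        for k in range(K):
            n = len(pulls[k])
            log_inv_delta = 2.0 * np.log(t + 1.0)          # delta = (t + 1)^(-2)
            # Truncation level and confidence radius driven by the (1+epsilon)-th moment bound.
            u = (v * n / log_inv_delta) ** (1.0 / (1.0 + epsilon))
            mu_hat = truncated_mean(pulls[k], u)
            radius = 4.0 * v ** (1.0 / (1.0 + epsilon)) * (log_inv_delta / n) ** (epsilon / (1.0 + epsilon))
            scores.append(mu_hat + radius)
        k_star = int(np.argmax(scores))      # play the arm with the largest robust index
        pulls[k_star].append(bandit(k_star))
    return pulls


# Usage example: two arms with heavy-tailed (Pareto) rewards and different mean shifts.
if __name__ == "__main__":
    rng = np.random.default_rng(0)
    means = [0.2, 0.5]
    history = robust_ucb(lambda k: means[k] + rng.pareto(2.5), K=2, T=2000, epsilon=1.0, v=10.0)
    print([len(h) for h in history])  # the better arm should receive most of the pulls
```

The truncation replaces the plain sample mean, which is what makes the confidence bound valid when only a $(1+\epsilon)$-th moment exists; the paper's contribution, per the abstract, is relaxing the moment-order assumptions ($p > q > 1$ with a known relationship) under which such bounds can be constructed.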