A Time and Space Efficient Algorithm for Contextual Linear Bandits
arXiv:1207.3024

Abstract
We consider a multi-armed bandit problem where payoffs are a linear function of an observed stochastic contextual variable. In the scenario where there exists a gap between optimal and suboptimal rewards, several algorithms have been proposed that achieve $O(\log T)$ regret after $T$ time steps. However, proposed methods either have a computation complexity per iteration that scales linearly with $T$ or achieve regrets that grow linearly with the number of contexts $|\mathcal{X}|$. We propose an $\epsilon$-greedy type of algorithm that solves both limitations. In particular, when contexts are variables in $\mathbb{R}^d$, we prove that our algorithm has a constant computation complexity per iteration of $O(\mathrm{poly}(d))$ and can achieve a regret of $O(\mathrm{poly}(d) \log T)$ even when $|\mathcal{X}| = \Omega(2^d)$. In addition, unlike previous algorithms, its space complexity scales like $O(Kd^2)$ and does not grow with $T$.
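To make the complexity claims concrete, below is a minimal illustrative sketch of an $\epsilon$-greedy linear contextual bandit, not the paper's exact algorithm. Each of the $K$ arms keeps a $d \times d$ inverse Gram matrix and a $d$-vector (so space is $O(Kd^2)$, independent of $T$), and a Sherman-Morrison rank-1 update keeps the per-step cost polynomial in $d$. All names, the context distribution, the fixed exploration rate, and the noise model are assumptions for illustration.

```python
import numpy as np

def run_eps_greedy(T, K, d, theta, eps=0.05, lam=1.0, noise=0.1, seed=0):
    """Illustrative epsilon-greedy for linear contextual bandits.

    theta: (K, d) true arm parameters, unknown to the learner
           (used only to simulate rewards and measure regret).
    Space is O(K d^2): one inverse Gram matrix and one d-vector per arm.
    Each step costs O(K d^2) thanks to Sherman-Morrison rank-1 updates,
    so per-iteration cost does not grow with T.
    """
    rng = np.random.default_rng(seed)
    # Per-arm ridge regression state: (X^T X + lam*I)^-1 and X^T y.
    A_inv = np.stack([np.eye(d) / lam for _ in range(K)])
    b = np.zeros((K, d))
    regret = 0.0
    for t in range(T):
        x = rng.standard_normal(d) / np.sqrt(d)   # stochastic context (assumed i.i.d.)
        if rng.random() < eps:
            a = int(rng.integers(K))              # explore uniformly
        else:
            # theta_hat for every arm, then pick the greedy arm for context x.
            est = (A_inv @ b[:, :, None]).squeeze(-1)
            a = int(np.argmax(est @ x))
        r = theta[a] @ x + noise * rng.standard_normal()
        # Sherman-Morrison: (A + x x^T)^-1 from A^-1 in O(d^2).
        Ax = A_inv[a] @ x
        A_inv[a] -= np.outer(Ax, Ax) / (1.0 + x @ Ax)
        b[a] += r * x
        regret += np.max(theta @ x) - theta[a] @ x
    return regret
```

Note that the paper's algorithm uses a decaying, gap-dependent exploration schedule to obtain $O(\mathrm{poly}(d)\log T)$ regret; a fixed `eps` as above incurs $\Theta(\epsilon T)$ exploration regret and is shown only to illustrate the time and space bookkeeping.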