Smooth Non-Stationary Bandits (2301.12366v3)
Abstract: In many applications of online decision making, the environment is non-stationary and it is therefore crucial to use bandit algorithms that handle changes. Most existing approaches are designed to protect against non-smooth changes, constrained only by total variation or Lipschitzness over time. However, in practice, environments often change {\em smoothly}, so such algorithms may incur higher-than-necessary regret. We study a non-stationary bandit problem where each arm's mean reward sequence can be embedded into a $\beta$-H\"older function, i.e., a function that is $(\beta-1)$-times Lipschitz-continuously differentiable. The non-stationarity becomes smoother as $\beta$ increases. When $\beta=1$, this corresponds to the non-smooth regime, where \cite{besbes2014stochastic} established a minimax regret of $\tilde \Theta(T^{2/3})$. We show the first separation between the smooth (i.e., $\beta\ge 2$) and non-smooth (i.e., $\beta=1$) regimes by presenting a policy with $\tilde O(k^{4/5} T^{3/5})$ regret on any $k$-armed, $2$-H\"older instance. We complement this result by showing that the minimax regret on the $\beta$-H\"older family of instances is $\Omega(T^{(\beta+1)/(2\beta+1)})$ for any integer $\beta\ge 1$. This matches our upper bound for $\beta=2$ up to logarithmic factors. Furthermore, we validate the effectiveness of our policy through a comprehensive numerical study using real-world click-through rate data.
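To make the problem setup concrete, here is a minimal sketch (not the paper's policy) of a $k$-armed instance whose mean rewards drift smoothly over the horizon, together with a simple sliding-window UCB baseline. The sinusoidal mean functions, the window length, and the baseline itself are illustrative assumptions chosen for this sketch; a smooth curve of this kind is in particular $2$-H\"older, i.e., it has a Lipschitz-continuous derivative.

```python
import numpy as np

# Illustrative sketch: a smooth (2-Holder) non-stationary bandit instance.
# The sine curves below are infinitely differentiable, so each arm's mean
# reward sequence embeds into a function with a Lipschitz-continuous
# derivative. The sliding-window UCB baseline is a hypothetical comparator,
# not the policy analyzed in the paper.

rng = np.random.default_rng(0)
T, k = 10_000, 3

def mean_reward(arm: int, t: int) -> float:
    """Smoothly drifting mean reward of `arm` at round t, with values in [0.1, 0.9]."""
    x = t / T
    return 0.5 + 0.4 * np.sin(2 * np.pi * (x + arm / k))

def pull(arm: int, t: int) -> float:
    """Bernoulli reward with the smooth mean above."""
    return float(rng.random() < mean_reward(arm, t))

window = 500                      # assumed window length for the baseline
history = [[] for _ in range(k)]  # (round, reward) pairs per arm

regret = 0.0
for t in range(T):
    ucbs = []
    for a in range(k):
        recent = [r for (s, r) in history[a] if s > t - window]
        if not recent:
            ucbs.append(float("inf"))  # force one initial pull per arm
        else:
            ucbs.append(np.mean(recent) + np.sqrt(2 * np.log(t + 1) / len(recent)))
    arm = int(np.argmax(ucbs))
    history[arm].append((t, pull(arm, t)))
    # Dynamic regret: compare against the best arm at each round.
    regret += max(mean_reward(a, t) for a in range(k)) - mean_reward(arm, t)

print(f"Dynamic regret of sliding-window UCB on this smooth instance: {regret:.1f}")
```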