Asymptotic Randomised Control with applications to bandits (2010.07252v2)
Abstract: We consider a general multi-armed bandit problem with correlated (and simple contextual and restless) elements as a relaxed control problem. By introducing entropy regularisation, we obtain a smooth asymptotic approximation to the value function. This yields a novel semi-index approximation of the optimal decision process. The semi-index can be interpreted as explicitly balancing the exploration-exploitation trade-off, as in the optimistic (UCB) principle, where the learning premium describes the asymmetry of information available in the environment and the non-linearity of the reward function. The resulting Asymptotic Randomised Control (ARC) algorithm compares favourably with other approaches to correlated multi-armed bandits.
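The abstract does not state ARC's formulas. The sketch below is only a generic illustration of why entropy regularisation smooths a maximum over arms; it is not the ARC algorithm. For estimated arm values q_a and a temperature λ > 0, the standard identity is max over distributions p of Σ_a p_a q_a + λ H(p) = λ log Σ_a exp(q_a / λ), attained by the softmax (Gibbs) policy p_a ∝ exp(q_a / λ). This smooth value tends to max_a q_a as λ → 0, and its maximiser is a randomised choice of arm. The function names `entropy_regularised_max` and `sample_arm` are hypothetical.

```python
import numpy as np


def entropy_regularised_max(q, lam):
    """Solve max_p <p, q> + lam * H(p) over the probability simplex.

    Generic illustration (not the ARC algorithm): the maximiser is the
    softmax policy p_a ∝ exp(q_a / lam), and the optimal value is the
    smooth log-sum-exp  lam * log(sum_a exp(q_a / lam)).
    """
    q = np.asarray(q, dtype=float)
    z = q / lam
    m = z.max()               # shift by the max for numerical stability
    w = np.exp(z - m)
    policy = w / w.sum()
    value = lam * (m + np.log(w.sum()))
    return value, policy


def sample_arm(q, lam, rng):
    """Draw an arm index from the entropy-regularised (softmax) policy."""
    _, p = entropy_regularised_max(q, lam)
    return rng.choice(len(p), p=p)


if __name__ == "__main__":
    q = [1.0, 2.0, 0.5]       # hypothetical estimated arm values
    rng = np.random.default_rng(0)
    for lam in (1.0, 0.1, 0.01):
        v, p = entropy_regularised_max(q, lam)
        print(lam, round(v, 4), np.round(p, 3), sample_arm(q, lam, rng))
```

Smaller λ gives a value closer to the hard maximum and a policy concentrated on the best-looking arm. Larger λ spreads probability across arms, which is one simple way a regularised relaxation introduces exploration.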