
Thompson Sampling for Contextual Bandits with Linear Payoffs (1209.3352v4)

Published 15 Sep 2012 in cs.LG, cs.DS, and stat.ML

Abstract: Thompson Sampling is one of the oldest heuristics for multi-armed bandit problems. It is a randomized algorithm based on Bayesian ideas, and has recently generated significant interest after several studies demonstrated it to have better empirical performance compared to the state-of-the-art methods. However, many questions regarding its theoretical performance remained open. In this paper, we design and analyze a generalization of Thompson Sampling algorithm for the stochastic contextual multi-armed bandit problem with linear payoff functions, when the contexts are provided by an adaptive adversary. This is among the most important and widely studied versions of the contextual bandits problem. We provide the first theoretical guarantees for the contextual version of Thompson Sampling. We prove a high probability regret bound of $\tilde{O}(d^{3/2}\sqrt{T})$ (or $\tilde{O}(d\sqrt{T \log(N)})$), which is the best regret bound achieved by any computationally efficient algorithm available for this problem in the current literature, and is within a factor of $\sqrt{d}$ (or $\sqrt{\log(N)}$) of the information-theoretic lower bound for this problem.

Authors (2)
  1. Shipra Agrawal (33 papers)
  2. Navin Goyal (42 papers)
Citations (953)

Summary

  • The paper establishes the first theoretical guarantees for Thompson Sampling in contextual bandits with linear payoffs, achieving regret bounds near the information-theoretic limit.
  • It employs a Bayesian framework with Gaussian priors and novel martingale-based techniques to update posterior distributions effectively.
  • The results offer practical benefits for online advertising, personalized recommendations, and financial trading by ensuring computational efficiency in large-scale decision-making.

Thompson Sampling for Contextual Bandits with Linear Payoffs: An Overview

Introduction

The paper "Thompson Sampling for Contextual Bandits with Linear Payoffs" authored by Shipra Agrawal and Navin Goyal examines an extension of the Thompson Sampling (TS) algorithm for the contextual multi-armed bandit (MAB) problem. This paper is pivotal as it presents the first theoretical performance guarantees for TS in the contextual bandit setting with linear payoffs under an adaptive adversary. The authors emphasize the theoretical contributions by providing a high probability regret bound of O~(d3/2T)\tilde{O}(d^{3/2}\sqrt{T}), which is close to the information-theoretic lower bound and is the best bound achieved by any computationally efficient algorithm for the problem to date.

Problem Setting

The contextual multi-armed bandit problem involves making sequential decisions over $T$ rounds, where in each round the learner is presented with $N$ actions (or arms), each associated with a $d$-dimensional context vector. The objective is to maximize cumulative reward by judiciously balancing the exploration-exploitation trade-off. The focus here is on linear payoff functions, where the expected reward for selecting an arm is a linear function of its context vector. The learner's challenge is to compete with the best linear predictor in hindsight.
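To make this interaction protocol concrete, the following toy simulator (an illustration by this summary, not code from the paper; the class name, noise model, and parameters are assumptions) draws $N$ fresh context vectors each round and returns a noisy reward whose mean is the inner product of the chosen arm's context with an unknown parameter vector. Note that contexts are drawn at random here for simplicity, whereas the paper allows them to be chosen by an adaptive adversary.

```python
import numpy as np

class LinearContextualBandit:
    """Toy environment: the reward of arm i at round t has mean b_i(t)^T mu."""

    def __init__(self, d, n_arms, noise_std=0.1, seed=0):
        self.rng = np.random.default_rng(seed)
        self.mu = self.rng.normal(size=d)   # unknown parameter, hidden from the learner
        self.d, self.n_arms, self.noise_std = d, n_arms, noise_std

    def contexts(self):
        """Return the N context vectors b_i(t) for the current round."""
        return self.rng.normal(size=(self.n_arms, self.d))

    def reward(self, contexts, arm):
        """Noisy linear reward for the chosen arm."""
        return contexts[arm] @ self.mu + self.rng.normal(scale=self.noise_std)
```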

Algorithmic Design

The proposed variant of TS for the contextual bandit problem is built upon the Bayesian framework. The algorithm works by maintaining a posterior distribution over the parameters of the arms' reward distributions and updating this distribution as observations are gathered. The Thompson Sampling approach is instantiated with Gaussian prior and likelihood functions, since a Gaussian prior paired with a Gaussian likelihood yields a Gaussian posterior that can be updated in closed form. In each round, a parameter vector is sampled from the posterior, and the arm with the maximum expected reward under that sample is chosen (a minimal implementation sketch follows the parameter list below).

Specifically, the parameters include:

  • A set $\Theta$ of parameters $\tilde{\mu}$.
  • A prior distribution $P(\tilde{\mu})$.
  • Past observations $\mathcal{D}$.
  • A likelihood function $P(r \mid b, \tilde{\mu})$.
  • A posterior distribution updated using Bayes' theorem.
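
With the Gaussian prior and likelihood above, the posterior over $\tilde{\mu}$ remains Gaussian and can be updated cheaply each round. The sketch below shows one standard way to implement this loop (a minimal sketch, assuming the toy environment interface from the earlier block; the exploration scale `v` and the identity-initialized design matrix are common choices that stand in for the paper's exact constants):

```python
import numpy as np

def linear_thompson_sampling(env, T, d, v=0.5, seed=1):
    """Thompson Sampling with a Gaussian posterior for linear contextual bandits.

    env must expose contexts() -> (N, d) array and reward(contexts, arm) -> float.
    """
    rng = np.random.default_rng(seed)
    B = np.eye(d)          # design matrix: prior precision plus sum of b b^T
    f = np.zeros(d)        # sum of reward-weighted contexts
    rewards = []

    for _ in range(T):
        contexts = env.contexts()                        # N context vectors this round
        mu_hat = np.linalg.solve(B, f)                   # posterior mean
        cov = v ** 2 * np.linalg.inv(B)                  # posterior covariance
        mu_tilde = rng.multivariate_normal(mu_hat, cov)  # draw a parameter sample
        arm = int(np.argmax(contexts @ mu_tilde))        # arm with highest sampled payoff
        r = env.reward(contexts, arm)

        b = contexts[arm]                                # posterior update with (b, r)
        B += np.outer(b, b)
        f += r * b
        rewards.append(r)

    return np.array(rewards)
```

For example, `linear_thompson_sampling(LinearContextualBandit(d=5, n_arms=10), T=1000, d=5)` runs the full interaction loop; the matrix `B` plays the role of the ridge-regression design matrix that also appears in the regret analysis.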

Regret Analysis

The regret analysis of the proposed TS algorithm is one of the paper's most significant contributions. The regret bound derived is $\tilde{O}(d^{3/2}\sqrt{T})$, with a more refined bound of $\tilde{O}(d\sqrt{T \log(N)})$ when the number of arms $N$ is finite. These bounds are within a factor of $\sqrt{d}$ (or $\sqrt{\log(N)}$) of the information-theoretic lower bound $\Omega(d\sqrt{T})$ for the contextual linear bandit problem. The analysis leverages novel martingale-based techniques to handle the complexities introduced by the contextual nature of the bandit problem and the adaptive adversarial setting.
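
For reference, the quantity bounded by these results is the cumulative regret against the best arm in each round; written out in standard notation (the symbols below are this summary's, chosen to match the abstract, with $b_i(t)$ the context of arm $i$ at round $t$, $a(t)$ the arm played, and $\mu$ the unknown parameter):

```latex
\[
  \operatorname{regret}(t) = \max_i \, b_i(t)^\top \mu \;-\; b_{a(t)}(t)^\top \mu,
  \qquad
  R(T) = \sum_{t=1}^{T} \operatorname{regret}(t),
\]
\[
  R(T) = \tilde{O}\!\bigl(d^{3/2}\sqrt{T}\bigr)
  \quad\text{or}\quad
  \tilde{O}\!\bigl(d\sqrt{T \log N}\bigr)
  \quad\text{with high probability.}
\]
```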

Key theoretical results include:

  • Establishing high probability bounds on the variance of estimates.
  • Proving concentration inequalities for the posterior distributions.
  • Utilizing these inequalities to derive the overall regret bound.

Practical Implications and Future Directions

The findings have significant practical applications, particularly in fields such as online advertising, personalized recommendations, and financial trading, where efficient decision-making under uncertainty is crucial. The efficiency of the TS algorithm in terms of computational complexity (running in time polynomial in $d$) makes it a robust choice for large-scale industrial applications.

Future research directions might explore:

  • Extending the algorithm and analysis to other contextual bandit settings, such as those involving non-linear payoff functions.
  • Investigating the impact of delayed and batched feedback, which are common in many real-world applications.
  • Analyzing the performance of TS in the agnostic setting, where the realizability assumption is relaxed.

Conclusion

Shipra Agrawal and Navin Goyal's work provides a comprehensive theoretical foundation for using Thompson Sampling in the contextual bandit setting with linear payoffs. The introduction of martingale-based techniques in the regret analysis sets a precedent for future research in this domain. The strong regret bounds illustrated in the paper reinforce Thompson Sampling's utility, bridging the gap between theoretical guarantees and practical performance.