Thompson Sampling for Contextual Bandits with Linear Payoffs (1209.3352v4)

Published 15 Sep 2012 in cs.LG, cs.DS, and stat.ML

Abstract: Thompson Sampling is one of the oldest heuristics for multi-armed bandit problems. It is a randomized algorithm based on Bayesian ideas, and has recently generated significant interest after several studies demonstrated it to have better empirical performance compared to the state-of-the-art methods. However, many questions regarding its theoretical performance remained open. In this paper, we design and analyze a generalization of Thompson Sampling algorithm for the stochastic contextual multi-armed bandit problem with linear payoff functions, when the contexts are provided by an adaptive adversary. This is among the most important and widely studied versions of the contextual bandits problem. We provide the first theoretical guarantees for the contextual version of Thompson Sampling. We prove a high probability regret bound of $\tilde{O}(d^{3/2}\sqrt{T})$ (or $\tilde{O}(d\sqrt{T \log(N)})$), which is the best regret bound achieved by any computationally efficient algorithm available for this problem in the current literature, and is within a factor of $\sqrt{d}$ (or $\sqrt{\log(N)}$) of the information-theoretic lower bound for this problem.

Citations (953)

Summary

  • The paper establishes the first theoretical guarantees for Thompson Sampling in contextual bandits with linear payoffs, achieving regret bounds near the information-theoretic limit.
  • It employs a Bayesian framework with Gaussian priors and novel martingale-based techniques to update posterior distributions effectively.
  • The results offer practical benefits for online advertising, personalized recommendations, and financial trading by ensuring computational efficiency in large-scale decision-making.

Thompson Sampling for Contextual Bandits with Linear Payoffs: An Overview

Introduction

The paper "Thompson Sampling for Contextual Bandits with Linear Payoffs" authored by Shipra Agrawal and Navin Goyal examines an extension of the Thompson Sampling (TS) algorithm for the contextual multi-armed bandit (MAB) problem. This paper is pivotal as it presents the first theoretical performance guarantees for TS in the contextual bandit setting with linear payoffs under an adaptive adversary. The authors emphasize the theoretical contributions by providing a high probability regret bound of O~(d3/2T)\tilde{O}(d^{3/2}\sqrt{T}), which is close to the information-theoretic lower bound and is the best bound achieved by any computationally efficient algorithm for the problem to date.

Problem Setting

The contextual multi-armed bandit problem involves making sequential decisions over $T$ rounds, where in each round the learner is presented with $N$ actions (or arms), each associated with a $d$-dimensional context vector. The objective is to maximize cumulative reward by judiciously balancing the exploration-exploitation trade-off. The focus here is on linear payoff functions, where the expected reward for selecting an arm is a linear function of its context vector. The learner's challenge is to compete, in every round, with the arm that is optimal under the unknown parameter governing these linear payoffs.
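
Concretely, with $b_i(t)$ denoting the context of arm $i$ at time $t$, $\mu$ the unknown parameter, and $a(t)$ the arm played (notation chosen here to mirror the paper), the payoff model and the regret being bounded can be written as

$$
\mathbb{E}\big[r_i(t) \mid b_i(t)\big] = b_i(t)^\top \mu,
\qquad
\mathrm{Regret}(T) = \sum_{t=1}^{T} \Big( b_{a^*(t)}(t)^\top \mu - b_{a(t)}(t)^\top \mu \Big),
$$

where $a^*(t) = \arg\max_i \, b_i(t)^\top \mu$ is the arm that is optimal in round $t$ under the true parameter.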

Algorithmic Design

The proposed variant of TS for the contextual bandit problem is built on a Bayesian framework. The algorithm maintains a posterior distribution over the unknown parameter of the linear reward model and updates this distribution as observations are gathered. Thompson Sampling is implemented with a Gaussian prior and Gaussian likelihood, chosen for their convenient conjugacy properties. In each round, a parameter is sampled from the posterior, and the arm that maximizes the expected reward under this sampled parameter is played.

Specifically, the Bayesian formulation involves the following ingredients (a minimal sketch of the resulting sampling loop appears after the list):

  • A set $\Theta$ of possible parameters $\tilde{\mu}$.
  • A prior distribution $P(\tilde{\mu})$ on these parameters.
  • Past observations $\mathcal{D}$, consisting of the contexts of played arms and the rewards observed.
  • A likelihood function $P(r \mid b, \tilde{\mu})$ for the reward $r$ given context $b$ and parameter $\tilde{\mu}$.
  • A posterior distribution $P(\tilde{\mu} \mid \mathcal{D}) \propto P(\mathcal{D} \mid \tilde{\mu})\, P(\tilde{\mu})$, updated using Bayes' theorem.
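
A minimal sketch of the Gaussian instantiation of this loop is given below. It assumes a simulated environment supplying contexts and rewards, and treats the posterior-width constant `v` as a free hyperparameter (the paper prescribes a specific choice of this constant in terms of $d$, $T$, and the confidence level); the function names are illustrative, not from the paper.

```python
import numpy as np

def linear_thompson_sampling(contexts, reward_fn, d, T, v=1.0, rng=None):
    """Gaussian Thompson Sampling for linear contextual bandits (sketch).

    contexts:  callable t -> array of shape (N, d), one context per arm
    reward_fn: callable (t, arm, context) -> observed scalar reward
    v:         posterior-width scale; treated here as a tunable constant
    """
    rng = np.random.default_rng() if rng is None else rng
    B = np.eye(d)        # B(t) = I_d + sum of b b^T over past plays
    f = np.zeros(d)      # sum of context * reward over past plays
    chosen, rewards = [], []

    for t in range(T):
        b = contexts(t)                                   # (N, d)
        mu_hat = np.linalg.solve(B, f)                    # posterior mean
        cov = v**2 * np.linalg.inv(B)                     # posterior covariance
        mu_tilde = rng.multivariate_normal(mu_hat, cov)   # sampled parameter
        arm = int(np.argmax(b @ mu_tilde))                # best arm under sample
        r = reward_fn(t, arm, b[arm])
        B += np.outer(b[arm], b[arm])                     # rank-one update
        f += r * b[arm]
        chosen.append(arm)
        rewards.append(r)
    return chosen, rewards

# Illustrative usage on a synthetic linear-payoff environment.
if __name__ == "__main__":
    d, N, T = 5, 10, 2000
    rng = np.random.default_rng(0)
    true_mu = rng.normal(size=d)
    ctx = lambda t: rng.normal(size=(N, d))
    rew = lambda t, a, b: float(b @ true_mu + 0.1 * rng.normal())
    _, rs = linear_thompson_sampling(ctx, rew, d, T, v=0.5, rng=rng)
    print("average reward:", np.mean(rs))
```

The pair $(B, f)$ summarizes all past observations, so both the posterior update and the arm selection take time polynomial in $d$ per round, consistent with the computational efficiency highlighted in the paper.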

Regret Analysis

The regret analysis of the proposed TS algorithm is one of the paper's most significant contributions. The derived regret bound is $\tilde{O}(d^{3/2}\sqrt{T})$, with a refined bound of $\tilde{O}(d\sqrt{T \log(N)})$ when the number of arms $N$ is finite. These bounds are within a factor of $\sqrt{d}$ (respectively $\sqrt{\log(N)}$) of the information-theoretic lower bound $\Omega(d\sqrt{T})$ for the contextual linear bandit problem. The analysis leverages novel martingale-based techniques to handle the complexities introduced by the contextual nature of the problem and the adaptive adversarial setting.

Key theoretical results include:

  • Establishing high probability bounds on the variance of estimates.
  • Proving concentration inequalities for the posterior distributions.
  • Utilizing these inequalities to derive the overall regret bound.
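
These concentration arguments are phrased in terms of the Gaussian posterior the algorithm maintains. Writing $b(\tau)$ and $r(\tau)$ for the context and reward of the arm played at time $\tau$, the quantities involved take (up to the paper's exact choice of the scaling constant $v$) the form

$$
B(t) = I_d + \sum_{\tau=1}^{t-1} b(\tau)\, b(\tau)^\top,
\qquad
\hat{\mu}(t) = B(t)^{-1} \sum_{\tau=1}^{t-1} b(\tau)\, r(\tau),
\qquad
\tilde{\mu}(t) \sim \mathcal{N}\!\big(\hat{\mu}(t),\, v^2 B(t)^{-1}\big).
$$

Roughly, the high probability bounds above control the deviation of $\hat{\mu}(t)$ from the true parameter in the norm induced by $B(t)$, which in turn limits how far the sampled $\tilde{\mu}(t)$ can stray in the directions that matter for arm selection.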

Practical Implications and Future Directions

The findings have significant practical applications, particularly in fields such as online advertising, personalized recommendations, and financial trading, where efficient decision-making under uncertainty is crucial. The efficiency of the TS algorithm in terms of computational complexity (running in polynomial time with respect to $d$) makes it a robust choice for large-scale industrial applications.

Future research directions might explore:

  • Extending the algorithm and analysis to other contextual bandit settings, such as those involving non-linear payoff functions.
  • Investigating the impact of delayed and batched feedback, which are common in many real-world applications.
  • Analyzing the performance of TS in the agnostic setting, where the realizability assumption is relaxed.

Conclusion

Shipra Agrawal and Navin Goyal's work provides a comprehensive theoretical foundation for using Thompson Sampling in the contextual bandit setting with linear payoffs. The introduction of martingale-based techniques in the regret analysis sets a precedent for future research in this domain. The strong regret bounds established in the paper reinforce Thompson Sampling's utility, bridging the gap between theoretical guarantees and practical performance.
