Papers
Topics
Authors
Recent
Detailed Answer
Quick Answer
Concise responses based on abstracts only
Detailed Answer
Well-researched responses based on abstracts and relevant paper content.
Custom Instructions Pro
Preferences or requirements that you'd like Emergent Mind to consider when generating responses
Gemini 2.5 Flash
Gemini 2.5 Flash 44 tok/s
Gemini 2.5 Pro 41 tok/s Pro
GPT-5 Medium 13 tok/s Pro
GPT-5 High 15 tok/s Pro
GPT-4o 86 tok/s Pro
Kimi K2 208 tok/s Pro
GPT OSS 120B 447 tok/s Pro
Claude Sonnet 4 36 tok/s Pro
2000 character limit reached

First-Order Bayesian Regret Analysis of Thompson Sampling (1902.00681v3)

Published 2 Feb 2019 in cs.LG and stat.ML

Abstract: We address online combinatorial optimization when the player has a prior over the adversary's sequence of losses. In this framework, Russo and Van Roy proposed an information-theoretic analysis of Thompson Sampling based on the information ratio, resulting in optimal worst-case regret bounds. In this paper we introduce three novel ideas to this line of work. First we propose a new quantity, the scale-sensitive information ratio, which allows us to obtain more refined first-order regret bounds (i.e., bounds of the form $\sqrt{L*}$ where $L*$ is the loss of the best combinatorial action). Second we replace the entropy over combinatorial actions by a coordinate entropy, which allows us to obtain the first optimal worst-case bound for Thompson Sampling in the combinatorial setting. Finally, we introduce a novel link between Bayesian agents and frequentist confidence intervals. Combining these ideas we show that the classical multi-armed bandit first-order regret bound $\tilde{O}(\sqrt{d L*})$ still holds true in the more challenging and more general semi-bandit scenario. This latter result improves the previous state of the art bound $\tilde{O}(\sqrt{(d+m3)L*})$ by Lykouris, Sridharan and Tardos. Moreover we sharpen these results with two technical ingredients. The first leverages a recent insight of Zimmert and Lattimore to replace Shannon entropy with more refined potential functions in the analysis. The second is a \emph{Thresholded} Thompson sampling algorithm, which slightly modifies the original algorithm by never playing low-probability actions. This thresholding results in fully $T$-independent regret bounds when $L*$ is almost surely upper-bounded, which we show does not hold for ordinary Thompson sampling.

Citations (16)

Summary

We haven't generated a summary for this paper yet.

List To Do Tasks Checklist Streamline Icon: https://streamlinehq.com

Collections

Sign up for free to add this paper to one or more collections.

Lightbulb On Streamline Icon: https://streamlinehq.com

Continue Learning

We haven't generated follow-up questions for this paper yet.