Efficient and Adaptive Posterior Sampling Algorithms for Bandits (2405.01010v1)

Published 2 May 2024 in cs.LG and stat.ML

Abstract: We study Thompson Sampling-based algorithms for stochastic bandits with bounded rewards. As the existing problem-dependent regret bound for Thompson Sampling with Gaussian priors [Agrawal and Goyal, 2017] is vacuous when $T \le 288 e^{64}$, we derive a more practical bound that tightens the coefficient of the leading term from $288 e^{64}$ to $1270$. Additionally, motivated by large-scale real-world applications that require scalability, adaptive computational resource allocation, and a balance in utility and computation, we propose two parameterized Thompson Sampling-based algorithms: Thompson Sampling with Model Aggregation (TS-MA-$\alpha$) and Thompson Sampling with Timestamp Duelling (TS-TD-$\alpha$), where $\alpha \in [0,1]$ controls the trade-off between utility and computation. Both algorithms achieve an $O\left(K\ln^{\alpha+1}(T)/\Delta \right)$ regret bound, where $K$ is the number of arms, $T$ is the finite learning horizon, and $\Delta$ denotes the single round performance loss when pulling a sub-optimal arm.
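For context, the sketch below shows standard Thompson Sampling with Gaussian priors, the baseline whose regret bound [Agrawal and Goyal, 2017] the paper tightens; it is not an implementation of the paper's TS-MA-$\alpha$ or TS-TD-$\alpha$ variants. The arm means and horizon are illustrative assumptions.

```python
# Minimal sketch of Thompson Sampling with Gaussian posteriors for bounded
# rewards (the baseline setting of Agrawal and Goyal, 2017). Arm means and
# the horizon T below are hypothetical choices for illustration only.
import numpy as np

rng = np.random.default_rng(0)
true_means = np.array([0.3, 0.5, 0.7])   # hypothetical Bernoulli arm means
K, T = len(true_means), 10_000

counts = np.zeros(K)        # n_i: number of pulls of arm i
emp_means = np.zeros(K)     # empirical mean reward of arm i
regret = 0.0

for t in range(T):
    # Sample an index theta_i ~ N(mu_hat_i, 1 / (n_i + 1)) for each arm.
    theta = rng.normal(emp_means, 1.0 / np.sqrt(counts + 1.0))
    arm = int(np.argmax(theta))

    reward = float(rng.random() < true_means[arm])   # bounded reward in {0, 1}
    counts[arm] += 1
    emp_means[arm] += (reward - emp_means[arm]) / counts[arm]
    regret += true_means.max() - true_means[arm]

print(f"cumulative pseudo-regret after {T} rounds: {regret:.1f}")
```

The cumulative pseudo-regret printed at the end grows logarithmically in $T$ for this baseline; the paper's parameterized variants trade extra $\ln^{\alpha}(T)$ factors in the bound for reduced computation.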

References (17)
  1. Near-optimal regret bounds for Thompson Sampling. http://www.columbia.edu/~sa3305/papers/j3-corrected.pdf, 2017.
  2. Tuning bandit algorithms in stochastic environments. In International Conference on Algorithmic Learning Theory, pages 150–165. Springer, 2007.
  3. UCB revisited: Improved regret bounds for the stochastic multi-armed bandit problem. Periodica Mathematica Hungarica, 61(1-2):55–65, 2010.
  4. Finite-time analysis of the multi-armed bandit problem. Machine Learning, 47:235–256, 2002.
  5. From optimality to robustness: Adaptive re-sampling strategies in stochastic bandits. Advances in Neural Information Processing Systems, 34:14029–14041, 2021.
  6. Maillard sampling: Boltzmann exploration done optimally. In International Conference on Artificial Intelligence and Statistics, pages 54–72. PMLR, 2022.
  7. The KL-UCB algorithm for bounded stochastic bandits and beyond. In Proceedings of the 24th Annual Conference on Learning Theory, pages 359–376. JMLR Workshop and Conference Proceedings, 2011.
  8. An asymptotically optimal bandit algorithm for bounded support models. In COLT, pages 67–79. Citeseer, 2010.
  9. Non-asymptotic analysis of a new bandit algorithm for semi-bounded rewards. Journal of Machine Learning Research, 16:3721–3756, 2015.
  10. MOTS: Minimax optimal Thompson Sampling. In International Conference on Machine Learning, pages 5074–5083. PMLR, 2021.
  11. Finite-time regret of Thompson Sampling algorithms for exponential family multi-armed bandits. Advances in Neural Information Processing Systems, 35:38475–38487, 2022.
  12. Thompson Sampling with less exploration is fast and optimal. 2023.
  13. On Bayesian upper confidence bounds for bandit problems. In Artificial Intelligence and Statistics, pages 592–600. PMLR, 2012a.
  14. Thompson Sampling: An asymptotically optimal finite-time analysis. In Algorithmic Learning Theory: 23rd International Conference, ALT 2012, Lyon, France, October 29–31, 2012. Proceedings 23, pages 199–213. Springer, 2012b.
  15. Tor Lattimore. Refining the confidence level for optimistic bandit strategies. The Journal of Machine Learning Research, 19(1):765–796, 2018.
  16. A minimax and asymptotically optimal algorithm for stochastic bandits. In International Conference on Algorithmic Learning Theory, pages 223–237. PMLR, 2017.
  17. Bandit algorithms based on Thompson Sampling for bounded reward distributions. In Algorithmic Learning Theory, pages 777–826. PMLR, 2020.