
Thompson Sampling in Partially Observable Contextual Bandits (2402.10289v1)

Published 15 Feb 2024 in stat.ML and cs.LG

Abstract: Contextual bandits constitute a classical framework for decision-making under uncertainty. In this setting, the goal is to learn the arms of highest reward subject to contextual information, while the unknown reward parameters of each arm need to be learned by experimenting with that specific arm. Accordingly, a fundamental problem is that of balancing exploration (i.e., pulling different arms to learn their parameters), versus exploitation (i.e., pulling the best arms to gain reward). To study this problem, the existing literature mostly considers perfectly observed contexts. However, the setting of partial context observations remains unexplored to date, despite being theoretically more general and practically more versatile. We study bandit policies for learning to select optimal arms based on the data of observations, which are noisy linear functions of the unobserved context vectors. Our theoretical analysis shows that the Thompson sampling policy successfully balances exploration and exploitation. Specifically, we establish the following: (i) regret bounds that grow poly-logarithmically with time, (ii) square-root consistency of parameter estimation, and (iii) scaling of the regret with other quantities including dimensions and number of arms. Extensive numerical experiments with both real and synthetic data are presented as well, corroborating the efficacy of Thompson sampling. To establish the results, we introduce novel martingale techniques and concentration inequalities to address partially observed dependent random variables generated from unspecified distributions, and also leverage problem-dependent information to sharpen probabilistic bounds for time-varying suboptimality gaps. These techniques pave the way toward studying other decision-making problems with contextual information as well as partial observations.


Summary

  • The paper introduces a Thompson sampling approach for contextual bandits with noisy, partially observable contexts, achieving regret that grows only poly-logarithmically with time.
  • It pairs a noisy linear observation model with a hypothetical posterior distribution from which reward parameters are sampled and updated under incomplete context data.
  • The method is sample efficient and consistent, as validated by theoretical regret bounds and empirical comparisons against regression-oracle and Greedy baselines.

Thompson Sampling in Partially Observable Contextual Bandits

Introduction

The paper "Thompson Sampling in Partially Observable Contextal Bandits" (2402.10289) introduces a novel approach toward decision-making in the contextual bandit framework, where context observations are not perfect. This work distinguishes itself by focusing on situations where only partial, noisy, or transformed observations of contextual information are available, highlighting the practical applicability in areas like robotics, image processing, and healthcare, where perfect information is rarely accessible.

The paper advances existing methodologies by investigating the Thompson sampling strategy within this partially observable context, providing not only theoretical insights but also practical implementations. Key contributions include demonstrating that Thompson sampling maintains effectiveness even with partial observability, evidenced by bounded regret and consistent parameter estimation.

Problem Formulation

The fundamental challenge addressed in this work is balancing exploration and exploitation when making decisions under uncertainty with incomplete context information. The reward of an arm at any time is governed by the inner product of an unobserved context vector and an arm-specific parameter vector, plus stochastic noise:

$$r_i(t) = x_i(t)^\top \mu_i + \varepsilon_i(t)$$

where $x_i(t)$ is the latent context vector for arm $i$, $\mu_i$ is the unknown parameter vector for arm $i$, and $\varepsilon_i(t)$ is the stochastic reward noise.
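
As a concrete illustration, the following minimal Python sketch simulates this reward model; the dimensions, Gaussian distributions, and noise scale are assumptions made for the example rather than values taken from the paper.

```python
import numpy as np

# Minimal sketch of the reward model r_i(t) = x_i(t)^T mu_i + eps_i(t).
# Dimensions, distributions, and the noise scale are illustrative assumptions.
rng = np.random.default_rng(0)
d_x, n_arms = 5, 3

mu = rng.normal(size=(n_arms, d_x))     # unknown arm parameters mu_i
x = rng.normal(size=(n_arms, d_x))      # latent context vectors x_i(t), unobserved by the policy
eps = 0.1 * rng.normal(size=n_arms)     # stochastic reward noise eps_i(t)

rewards = np.einsum("ij,ij->i", x, mu) + eps   # r_i(t) for every arm i
print(rewards)
```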

Proposed Method

The authors adapt Thompson sampling so that it works effectively with partially observable contexts. This involves:

  1. Observation Model: The policy observes noisy linear transformations of the context vectors, $y_i(t) = A x_i(t) + \xi_i(t)$, where $A$ is an unknown sensing matrix and $\xi_i(t)$ is observation noise.
  2. Hypothetical Posterior: To circumvent the lack of full context observability, the paper proposes using a hypothetical posterior distribution for the reward parameters based on the observed data, updating belief distributions through:

$$r_i(t) \sim \mathcal{N}\big(y_i(t)^\top \eta_i,\, v^2\big)$$

where $\eta_i = D^\top \mu_i$ and $D$ is a transformation matrix.

  3. Algorithm Implementation: The Thompson sampling policy iteratively samples from the posterior distribution and selects the arm maximizing the expected reward under the sampled parameters (a minimal sketch follows below).
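
To make the scheme concrete, here is a minimal, self-contained sketch of the loop under simplifying assumptions: Gaussian contexts and noise, a conjugate Gaussian posterior over each $\eta_i$, and illustrative dimensions and variances. It is not the paper's exact algorithm or hyperparameter choice, only a plausible instantiation of the steps above.

```python
import numpy as np

rng = np.random.default_rng(1)
d_x, d_y, n_arms, T = 5, 4, 3, 2000

# Assumed data-generating process (illustrative values, not the paper's setup).
A = rng.normal(size=(d_y, d_x))            # unknown sensing matrix
mu = rng.normal(size=(n_arms, d_x))        # unknown arm parameters mu_i
sigma_xi, sigma_eps, v = 0.1, 0.1, 1.0     # observation noise, reward noise, posterior scale

# Per-arm Gaussian "hypothetical posterior" over eta_i, maintained as a
# Bayesian linear regression of rewards on the observed vectors y_i(t).
B = [np.eye(d_y) for _ in range(n_arms)]   # posterior precision matrices
f = [np.zeros(d_y) for _ in range(n_arms)] # precision-weighted posterior means

for t in range(T):
    x = rng.normal(size=(n_arms, d_x))                        # latent contexts (never seen)
    y = x @ A.T + sigma_xi * rng.normal(size=(n_arms, d_y))   # noisy observations

    # Thompson sampling step: draw eta_i from each arm's posterior and
    # pick the arm maximizing the sampled expected reward y_i(t)^T eta_i.
    scores = []
    for i in range(n_arms):
        mean_i = np.linalg.solve(B[i], f[i])
        cov_i = v**2 * np.linalg.inv(B[i])
        eta_sample = rng.multivariate_normal(mean_i, cov_i)
        scores.append(y[i] @ eta_sample)
    a = int(np.argmax(scores))

    # Only the chosen arm's reward is observed; update that arm's posterior.
    r = x[a] @ mu[a] + sigma_eps * rng.normal()
    B[a] += np.outer(y[a], y[a])
    f[a] += r * y[a]
```

Note that only the posterior of the arm actually pulled is updated each round, which is precisely the exploration-exploitation tension the regret analysis quantifies.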

Theoretical Analysis

Key theoretical results are derived, providing assurances about the method's performance:

  1. Regret Bound: A central result is a regret bound that grows poly-logarithmically with time, indicating that performance is preserved efficiently even under partial observability.

Figure 1: Plots of $\mathrm{Regret}(t)/(\log t)^2$ across various context dimensions $d_x$ and $d_y$.

  2. Estimation Consistency: The estimate of $\eta_i$ achieves square-root consistency with respect to the number of times an arm is chosen, demonstrating robustness in learning the underlying model parameters (an illustrative check follows this list).
  3. Sample Efficiency: Explicit bounds are provided for the minimum number of samples needed before reliable estimates are achieved, along with conditions tailored for high-probability performance guarantees.
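
As a rough illustration of the square-root consistency claim, the sketch below estimates $\eta$ for a single arm by ridge regression of rewards on observations and reports how the estimation error shrinks as the sample count grows. The data-generating process, the closed form used for the "true" $\eta$ (standard Gaussian conditioning with identity context covariance), and all numerical values are assumptions for this example, not the paper's construction.

```python
import numpy as np

# Illustrative consistency check for a single arm: estimate eta by ridge
# regression of rewards on observations y and watch the error shrink with n.
rng = np.random.default_rng(2)
d_x, d_y = 5, 4
A = rng.normal(size=(d_y, d_x))          # sensing matrix (assumed)
mu = rng.normal(size=d_x)                # reward parameter of the arm (assumed)
sigma_xi, sigma_eps = 0.1, 0.1

# With x ~ N(0, I) and y = A x + xi, Gaussian conditioning gives
# E[r | y] = y^T eta with eta = (A A^T + sigma_xi^2 I)^{-1} A mu.
eta_true = np.linalg.solve(A @ A.T + sigma_xi**2 * np.eye(d_y), A @ mu)

for n in (100, 1_000, 10_000):
    x = rng.normal(size=(n, d_x))
    y = x @ A.T + sigma_xi * rng.normal(size=(n, d_y))
    r = x @ mu + sigma_eps * rng.normal(size=n)
    eta_hat = np.linalg.solve(y.T @ y + np.eye(d_y), y.T @ r)   # ridge estimate
    print(n, np.linalg.norm(eta_hat - eta_true))                # roughly O(n^{-1/2}) decay
```

The printed errors should decay roughly like $1/\sqrt{n}$, mirroring the square-root rate established for the per-arm estimates.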

Experimental Evaluation

The efficacy of the proposal is validated on real-world data, including eye-movement and EEG datasets, by comparing the decision rates of Thompson sampling against a regression oracle.

Figure 2: Plots of normalized estimation errors as a function of time for multiple arm scenarios.

Figure 3: Comparative regret trajectories between Thompson sampling and Greedy algorithms with varying arm counts.

Conclusion

This work enriches the bandit literature by demonstrating Thompson sampling's versatility when context observations are imperfect. The empirical and theoretical findings show that Thompson sampling remains a competitive choice, offering robust performance under uncertainty. Future work could study adaptive algorithms under more complex observation structures or nonlinear observation models to broaden applicability in practice.

Together, the theoretical guarantees and empirical results establish a solid foundation for subsequent work on adaptive decision-making systems operating under informational constraints.
