Stochastic Contextual Bandits with Known Reward Functions (1605.00176v2)

Published 30 Apr 2016 in cs.LG

Abstract: Many sequential decision-making problems in communication networks can be modeled as contextual bandit problems, which are natural extensions of the well-known multi-armed bandit problem. In contextual bandit problems, at each time, an agent observes some side information or context, pulls one arm and receives the reward for that arm. We consider a stochastic formulation where the context-reward tuples are independently drawn from an unknown distribution in each trial. Motivated by networking applications, we analyze a setting where the reward is a known non-linear function of the context and the chosen arm's current state. We first consider the case of discrete and finite context-spaces and propose DCB($\epsilon$), an algorithm that we prove, through a careful analysis, yields regret (cumulative reward gap compared to a distribution-aware genie) scaling logarithmically in time and linearly in the number of arms that are not optimal for any context, improving over existing algorithms where the regret scales linearly in the total number of arms. We then study continuous context-spaces with Lipschitz reward functions and propose CCB($\epsilon, \delta$), an algorithm that uses DCB($\epsilon$) as a subroutine. CCB($\epsilon, \delta$) reveals a novel regret-storage trade-off that is parametrized by $\delta$. Tuning $\delta$ to the time horizon allows us to obtain sub-linear regret bounds, while requiring sub-linear storage. By exploiting joint learning for all contexts, we obtain regret bounds for CCB($\epsilon, \delta$) that are unachievable by any existing contextual bandit algorithm for continuous context-spaces. We also show similar performance bounds for the unknown horizon case.
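
The structural assumption doing the work in this abstract is that the reward is a known function of the context and the pulled arm's state: an observed state can be "replayed" under any context, so a single pull informs the reward estimate for every context at once, which is what enables joint learning across contexts. The sketch below illustrates that idea with a generic UCB-style index. It is a minimal illustration under stated assumptions, not the paper's DCB($\epsilon$) algorithm; `known_reward`, the Gaussian arm-state model, and the confidence bonus are all hypothetical choices made for the example.

```python
import math
import random
from collections import defaultdict

def known_reward(context, state):
    # Hypothetical known, non-linear reward function f(context, state).
    # The paper assumes f is known to the agent (and Lipschitz in the
    # continuous-context setting); this particular form is made up.
    return math.tanh(context * state)

class KnownRewardUCB:
    """UCB-style learner that exploits a known reward function.

    State samples gathered for an arm are shared across all contexts,
    since f lets us evaluate any (context, state) pair after the fact.
    """

    def __init__(self, n_arms, f):
        self.f = f                        # known reward function
        self.samples = defaultdict(list)  # observed states per arm
        self.n_arms = n_arms
        self.t = 0                        # rounds played so far

    def select_arm(self, context):
        self.t += 1
        best_arm, best_index = None, -float("inf")
        for arm in range(self.n_arms):
            obs = self.samples[arm]
            if not obs:
                return arm                # pull each arm at least once
            # Estimated mean reward under *this* context, computed from
            # states observed under possibly different contexts -- only
            # possible because f is known.
            mean = sum(self.f(context, s) for s in obs) / len(obs)
            bonus = math.sqrt(2.0 * math.log(self.t) / len(obs))
            if mean + bonus > best_index:
                best_arm, best_index = arm, mean + bonus
        return best_arm

    def update(self, arm, state):
        self.samples[arm].append(state)

# Example run: per-arm state distributions are unknown to the learner.
bandit = KnownRewardUCB(n_arms=3, f=known_reward)
state_means = [0.2, 0.8, 0.5]            # hidden from the agent
for _ in range(1000):
    x = random.random()                  # i.i.d. context each round
    arm = bandit.select_arm(x)
    state = random.gauss(state_means[arm], 0.1)
    bandit.update(arm, state)
```

For continuous context-spaces, the abstract only says that CCB($\epsilon, \delta$) invokes DCB($\epsilon$) as a subroutine; a plausible (hedged) reading is that $\delta$ controls how finely the context space is quantized before the discrete algorithm is applied, so a finer grid shrinks the Lipschitz approximation error in the regret while requiring more stored per-cell statistics. That is the regret-storage trade-off the abstract highlights, with $\delta$ tuned to the horizon to keep both quantities sub-linear.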

Citations (13)