TY - GEN
T1 - Context-lumpable stochastic bandits
AU - Lee, Chung Wei
AU - Liu, Qinghua
AU - Abbasi-Yadkori, Yasin
AU - Jin, Chi
AU - Lattimore, Tor
AU - Szepesvári, Csaba
N1 - Publisher Copyright:
© 2023 Neural information processing systems foundation. All rights reserved.
PY - 2023
Y1 - 2023
N2 - We consider a contextual bandit problem with S contexts and K actions. In each round t = 1, 2, ..., the learner observes a random context and chooses an action based on its past experience. The learner then observes a random reward whose mean is a function of the context and the action for the round. Under the assumption that the contexts can be lumped into r ≤ min{S, K} groups such that the mean reward for the various actions is the same for any two contexts that are in the same group, we give an algorithm that outputs an ε-optimal policy after using at most Õ(r(S + K)/ε²) samples with high probability and provide a matching Ω(r(S + K)/ε²) lower bound. In the regret minimization setting, we give an algorithm whose cumulative regret up to time T is bounded by Õ(√(poly(r)(S + K)T)). To the best of our knowledge, we are the first to show the near-optimal sample complexity in the PAC setting and Õ(√(poly(r)(S + K)T)) minimax regret in the online setting for this problem. We also show that our algorithms can be applied to more general low-rank bandits and get improved regret bounds in some scenarios.
AB - We consider a contextual bandit problem with S contexts and K actions. In each round t = 1, 2, ..., the learner observes a random context and chooses an action based on its past experience. The learner then observes a random reward whose mean is a function of the context and the action for the round. Under the assumption that the contexts can be lumped into r ≤ min{S, K} groups such that the mean reward for the various actions is the same for any two contexts that are in the same group, we give an algorithm that outputs an ε-optimal policy after using at most Õ(r(S + K)/ε²) samples with high probability and provide a matching Ω(r(S + K)/ε²) lower bound. In the regret minimization setting, we give an algorithm whose cumulative regret up to time T is bounded by Õ(√(poly(r)(S + K)T)). To the best of our knowledge, we are the first to show the near-optimal sample complexity in the PAC setting and Õ(√(poly(r)(S + K)T)) minimax regret in the online setting for this problem. We also show that our algorithms can be applied to more general low-rank bandits and get improved regret bounds in some scenarios.
UR - https://www.scopus.com/pages/publications/85187416364
UR - https://www.scopus.com/pages/publications/85187416364#tab=citedBy
M3 - Conference contribution
AN - SCOPUS:85187416364
T3 - Advances in Neural Information Processing Systems
BT - Advances in Neural Information Processing Systems 36 - 37th Conference on Neural Information Processing Systems, NeurIPS 2023
A2 - Oh, A.
A2 - Neumann, T.
A2 - Globerson, A.
A2 - Saenko, K.
A2 - Hardt, M.
A2 - Levine, S.
PB - Neural information processing systems foundation
T2 - 37th Conference on Neural Information Processing Systems, NeurIPS 2023
Y2 - 10 December 2023 through 16 December 2023
ER -