Contextual bandit algorithms with supervised learning guarantees

Alina Beygelzimer, John Langford, Lihong Li, Lev Reyzin, Robert E. Schapire

Research output: Contribution to journalConference articlepeer-review

134 Scopus citations


We address the problem of competing with any large set of N policies in the non-stochastic bandit setting, where the learner must repeatedly select among K actions but observes only the reward of the chosen action. We present a modification of the Exp4 algorithm of Auer et al. [2], called Exp4.P, which with high probability incurs regret at most O(√KT lnN). Such a bound does not hold for Exp4 due to the large variance of the importance-weighted estimates used in the algorithm. The new algorithm is tested empirically in a large-scale, real-world dataset. For the stochastic version of the problem, we can use Exp4.P as a subroutine to compete with a possibly infinite set of policies of VC-dimension d while incurring regret at most O(√Td ln T) with high probability. These guarantees improve on those of all previous algorithms, whether in a stochastic or adversarial environment, and bring us closer to providing guarantees for this setting that are comparable to those in standard supervised learning.

Original languageEnglish (US)
Pages (from-to)19-26
Number of pages8
JournalJournal of Machine Learning Research
StatePublished - 2011
Event14th International Conference on Artificial Intelligence and Statistics, AISTATS 2011 - Fort Lauderdale, FL, United States
Duration: Apr 11 2011Apr 13 2011

All Science Journal Classification (ASJC) codes

  • Software
  • Control and Systems Engineering
  • Statistics and Probability
  • Artificial Intelligence


Dive into the research topics of 'Contextual bandit algorithms with supervised learning guarantees'. Together they form a unique fingerprint.

Cite this