A simple multi-armed bandit algorithm with optimal variation-bounded regret

Elad Hazan, Satyen Kale

Research output: Contribution to journalConference articlepeer-review

4 Scopus citations

Abstract

We pose the question of whether it is possible to design a simple, linear-time algorithm for the basic multi-armed bandit problem in the adversarial setting which has a regret bound of O(√Q logT), where Q is the total quadratic variation of all the arms. We are interested in the fundamental multi-armed bandit (MAB) problem: iteratively at times t = 1,2,..., T the decision maker has to choose (possibly randomly) one of n arms, it, and then receives the payoff of the arm, assumed to be in the range [0, 1]. The payoffs are constructed adversarially, as in (Auer et al., 2003), and we denote the payoff at time t for arm i by ft{i) ∈ [0,1]. The decision maker can only see her own payoff, and does not have access to the entire payoff vector ft (otherwise it would have been the usual "experts" problem"). The goal is to minimize regret:

Original languageEnglish (US)
Pages (from-to)817-819
Number of pages3
JournalJournal of Machine Learning Research
Volume19
StatePublished - Jan 1 2011
Externally publishedYes
Event24th International Conference on Learning Theory, COLT 2011 - Budapest, Hungary
Duration: Jul 9 2011Jul 11 2011

All Science Journal Classification (ASJC) codes

  • Software
  • Control and Systems Engineering
  • Statistics and Probability
  • Artificial Intelligence

Fingerprint Dive into the research topics of 'A simple multi-armed bandit algorithm with optimal variation-bounded regret'. Together they form a unique fingerprint.

Cite this