## Abstract

We pose the question of whether it is possible to design a simple, linear-time algorithm for the basic multi-armed bandit problem in the adversarial setting which has a regret bound of O(√Q logT), where Q is the total quadratic variation of all the arms. We are interested in the fundamental multi-armed bandit (MAB) problem: iteratively at times t = 1,2,..., T the decision maker has to choose (possibly randomly) one of n arms, it, and then receives the payoff of the arm, assumed to be in the range [0, 1]. The payoffs are constructed adversarially, as in (Auer et al., 2003), and we denote the payoff at time t for arm i by f_{t}{i) ∈ [0,1]. The decision maker can only see her own payoff, and does not have access to the entire payoff vector f_{t} (otherwise it would have been the usual "experts" problem"). The goal is to minimize regret:

Original language | English (US) |
---|---|

Pages (from-to) | 817-819 |

Number of pages | 3 |

Journal | Journal of Machine Learning Research |

Volume | 19 |

State | Published - Jan 1 2011 |

Externally published | Yes |

Event | 24th International Conference on Learning Theory, COLT 2011 - Budapest, Hungary Duration: Jul 9 2011 → Jul 11 2011 |

## All Science Journal Classification (ASJC) codes

- Software
- Control and Systems Engineering
- Statistics and Probability
- Artificial Intelligence