Abstract
We pose the question of whether it is possible to design a simple, linear-time algorithm for the basic multi-armed bandit problem in the adversarial setting which has a regret bound of O(√(Q log T)), where Q is the total quadratic variation of all the arms. We are interested in the fundamental multi-armed bandit (MAB) problem: iteratively at times t = 1, 2, ..., T the decision maker has to choose (possibly randomly) one of n arms, i_t, and then receives the payoff of that arm, assumed to be in the range [0, 1]. The payoffs are constructed adversarially, as in (Auer et al., 2003), and we denote the payoff at time t for arm i by f_t(i) ∈ [0, 1]. The decision maker can only see her own payoff, and does not have access to the entire payoff vector f_t (otherwise it would have been the usual "experts" problem). The goal is to minimize the regret, i.e. the expected shortfall relative to the best fixed arm in hindsight: Regret = max_{i ∈ [n]} Σ_{t=1}^T f_t(i) − E[Σ_{t=1}^T f_t(i_t)].
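For concreteness, the following is a minimal sketch (not from the paper) of the bandit protocol and of the regret quantity above, using the EXP3 algorithm of Auer et al. (2003) as an illustrative baseline; EXP3 attains O(√(nT log n)) regret, not the variation-dependent O(√(Q log T)) bound asked about here, and the function names and parameters below are illustrative assumptions.

```python
import numpy as np

def exp3(payoffs, gamma=0.1, rng=None):
    """Illustrative EXP3 baseline on a T x n matrix of adversarial
    payoffs in [0, 1]; returns the sequence of chosen arms i_t.
    Only the chosen arm's payoff is observed (bandit feedback)."""
    rng = rng or np.random.default_rng(0)
    T, n = payoffs.shape
    weights = np.ones(n)
    choices = np.empty(T, dtype=int)
    for t in range(T):
        # Mix the exponential-weights distribution with uniform exploration.
        probs = (1 - gamma) * weights / weights.sum() + gamma / n
        i = rng.choice(n, p=probs)
        choices[t] = i
        # Importance-weighted estimate of the chosen arm's payoff.
        estimate = payoffs[t, i] / probs[i]
        weights[i] *= np.exp(gamma * estimate / n)
    return choices

def regret(payoffs, choices):
    """Regret against the best fixed arm in hindsight:
    max_i sum_t f_t(i) - sum_t f_t(i_t)."""
    best_fixed = payoffs.sum(axis=0).max()
    obtained = payoffs[np.arange(len(choices)), choices].sum()
    return best_fixed - obtained
```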
| Original language | English (US) |
| --- | --- |
| Pages (from-to) | 817-819 |
| Number of pages | 3 |
| Journal | Journal of Machine Learning Research |
| Volume | 19 |
| State | Published - 2011 |
| Externally published | Yes |
| Event | 24th International Conference on Learning Theory, COLT 2011 - Budapest, Hungary; Jul 9 2011 → Jul 11 2011 |
All Science Journal Classification (ASJC) codes
- Software
- Control and Systems Engineering
- Statistics and Probability
- Artificial Intelligence