TY - GEN

T1 - Better algorithms for benign bandits

AU - Hazan, Elad

AU - Kale, Satyen

N1 - Copyright:
Copyright 2020 Elsevier B.V., All rights reserved.

PY - 2009

Y1 - 2009

N2 - The online multi-armed bandit problem and its generalizations are repeated decision making problems, where the goal is to select one of several possible decisions in every round, and incur a cost associated with the decision, in such a way that the total cost incurred over all iterations is close to the cost of the best fixed decision in hindsight. The difference in these costs is known as the regret of the algorithm. The term bandit refers to the setting where one only obtains the cost of the decision used in a given iteration and no other information. Perhaps the most general form of this problem is the non-stochastic bandit linear optimization problem, where the set of decisions is a convex set in some Euclidean space, and the cost functions are linear. Only recently an efficient algorithm attaining Õ(√T) regret was discovered in this setting. In this paper we propose a new algorithm for the bandit linear optimization problem which obtains a regret bound of Õ(√Q), where Q is the total variation in the cost functions. This regret bound, previously conjectured to hold in the full information case, shows that it is possible to incur much less regret in a slowly changing environment even in the bandit setting. Our algorithm is efficient and applies several new ideas to bandit optimization such as reservoir sampling.

UR - http://www.scopus.com/inward/record.url?scp=70349128132&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=70349128132&partnerID=8YFLogxK

U2 - 10.1137/1.9781611973068.5

DO - 10.1137/1.9781611973068.5

M3 - Conference contribution

AN - SCOPUS:70349128132

SN - 9780898716801

T3 - Proceedings of the Annual ACM-SIAM Symposium on Discrete Algorithms

SP - 38

EP - 47

BT - Proceedings of the 20th Annual ACM-SIAM Symposium on Discrete Algorithms

PB - Society for Industrial and Applied Mathematics (SIAM)

T2 - 20th Annual ACM-SIAM Symposium on Discrete Algorithms

Y2 - 4 January 2009 through 6 January 2009

ER -