Abstract
We analyze the robustness of a knowledge gradient (KG) policy for the multi-armed bandit problem. The KG policy is based on a one-period look-ahead, which is known to underperform in other learning problems when the marginal value of information is non-concave. We present an adjustment that corrects for non-concavity and approximates a multi-step look-ahead, and compare its performance to the unadjusted KG policy and other heuristics. We provide guidance for determining when adjustment will improve performance, and when it is unnecessary. We present evidence suggesting that KG is generally robust in the multi-armed bandit setting, which argues in favor of KG as an alternative to index policies.
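For readers unfamiliar with the one-period look-ahead the abstract refers to, the following sketch computes the standard KG factor for independent normal beliefs with known Gaussian observation noise. This is the textbook formula, not the paper's adjusted multi-step policy; the function and variable names (`kg_factor`, `sigma_noise`) are illustrative assumptions, not from the paper.

```python
import math

def kg_factor(mu, sigma, sigma_noise):
    """One-period knowledge gradient (expected improvement from one
    measurement) for each arm, under independent normal beliefs
    N(mu[i], sigma[i]**2) and observation noise std sigma_noise.
    Standard formula for illustration only."""
    n = len(mu)
    kg = []
    for i in range(n):
        # Predictive std of the change in the posterior mean of arm i
        # after one observation: sigma_i^2 / sqrt(sigma_i^2 + noise^2)
        sigma_tilde = sigma[i] ** 2 / math.sqrt(sigma[i] ** 2 + sigma_noise ** 2)
        # Best competing posterior mean among the other arms
        best_other = max(mu[j] for j in range(n) if j != i)
        z = -abs(mu[i] - best_other) / sigma_tilde
        # f(z) = z * Phi(z) + phi(z): normal expected-improvement shape
        phi = math.exp(-z * z / 2) / math.sqrt(2 * math.pi)
        Phi = 0.5 * (1 + math.erf(z / math.sqrt(2)))
        kg.append(sigma_tilde * (z * Phi + phi))
    return kg
```

In the online (bandit) setting, a KG policy of this type measures the arm maximizing current reward plus the look-ahead value, e.g. `mu[i] + remaining_horizon * kg[i]`; the paper's adjustment modifies this one-period term when the marginal value of information is non-concave.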
| Original language | English (US) |
|---|---|
| Pages (from-to) | 1635-1644 |
| Number of pages | 10 |
| Journal | Procedia Computer Science |
| Volume | 1 |
| Issue number | 1 |
| DOIs | |
| State | Published - 2010 |
All Science Journal Classification (ASJC) codes
- General Computer Science
Keywords
- Bayesian learning
- Knowledge gradient
- Multi-armed bandit
- Optimal learning