Abstract
We consider regret minimization in adversarial deterministic Markov Decision Processes (ADMDPs) with bandit feedback. We devise a new algorithm that pushes the state of the art forward in two ways: First, it attains a regret of O(T^{2/3}) with respect to the best fixed policy in hindsight, whereas the previous best regret bound was O(T^{3/4}). Second, the algorithm and its analysis are compatible with any feasible ADMDP graph topology, while all previous approaches imposed additional restrictions on the topology.
Original language | English (US) |
---|---|
Pages | 1712-1720 |
Number of pages | 9 |
State | Published - 2013 |
Externally published | Yes |
Event | 30th International Conference on Machine Learning, ICML 2013 - Atlanta, GA, United States |
Duration | Jun 16 2013 → Jun 21 2013 |
Other
Other | 30th International Conference on Machine Learning, ICML 2013 |
---|---|
Country/Territory | United States |
City | Atlanta, GA |
Period | 6/16/13 → 6/21/13 |
All Science Journal Classification (ASJC) codes
- Human-Computer Interaction
- Sociology and Political Science