Abstract
In this paper we consider the problem of computing an ε-optimal policy of a discounted Markov Decision Process (DMDP) provided we can only access its transition function through a generative sampling model that, given any state-action pair, samples from the transition function in O(1) time. Given such a DMDP with states S, actions A, discount factor γ ∈ (0, 1), and rewards in range [0, 1], we provide an algorithm which computes an ε-optimal policy with probability 1 − δ where both the time spent and number of samples taken are upper bounded by O( |S||A|(1 − γ)⁻³ε⁻² · log(|S||A| / ((1 − γ)δε)) · log(1 / ((1 − γ)ε)) ). For fixed values of ε, this improves upon the previous best known bounds by a factor of (1 − γ)⁻¹ and matches the sample complexity lower bounds proved in [AMK13] up to logarithmic factors. We also extend our method to computing ε-optimal policies for finite-horizon MDPs with a generative model and provide a nearly matching sample complexity lower bound.
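To make the generative-model setting concrete, the sketch below shows a hypothetical sampling oracle together with a naive Monte Carlo Bellman backup that replaces the expectation over next states with an average of m oracle samples per state-action pair. The names `GenerativeModel` and `sampled_bellman_backup` are illustrative assumptions, not from the paper, and this naive backup is not the paper's algorithm; it only illustrates how samples from the oracle are consumed.

```python
import numpy as np

class GenerativeModel:
    """Hypothetical generative sampling oracle for a DMDP.

    Given any state-action pair (s, a), it returns a next state drawn from the
    (hidden) transition distribution P(. | s, a) in O(1) time, together with
    the reward r(s, a) in [0, 1].
    """

    def __init__(self, P, r, rng=None):
        self.P = P                      # |S| x |A| x |S| transition tensor, hidden from the learner
        self.r = r                      # |S| x |A| reward matrix with entries in [0, 1]
        self.rng = rng or np.random.default_rng(0)

    def sample(self, s, a):
        s_next = self.rng.choice(self.P.shape[2], p=self.P[s, a])
        return s_next, self.r[s, a]


def sampled_bellman_backup(oracle, v, num_states, num_actions, gamma, m):
    """One approximate Bellman backup using m oracle samples per (s, a).

    Illustration only: the paper's method is a more refined scheme that
    achieves the stated near-optimal sample complexity.
    """
    q = np.zeros((num_states, num_actions))
    for s in range(num_states):
        for a in range(num_actions):
            draws = [oracle.sample(s, a) for _ in range(m)]
            avg_next_value = np.mean([v[s_next] for s_next, _ in draws])
            q[s, a] = draws[0][1] + gamma * avg_next_value
    greedy_policy = q.argmax(axis=1)   # greedy policy w.r.t. the sampled Q-values
    return q.max(axis=1), greedy_policy
```

Each such backup consumes |S||A|·m samples, so the total sample count is governed by how many samples per state-action pair suffice; the bound in the abstract says that, overall, on the order of (1 − γ)⁻³ε⁻² samples per pair (up to logarithmic factors) are enough for an ε-optimal policy.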
| Original language | English (US) |
| --- | --- |
| Pages (from-to) | 5186-5196 |
| Number of pages | 11 |
| Journal | Advances in Neural Information Processing Systems |
| Volume | 2018-December |
| State | Published - 2018 |
| Event | 32nd Conference on Neural Information Processing Systems, NeurIPS 2018, Montreal, Canada; Dec 2 2018 → Dec 8 2018 |
All Science Journal Classification (ASJC) codes
- Computer Networks and Communications
- Information Systems
- Signal Processing