## Abstract

In this paper we consider the problem of computing an ε-optimal policy of a discounted Markov Decision Process (DMDP), provided we can only access its transition function through a generative sampling model that, given any state-action pair, samples from the transition function in O(1) time. Given such a DMDP with states S, actions A, discount factor γ ∈ (0, 1), and rewards in range [0, 1], we provide an algorithm which computes an ε-optimal policy with probability 1 − δ, where both the time spent and the number of samples taken are upper bounded by O( (|S||A|) / ((1 − γ)^3 ε^2) · log( (|S||A|) / ((1 − γ)δε) ) · log( 1 / ((1 − γ)ε) ) ). For fixed values of ε, this improves upon the previous best known bounds by a factor of (1 − γ)^{-1} and matches the sample complexity lower bounds proved in [AMK13] up to logarithmic factors. We also extend our method to computing ε-optimal policies for finite-horizon MDPs with a generative model and provide a nearly matching sample complexity lower bound.
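As a point of reference, the generative sampling model assumed above can be sketched as follows. This is an illustrative tabular implementation, not the paper's algorithm; the class name and the explicit transition/reward arrays are assumptions introduced here for clarity.

```python
import numpy as np

class GenerativeModel:
    """Generative sampling model for a tabular DMDP: given a state-action
    pair (s, a), draw a next state from P(. | s, a) and return the
    reward r(s, a) in [0, 1]. Illustrative sketch only."""

    def __init__(self, P, r, seed=0):
        # P: array of shape (|S|, |A|, |S|); each P[s, a] sums to 1.
        # r: array of shape (|S|, |A|) with entries in [0, 1].
        self.P = P
        self.r = r
        self.rng = np.random.default_rng(seed)
        self.num_states = P.shape[0]

    def sample(self, s, a):
        # One draw from the transition distribution for (s, a).
        # This naive draw costs O(|S|); the model in the paper is
        # assumed to answer such queries in O(1) time.
        s_next = self.rng.choice(self.num_states, p=self.P[s, a])
        return s_next, self.r[s, a]
```

A sample-based planner would issue many such queries per state-action pair; the abstract's bound counts exactly these queries.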

| Original language | English (US) |
|---|---|
| Pages (from-to) | 5186-5196 |
| Number of pages | 11 |
| Journal | Advances in Neural Information Processing Systems |
| Volume | 2018-December |
| State | Published - Jan 1 2018 |
| Event | 32nd Conference on Neural Information Processing Systems, NeurIPS 2018 - Montreal, Canada. Duration: Dec 2 2018 → Dec 8 2018 |

## All Science Journal Classification (ASJC) codes

- Computer Networks and Communications
- Information Systems
- Signal Processing