Near-optimal time and sample complexities for solving Markov decision processes with a generative model

Aaron Sidford, Mengdi Wang, Xian Wu, Lin F. Yang, Yinyu Ye

Research output: Contribution to journal › Conference article › peer-review

104 Scopus citations


In this paper we consider the problem of computing an ε-optimal policy of a discounted Markov Decision Process (DMDP), provided we can only access its transition function through a generative sampling model that, given any state-action pair, samples from the transition function in O(1) time. Given such a DMDP with states S, actions A, discount factor γ ∈ (0, 1), and rewards in range [0, 1], we provide an algorithm which computes an ε-optimal policy with probability 1 − δ, where both the time spent and the number of samples taken are upper bounded by O((|S||A| / ((1 − γ)³ε²)) · log(|S||A| / ((1 − γ)δε)) · log(1/(1 − γ))). For fixed values of ε, this improves upon the previous best known bounds by a factor of 1/(1 − γ) and matches the sample complexity lower bounds proved in [AMK13] up to logarithmic factors. We also extend our method to computing ε-optimal policies for finite-horizon MDPs with a generative model, and provide a nearly matching sample complexity lower bound.
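To make the generative-model setting concrete, here is a minimal sketch (not the paper's algorithm) of the access model the abstract describes: a toy DMDP whose transition function can only be queried by sampling, plus one round of approximate value iteration in which each expectation over next states is replaced by an empirical mean of generative calls. The toy MDP, function names, and sample counts below are all illustrative assumptions.

```python
import random

# Toy DMDP (illustrative): 2 states, 2 actions, rewards in [0, 1].
GAMMA = 0.9
P = {  # P[(s, a)] = list of (next_state, probability)
    (0, 0): [(0, 0.7), (1, 0.3)],
    (0, 1): [(1, 1.0)],
    (1, 0): [(0, 0.4), (1, 0.6)],
    (1, 1): [(0, 1.0)],
}
R = {(0, 0): 0.0, (0, 1): 0.5, (1, 0): 1.0, (1, 1): 0.2}

def sample_next_state(s, a, rng):
    """Generative model: draw s' ~ P(.|s, a) in O(1) expected time.

    The solver only calls this sampler; it never reads P directly.
    """
    r = rng.random()
    acc = 0.0
    for s_next, p in P[(s, a)]:
        acc += p
        if r <= acc:
            return s_next
    return P[(s, a)][-1][0]  # guard against floating-point rounding

def sampled_bellman_backup(v, num_samples, rng):
    """One approximate value-iteration sweep: each expectation E[v(s')]
    is estimated by an empirical mean over num_samples generative calls."""
    v_new = {}
    for s in (0, 1):
        best = float("-inf")
        for a in (0, 1):
            est = sum(v[sample_next_state(s, a, rng)]
                      for _ in range(num_samples)) / num_samples
            best = max(best, R[(s, a)] + GAMMA * est)
        v_new[s] = best
    return v_new

rng = random.Random(0)
v = {0: 0.0, 1: 0.0}
for _ in range(50):
    v = sampled_bellman_backup(v, num_samples=200, rng=rng)
```

This naive scheme already converges to near-optimal values with enough samples per backup; the paper's contribution is an algorithm whose total number of such generative calls matches the lower bound up to logarithmic factors, rather than paying a fresh batch of samples at every sweep.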

Original language: English (US)
Pages (from-to): 5186-5196
Number of pages: 11
Journal: Advances in Neural Information Processing Systems
State: Published - 2018
Event: 32nd Conference on Neural Information Processing Systems, NeurIPS 2018 - Montreal, Canada
Duration: Dec 2 2018 - Dec 8 2018

All Science Journal Classification (ASJC) codes

  • Computer Networks and Communications
  • Information Systems
  • Signal Processing
