### Abstract

We propose a novel randomized linear programming algorithm for approximating the optimal policy of the discounted-reward and average-reward Markov decision problems. By leveraging the value-policy duality, the algorithm adaptively samples state-action-state transitions and makes exponentiated primal-dual updates. We show that it ﬁnds an ε-optimal policy using nearly linear runtime in the worst case for a ﬁxed value of the discount factor. When the Markov decision process is ergodic and speciﬁed in some special data formats, for ﬁxed values of certain ergodicity parameters, the algorithm ﬁnds an ε-optimal policy using sample size and time linear in the total number of state-action pairs, which is sublinear in the input size. These results provide a new venue and complexity benchmarks for solving stochastic dynamic programs.

Original language | English (US) |
---|---|

Pages (from-to) | 517-546 |

Number of pages | 30 |

Journal | Mathematics of Operations Research |

Volume | 45 |

Issue number | 2 |

DOIs | |

State | Published - May 2020 |

### All Science Journal Classification (ASJC) codes

- Mathematics(all)
- Computer Science Applications
- Management Science and Operations Research

### Keywords

- Duality
- Linear programming
- Markov decision process
- Primal-dual method
- Randomized algorithm
- Runtime complexity
- Stochastic approximation