TY - JOUR
T1 - Variance reduced value iteration and faster algorithms for solving Markov decision processes
AU - Sidford, Aaron
AU - Wang, Mengdi
AU - Wu, Xian
AU - Ye, Yinyu
N1 - Funding Information:
We would like to thank Daochen Wang for pointing out an error in the previous version of our work regarding additive terms in our sublinear complexity bounds and the range of ϵ for which these results hold. We would like to thank Lin Yang for his generous help, meticulous proofreading, constructive suggestions, and useful discussions. We would also like to thank Moses Charikar for his support and encouragement in pursuing this research topic, and Ben Van Roy for excellent courses on Markov decision processes. Xian Wu gratefully acknowledges funding from Stanford School of Engineering and the Department of Management Science and Engineering. Aaron Sidford is supported in part by a Microsoft Research Faculty Fellowship, NSF CAREER Award CCF-1844855, NSF Grant CCF-1955039, a PayPal research gift award, and a Sloan Research Fellowship.
Funding Information:
Microsoft Research Faculty Fellowship; NSF CAREER Award CCF-1844855; NSF Grant CCF-1955039; PayPal research gift award; Sloan Research Fellowship; Stanford University Department of Management Science and Engineering; Stanford University School of Engineering
Publisher Copyright:
© 2021 Wiley Periodicals LLC
PY - 2021
Y1 - 2021
N2 - In this paper we provide faster algorithms for approximately solving discounted Markov decision processes in multiple parameter regimes. Given a discounted Markov decision process (DMDP) with |S| states, |A| actions, discount factor γ ∈ (0, 1), and rewards in the range [−M, M], we show how to compute an ϵ-optimal policy, with probability 1 − δ, in time Õ((|S|²|A| + |S||A|/(1 − γ)³) log(M/ϵ) log(1/δ)). (Note: We use Õ to hide polylogarithmic factors in the input parameters.) This contribution reflects the first nearly linear time, nearly linearly convergent algorithm for solving DMDPs for intermediate values of γ. We also show how to obtain improved sublinear time algorithms provided we can sample from the transition function in O(1) time. Under this assumption we provide an algorithm which computes an ϵ-optimal policy for (Formula presented.) with probability 1 − δ in time (Formula presented.). Furthermore, we extend both these algorithms to solve finite horizon MDPs. Our algorithms improve upon the previous best for approximately computing optimal policies for fixed-horizon MDPs in multiple parameter regimes. Interestingly, we obtain our results by a careful modification of approximate value iteration. We show how to combine classic approximate value iteration analysis with new techniques in variance reduction. Our fastest algorithms leverage further insights to ensure that our algorithms make monotonic progress towards the optimal value. This paper is one of the few instances of using sampling to obtain a linearly convergent linear programming algorithm, and we hope that the analysis may be useful more broadly.
AB - In this paper we provide faster algorithms for approximately solving discounted Markov decision processes in multiple parameter regimes. Given a discounted Markov decision process (DMDP) with |S| states, |A| actions, discount factor γ ∈ (0, 1), and rewards in the range [−M, M], we show how to compute an ϵ-optimal policy, with probability 1 − δ, in time Õ((|S|²|A| + |S||A|/(1 − γ)³) log(M/ϵ) log(1/δ)). (Note: We use Õ to hide polylogarithmic factors in the input parameters.) This contribution reflects the first nearly linear time, nearly linearly convergent algorithm for solving DMDPs for intermediate values of γ. We also show how to obtain improved sublinear time algorithms provided we can sample from the transition function in O(1) time. Under this assumption we provide an algorithm which computes an ϵ-optimal policy for (Formula presented.) with probability 1 − δ in time (Formula presented.). Furthermore, we extend both these algorithms to solve finite horizon MDPs. Our algorithms improve upon the previous best for approximately computing optimal policies for fixed-horizon MDPs in multiple parameter regimes. Interestingly, we obtain our results by a careful modification of approximate value iteration. We show how to combine classic approximate value iteration analysis with new techniques in variance reduction. Our fastest algorithms leverage further insights to ensure that our algorithms make monotonic progress towards the optimal value. This paper is one of the few instances of using sampling to obtain a linearly convergent linear programming algorithm, and we hope that the analysis may be useful more broadly.
KW - Markov decision processes
KW - linear programming algorithm
KW - value iteration
UR - http://www.scopus.com/inward/record.url?scp=85104679194&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85104679194&partnerID=8YFLogxK
U2 - 10.1002/nav.21992
DO - 10.1002/nav.21992
M3 - Article
AN - SCOPUS:85104679194
SN - 0894-069X
JO - Naval Research Logistics
JF - Naval Research Logistics
ER -