TY - JOUR
T1 - On the theory of policy gradient methods
T2 - Optimality, approximation, and distribution shift
AU - Agarwal, Alekh
AU - Kakade, Sham M.
AU - Lee, Jason D.
AU - Mahajan, Gaurav
N1 - Funding Information:
We thank the anonymous reviewers who provided detailed and constructive feedback that helped us significantly improve the presentation and exposition. Sham Kakade and Alekh Agarwal gratefully acknowledge numerous helpful discussions with Wen Sun regarding the Q-NPG algorithm and our notion of transfer error. We also acknowledge numerous helpful comments from Ching-An Cheng and Andrea Zanette on an earlier draft of this work. We thank Nan Jiang, Bruno Scherrer, and Matthieu Geist for their comments on the relationship between concentrability coefficients, the condition number, and the transfer error; this discussion ultimately led to Corollary 21. Sham Kakade acknowledges funding from the Washington Research Foundation for Innovation in Data-intensive Discovery, the ONR award N00014-18-1-2247, and the DARPA award FA8650-18-2-7836. Jason D. Lee acknowledges support of the ARO under MURI Award W911NF-11-1-0303. This is part of the collaboration between US DOD, UK MOD and UK Engineering and Physical Sciences Research Council (EPSRC) under the Multidisciplinary University Research Initiative.
Publisher Copyright:
© 2021 Microtome Publishing. All rights reserved.
PY - 2021
Y1 - 2021
AB - Policy gradient methods are among the most effective methods in challenging reinforcement learning problems with large state and/or action spaces. However, little is known about even their most basic theoretical convergence properties, including: if and how fast they converge to a globally optimal solution, or how they cope with approximation error due to using a restricted class of parametric policies. This work provides provable characterizations of the computational, approximation, and sample size properties of policy gradient methods in the context of discounted Markov Decision Processes (MDPs). We focus on both: "tabular" policy parameterizations, where the optimal policy is contained in the class and where we show global convergence to the optimal policy; and parametric policy classes (considering both log-linear and neural policy classes), which may not contain the optimal policy and where we provide agnostic learning results. One central contribution of this work is in providing approximation guarantees that are average case, which avoid explicit worst-case dependencies on the size of the state space, by making a formal connection to supervised learning under distribution shift. This characterization shows an important interplay between estimation error, approximation error, and exploration (as characterized through a precisely defined condition number).
KW - Policy Gradient
KW - Reinforcement Learning
UR - http://www.scopus.com/inward/record.url?scp=85107297227&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85107297227&partnerID=8YFLogxK
M3 - Article
AN - SCOPUS:85107297227
VL - 22
JO - Journal of Machine Learning Research
JF - Journal of Machine Learning Research
SN - 1532-4435
ER -