Optimality and Approximation with Policy Gradient Methods in Markov Decision Processes

Alekh Agarwal, Sham M. Kakade, Jason D. Lee, Gaurav Mahajan

Research output: Contribution to journalConference articlepeer-review

116 Scopus citations


Policy gradient (PG) methods are among the most effective methods in challenging reinforcement learning problems with large state and/or action spaces. However, little is known about even their most basic theoretical convergence properties, including: if and how fast they converge to a globally optimal solution (say with a sufficiently rich policy class); how they cope with approximation error due to using a restricted class of parametric policies; or their finite sample behavior. Such characterizations are important not only to compare these methods to their approximate value function counterparts (where such issues are relatively well understood, at least in the worst case), but also to help with more principled approaches to algorithm design. This work provides provable and comprehensive characterizations of computational, approximation, and sample size issues with regards to policy gradient methods in the context of discounted Markov Decision Processes (MDPs). We focus on both: 1) “tabular” policy parameterizations, where the optimal policy is contained in the class and where we show global convergence to the optimal policy, and 2) restricted policy classes, which may not contain the optimal policy and where we provide agnostic learning results. In the tabular setting with exact gradients, our main results for the softmax policy parameterization show: asymptotic convergence to the global optimum; a polynomial convergence rate provided an additional KL-based entropy regularizer is used; and a dimension-free, “fast-rate” convergence to the global optimum using the Natural Policy Gradient (NPG). With regards to function approximation, we further analyze NPG with inexact gradients, where the policy parameterization may not contain the optimal policy. Under certain smoothness assumptions on the policy parameterization, we establish rates of convergence in terms of the quality of the initial state distribution and the quality of a certain compatible function approximation. One insight of this work is in formalizing how a favorable initial state distribution provides a means to circumvent worst-case exploration issues. Overall, these results place PG methods under a solid theoretical footing, analogous to the global convergence guarantees of iterative value function based algorithms.

Original languageEnglish (US)
Pages (from-to)64-66
Number of pages3
JournalProceedings of Machine Learning Research
StatePublished - 2020
Event33rd Conference on Learning Theory, COLT 2020 - Virtual, Online, Austria
Duration: Jul 9 2020Jul 12 2020

All Science Journal Classification (ASJC) codes

  • Artificial Intelligence
  • Software
  • Control and Systems Engineering
  • Statistics and Probability


Dive into the research topics of 'Optimality and Approximation with Policy Gradient Methods in Markov Decision Processes'. Together they form a unique fingerprint.

Cite this