TY - GEN
T1 - On the Convergence and Sample Efficiency of Variance-Reduced Policy Gradient Method
AU - Zhang, Junyu
AU - Ni, Chengzhuo
AU - Yu, Zheng
AU - Szepesvari, Csaba
AU - Wang, Mengdi
N1 - Publisher Copyright:
© 2021 Neural information processing systems foundation. All rights reserved.
PY - 2021
Y1 - 2021
AB - Policy gradient (PG) gives rise to a rich class of reinforcement learning (RL) methods. Recently, there has been an emerging trend to accelerate existing PG methods such as REINFORCE with variance reduction techniques. However, all existing variance-reduced PG methods heavily rely on an uncheckable importance weight assumption made for every single iteration of the algorithms. In this paper, a simple gradient truncation mechanism is proposed to address this issue. Moreover, we design a Truncated Stochastic Incremental Variance-Reduced Policy Gradient (TSIVR-PG) method, which is able to maximize not only a cumulative sum of rewards but also a general utility function over a policy's long-term visiting distribution. We show an Õ(ε⁻³) sample complexity for TSIVR-PG to find an ε-stationary policy. By assuming the overparameterization of the policy and exploiting the hidden convexity of the problem, we further show that TSIVR-PG converges to a globally ε-optimal policy with Õ(ε⁻²) samples.
UR - http://www.scopus.com/inward/record.url?scp=85131836846&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85131836846&partnerID=8YFLogxK
M3 - Conference contribution
AN - SCOPUS:85131836846
T3 - Advances in Neural Information Processing Systems
SP - 2228
EP - 2240
BT - Advances in Neural Information Processing Systems 34 - 35th Conference on Neural Information Processing Systems, NeurIPS 2021
A2 - Ranzato, Marc'Aurelio
A2 - Beygelzimer, Alina
A2 - Dauphin, Yann
A2 - Liang, Percy S.
A2 - Wortman Vaughan, Jenn
PB - Neural information processing systems foundation
T2 - 35th Conference on Neural Information Processing Systems, NeurIPS 2021
Y2 - 6 December 2021 through 14 December 2021
ER -