TY - GEN
T1 - Optimistic Natural Policy Gradient
T2 - 37th Conference on Neural Information Processing Systems, NeurIPS 2023
AU - Liu, Qinghua
AU - Jin, Chi
AU - Weisz, Gellért
AU - György, András
AU - Szepesvári, Csaba
N1 - Publisher Copyright:
© 2023 Neural Information Processing Systems Foundation. All rights reserved.
PY - 2023
Y1 - 2023
N2 - While policy optimization algorithms have played an important role in the recent empirical success of Reinforcement Learning (RL), the existing theoretical understanding of policy optimization remains rather limited: existing results are either restricted to tabular MDPs or suffer from highly suboptimal sample complexity, especially in online RL where exploration is necessary. This paper proposes a simple and efficient policy optimization framework, OPTIMISTIC NPG, for online RL. OPTIMISTIC NPG can be viewed as a simple combination of the classic natural policy gradient (NPG) algorithm [Kakade, 2001] and an optimistic policy evaluation subroutine that encourages exploration. For d-dimensional linear MDPs, OPTIMISTIC NPG is computationally efficient and learns an ϵ-optimal policy within Õ(d²/ϵ³) samples, making it the first computationally efficient algorithm whose sample complexity has the optimal dimension dependence Θ̃(d²). It also improves over the state-of-the-art results for policy optimization algorithms [Zanette et al., 2021] by a factor of d. For general function approximation, which subsumes linear MDPs, OPTIMISTIC NPG is, to the best of our knowledge, the first policy optimization algorithm to achieve polynomial sample complexity for learning near-optimal policies.
AB - While policy optimization algorithms have played an important role in the recent empirical success of Reinforcement Learning (RL), the existing theoretical understanding of policy optimization remains rather limited: existing results are either restricted to tabular MDPs or suffer from highly suboptimal sample complexity, especially in online RL where exploration is necessary. This paper proposes a simple and efficient policy optimization framework, OPTIMISTIC NPG, for online RL. OPTIMISTIC NPG can be viewed as a simple combination of the classic natural policy gradient (NPG) algorithm [Kakade, 2001] and an optimistic policy evaluation subroutine that encourages exploration. For d-dimensional linear MDPs, OPTIMISTIC NPG is computationally efficient and learns an ϵ-optimal policy within Õ(d²/ϵ³) samples, making it the first computationally efficient algorithm whose sample complexity has the optimal dimension dependence Θ̃(d²). It also improves over the state-of-the-art results for policy optimization algorithms [Zanette et al., 2021] by a factor of d. For general function approximation, which subsumes linear MDPs, OPTIMISTIC NPG is, to the best of our knowledge, the first policy optimization algorithm to achieve polynomial sample complexity for learning near-optimal policies.
UR - https://www.scopus.com/pages/publications/85175516716
M3 - Conference contribution
AN - SCOPUS:85175516716
T3 - Advances in Neural Information Processing Systems
BT - Advances in Neural Information Processing Systems 36 - 37th Conference on Neural Information Processing Systems, NeurIPS 2023
A2 - Oh, A.
A2 - Naumann, T.
A2 - Globerson, A.
A2 - Saenko, K.
A2 - Hardt, M.
A2 - Levine, S.
PB - Neural Information Processing Systems Foundation
Y2 - 10 December 2023 through 16 December 2023
ER -
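
The update at the heart of the framework described in the abstract admits a one-line summary. A minimal sketch in standard NPG notation, with symbols assumed here rather than taken from this record (η is a step size and Q̄ₜ is the optimistic Q-estimate produced by the policy evaluation subroutine at iteration t):

\pi_{t+1}(a \mid s) \;\propto\; \pi_t(a \mid s)\,\exp\!\bigl(\eta\,\bar{Q}_t(s,a)\bigr)

Optimism enters only through Q̄ₜ (e.g., a bonus-augmented least-squares estimate in the linear MDP case), so the policy update itself retains the classic multiplicative-weights form of NPG.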