TY - GEN
T1 - Provably efficient exploration in policy optimization
AU - Cai, Qi
AU - Yang, Zhuoran
AU - Jin, Chi
AU - Wang, Zhaoran
N1 - Publisher Copyright:
© 2020 37th International Conference on Machine Learning, ICML 2020. All rights reserved.
PY - 2020
Y1 - 2020
AB - While policy-based reinforcement learning (RL) has achieved tremendous success in practice, it is significantly less understood in theory, especially compared with value-based RL. In particular, it remains elusive how to design a provably efficient policy optimization algorithm that incorporates exploration. To bridge this gap, this paper proposes an Optimistic variant of the Proximal Policy Optimization algorithm (OPPO), which follows an "optimistic version" of the policy gradient direction. This paper proves that, in the problem of an episodic Markov decision process with linear function approximation, unknown transition, and adversarial reward with full-information feedback, OPPO achieves Õ(√(d²H³T)) regret. Here d is the feature dimension, H is the episode horizon, and T is the total number of steps. To the best of our knowledge, OPPO is the first provably efficient policy optimization algorithm that explores.
UR - http://www.scopus.com/inward/record.url?scp=85098439513&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85098439513&partnerID=8YFLogxK
M3 - Conference contribution
AN - SCOPUS:85098439513
T3 - 37th International Conference on Machine Learning, ICML 2020
SP - 1260
EP - 1271
BT - 37th International Conference on Machine Learning, ICML 2020
A2 - Daumé III, Hal
A2 - Singh, Aarti
PB - International Machine Learning Society (IMLS)
T2 - 37th International Conference on Machine Learning, ICML 2020
Y2 - 13 July 2020 through 18 July 2020
ER -