Abstract
While policy-based reinforcement learning (RL) achieves tremendous success in practice, it is significantly less understood in theory, especially compared with value-based RL. In particular, it remains elusive how to design a provably efficient policy optimization algorithm that incorporates exploration. To bridge this gap, this paper proposes an Optimistic variant of the Proximal Policy Optimization algorithm (OPPO), which follows an "optimistic version" of the policy gradient direction. This paper proves that, in the problem of episodic Markov decision processes with linear function approximation, unknown transitions, and adversarial rewards with full-information feedback, OPPO achieves a $\tilde{O}(\sqrt{d^2 H^3 T})$ regret. Here $d$ is the feature dimension, $H$ is the episode horizon, and $T$ is the total number of steps. To the best of our knowledge, OPPO is the first provably efficient policy optimization algorithm that explores.

The authors would like to thank Lingxiao Wang, Wen Sun, and Sham Kakade for pointing out a technical issue in the initial version regarding the covering number of value functions in the linear setting. This issue is fixed in the camera-ready version with a definition of the linear MDP different from the one in the initial version. The authors would also like to thank Csaba Szepesvári, Lin F. Yang, Yining Wang, and Simon S. Du for helpful discussions, and the anonymous reviewers and program chairs for their valuable comments and help in preparing the camera-ready version.
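To make the "optimistic policy gradient" idea concrete, the sketch below (Python, not from the paper) illustrates the two ingredients such a scheme alternates in the linear setting: a least-squares Q estimate inflated by a UCB-style exploration bonus, and a KL-regularized (proximal) policy update. All function names, the bonus coefficient `beta`, and the step size `alpha` are illustrative assumptions rather than the authors' implementation.

```python
import numpy as np

def optimistic_q(phi, Lambda_inv, w, beta, H):
    """Optimistic action-value estimate: linear least-squares fit plus a
    UCB-style bonus beta * sqrt(phi^T Lambda^{-1} phi), clipped to [0, H]."""
    q_hat = phi @ w                                            # least-squares estimate
    bonus = beta * np.sqrt(np.einsum("ij,jk,ik->i", phi, Lambda_inv, phi))
    return np.clip(q_hat + bonus, 0.0, H)

def proximal_policy_update(prev_policy, q_values, alpha):
    """KL-regularized (mirror-descent) improvement: pi_new ∝ pi_old * exp(alpha * Q)."""
    logits = np.log(prev_policy + 1e-12) + alpha * q_values
    logits -= logits.max()                                     # numerical stability
    new_policy = np.exp(logits)
    return new_policy / new_policy.sum()

# Toy usage: one state, 3 actions, d = 2 features per state-action pair.
phi = np.array([[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]])           # feature map phi(s, a)
Lambda_inv = np.linalg.inv(2.0 * np.eye(2))                    # inverse regularized Gram matrix
w = np.array([0.3, 0.7])                                       # fitted weight vector
q = optimistic_q(phi, Lambda_inv, w, beta=0.5, H=10)
pi = proximal_policy_update(np.ones(3) / 3, q, alpha=0.1)
```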
| Original language | English (US) |
|---|---|
| Journal | Proceedings of Machine Learning Research |
| Volume | 119 |
| State | Published - 2020 |
| Externally published | Yes |
| Event | 37th International Conference on Machine Learning, ICML 2020 - Virtual, Online |
| Event duration | Jul 13 2020 → Jul 18 2020 |
All Science Journal Classification (ASJC) codes
- Software
- Control and Systems Engineering
- Statistics and Probability
- Artificial Intelligence