Abstract
Reinforcement learning (RL) is applied to control problems with large state and action spaces, so it is natural to consider RL with a parametric model. In this paper we focus on finite-horizon episodic RL where the transition model admits the linear parametrization Pθ(s′ | s, a) = ∑_{j=1}^{d} θ_j P_j(s′ | s, a). This parametrization provides universal function approximation and captures several useful models and applications. We propose an upper confidence model-based RL algorithm with value-targeted model parameter estimation. The algorithm updates the estimate of θ by recursively solving a regression problem whose target is the latest value estimate. We demonstrate the efficiency of our algorithm by proving an expected regret bound of Õ(d√(H³T)), where H, T, d are the horizon, total number of steps, and dimension of θ. This regret bound is independent of the total number of states or actions, and is close to a lower bound Ω(√(HdT)).
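The value-targeted estimation step described in the abstract can be sketched as recursive ridge regression: given basis models P_j, the feature of a transition (s, a) is the vector of expected next-state values under each P_j, and the regression target is the realized value of the next state. The class and helper names below are illustrative, not from the paper, and this is a minimal sketch assuming the basis kernels are stored as (state-action × next-state) matrices.

```python
import numpy as np

def value_targeted_feature(basis, sa, V):
    """Feature x_j = E_{s' ~ P_j(.|s,a)}[V(s')] for each basis model j.

    basis: list of d arrays, each of shape (num_state_actions, num_states),
           where row `sa` is the next-state distribution P_j(.|s,a).
    V: current value estimate, shape (num_states,).
    """
    return np.array([B[sa] @ V for B in basis])

class ValueTargetedRegression:
    """Sketch of the recursive ridge-regression update for theta.

    Maintains the Gram matrix A = lam*I + sum_t x_t x_t^T and the
    moment vector b = sum_t y_t x_t, so the ridge estimate is A^{-1} b.
    """
    def __init__(self, d, lam=1.0):
        self.A = lam * np.eye(d)
        self.b = np.zeros(d)

    def update(self, x, y):
        # Rank-one update with the new feature/target pair.
        self.A += np.outer(x, x)
        self.b += y * x

    def estimate(self):
        return np.linalg.solve(self.A, self.b)
```

With noise-free targets y = E_{s' ~ P_θ(.|s,a)}[V(s')], the target is exactly linear in the features, so the estimate recovers θ up to the small ridge bias.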
Original language | English (US) |
---|---|
Pages (from-to) | 666-686 |
Number of pages | 21 |
Journal | Proceedings of Machine Learning Research |
Volume | 120 |
State | Published - 2020 |
Externally published | Yes |
Event | 2nd Annual Conference on Learning for Dynamics and Control, L4DC 2020 - Berkeley, United States |
Duration | Jun 10 2020 → Jun 11 2020 |
All Science Journal Classification (ASJC) codes
- Artificial Intelligence
- Software
- Control and Systems Engineering
- Statistics and Probability