Abstract
In light of Bellman duality, we propose a novel value-policy gradient algorithm that explores and acts in an infinite-horizon Average-reward Markov Decision Process (AMDP), and we show that it achieves sublinear regret. The algorithm is motivated by the Bellman saddle point formulation. It learns the optimal state-action distribution, which encodes a randomized policy, by interacting with the environment along a single trajectory and making primal-dual updates. The key to the analysis is to establish a connection between the min-max duality gap of the Bellman saddle point and the cumulative regret of the learning agent. We show that, for ergodic AMDPs with finite state space $S$, finite action space $A$, and uniformly bounded mixing times, the algorithm's $T$-time-step regret is $R(T) = \tilde{O}\big( (t^*_{\mathrm{mix}})^2 \, \tau^{3/2} \sqrt{(\tau^3 + |A|)\,|S|\,T} \big)$, where $t^*_{\mathrm{mix}}$ is the worst-case mixing time, $\tau$ is an ergodicity parameter, $T$ is the number of time steps, and $\tilde{O}$ hides polylog factors.
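The saddle-point viewpoint described above can be made concrete with a small offline sketch. The block below is a minimal illustration, not the paper's algorithm: it assumes the transition kernel `P` and reward `r` are known and uses exact gradients, whereas the paper's method estimates these quantities from a single trajectory while interacting with the environment. The function name `primal_dual_amdp`, the step sizes `eta_mu` and `eta_v`, and the dual bound `v_bound` are illustrative choices. The primal player applies an exponentiated-gradient (mirror ascent) update to the state-action distribution, in the spirit of the "exponentiated gradient" keyword below, while the dual player applies projected gradient descent to the value function.

```python
import numpy as np

def primal_dual_amdp(P, r, T=5000, eta_mu=0.05, eta_v=0.05, v_bound=10.0):
    """Primal-dual sketch for the AMDP saddle point
        max_mu min_v  sum_{s,a} mu(s,a) * ( r(s,a) + sum_{s'} P(s'|s,a) v(s') - v(s) ).
    P has shape (S, A, S) and r has shape (S, A); both are assumed known here,
    whereas the paper's algorithm estimates the gradients from a single trajectory.
    """
    S, A = r.shape
    mu = np.full((S, A), 1.0 / (S * A))   # primal: state-action distribution on the simplex
    v = np.zeros(S)                       # dual: differential value (bias) function

    for _ in range(T):
        # Primal gradient: reward plus expected next-state value minus current value.
        grad_mu = r + P @ v - v[:, None]                  # shape (S, A)
        # Exponentiated-gradient step keeps mu a probability distribution.
        mu = mu * np.exp(eta_mu * grad_mu)
        mu /= mu.sum()

        # Dual gradient: stationarity (flow-conservation) residual of mu.
        inflow = np.einsum('sa,sap->p', mu, P)            # sum_{s,a} mu(s,a) P(s'|s,a)
        outflow = mu.sum(axis=1)                          # mu(s)
        grad_v = inflow - outflow
        # Gradient-descent step for the minimizing player, kept in a bounded box.
        v = np.clip(v - eta_v * grad_v, -v_bound, v_bound)

    # The learned mu encodes a randomized policy pi(a | s) = mu(s, a) / sum_a mu(s, a).
    pi = mu / np.maximum(mu.sum(axis=1, keepdims=True), 1e-12)
    return mu, v, pi
```

For example, `mu, v, pi = primal_dual_amdp(P, r)` returns the learned state-action distribution together with the randomized policy it encodes.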
Original language | English (US) |
---|---|
Pages (from-to) | 191-200 |
Number of pages | 10 |
Journal | Proceedings of Machine Learning Research |
Volume | 120 |
State | Published - 2020 |
Externally published | Yes |
Event | 2nd Annual Conference on Learning for Dynamics and Control, L4DC 2020 - Berkeley, United States; Duration: Jun 10 2020 → Jun 11 2020 |
All Science Journal Classification (ASJC) codes
- Artificial Intelligence
- Software
- Control and Systems Engineering
- Statistics and Probability
Keywords
- Markov decision process
- exponentiated gradient
- online learning
- primal-dual method
- regret analysis
- reinforcement learning
- saddle point