## Abstract

In light of the Bellman duality, we propose a novel value-policy gradient algorithm to explore and act in infinite-horizon Average-reward Markov Decision Process (AMDP) and show that it has sublinear regret. The algorithm is motivated by the Bellman saddle point formulation. It learns the optimal state-action distribution, which encodes a randomized policy, by interacting with the environment along a single trajectory and making primal-dual updates. The key to the analysis is to establish a connection between the min-max duality gap of Bellman saddle point and the cumulative regret of the learning agent. We show that, for ergodic AMDPs with finite state space S and action space A and uniformly bounded mixing times, the algorithm's T-time step regret is R(T) = Õ ( (t^{∗}_{mix})^{2} τ ^{3 2 p}(τ^{3} + |A|)|S|T), where t^{∗}_{mix} is the worst-case mixing time, τ is an ergodicity parameter, T is the number of time steps and Õ hides polylog factors.

Original language | English (US) |
---|---|

Pages (from-to) | 191-200 |

Number of pages | 10 |

Journal | Proceedings of Machine Learning Research |

Volume | 120 |

State | Published - 2020 |

Externally published | Yes |

Event | 2nd Annual Conference on Learning for Dynamics and Control, L4DC 2020 - Berkeley, United States Duration: Jun 10 2020 → Jun 11 2020 |

## All Science Journal Classification (ASJC) codes

- Artificial Intelligence
- Software
- Control and Systems Engineering
- Statistics and Probability

## Keywords

- Markov decision process
- exponentiated gradient
- online learning
- primal-dual method
- regret analysis
- reinforcement learning
- saddle point