A Duality Approach for Regret Minimization in Average-Reward Ergodic Markov Decision Processes

Hao Gong, Mengdi Wang

Research output: Contribution to journalConference articlepeer-review

Abstract

In light of the Bellman duality, we propose a novel value-policy gradient algorithm to explore and act in infinite-horizon Average-reward Markov Decision Process (AMDP) and show that it has sublinear regret. The algorithm is motivated by the Bellman saddle point formulation. It learns the optimal state-action distribution, which encodes a randomized policy, by interacting with the environment along a single trajectory and making primal-dual updates. The key to the analysis is to establish a connection between the min-max duality gap of Bellman saddle point and the cumulative regret of the learning agent. We show that, for ergodic AMDPs with finite state space S and action space A and uniformly bounded mixing times, the algorithm's T-time step regret is R(T) = Õ ( (tmix)2 τ 3 2 p3 + |A|)|S|T), where tmix is the worst-case mixing time, τ is an ergodicity parameter, T is the number of time steps and Õ hides polylog factors.

Original languageEnglish (US)
Pages (from-to)191-200
Number of pages10
JournalProceedings of Machine Learning Research
Volume120
StatePublished - 2020
Externally publishedYes
Event2nd Annual Conference on Learning for Dynamics and Control, L4DC 2020 - Berkeley, United States
Duration: Jun 10 2020Jun 11 2020

All Science Journal Classification (ASJC) codes

  • Artificial Intelligence
  • Software
  • Control and Systems Engineering
  • Statistics and Probability

Keywords

  • Markov decision process
  • exponentiated gradient
  • online learning
  • primal-dual method
  • regret analysis
  • reinforcement learning
  • saddle point

Fingerprint

Dive into the research topics of 'A Duality Approach for Regret Minimization in Average-Reward Ergodic Markov Decision Processes'. Together they form a unique fingerprint.

Cite this