Learning Adversarial Markov Decision Processes with Bandit Feedback and Unknown Transition

  • Chi Jin
  • Tiancheng Jin
  • Haipeng Luo
  • Suvrit Sra
  • Tiancheng Yu

Research output: Contribution to journal › Conference article › peer-review

12 Scopus citations

Abstract

We consider the task of learning in episodic finite-horizon Markov decision processes with an unknown transition function, bandit feedback, and adversarial losses. We propose an efficient algorithm that achieves Õ(L|X|√(|A|T)) regret with high probability, where L is the horizon, |X| the number of states, |A| the number of actions, and T the number of episodes. To our knowledge, our algorithm is the first to ensure Õ(√T) regret in this challenging setting; in fact, it achieves the same regret as Rosenberg & Mansour (2019a), who consider the easier full-information setting. Our key contributions are two-fold: a tighter confidence set for the transition function, and an optimistic loss estimator that is inversely weighted by an upper occupancy bound.
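The second contribution can be illustrated with a minimal sketch. Under bandit feedback the learner sees losses only for the state-action pairs it visits in an episode, so each observed loss is divided by an upper bound on the probability of visiting that pair, and unvisited pairs get an estimate of zero. Everything named below is an assumption made for illustration and does not come from the abstract: the function `optimistic_loss_estimate`, the dictionary layout, and the small bias parameter `gamma` added to the denominator. Computing the upper occupancy bound itself, which would require maximizing over the confidence set of transition functions, is not shown.

```python
from typing import Dict, Tuple

StateAction = Tuple[int, int]  # hypothetical encoding: (state index, action index)


def optimistic_loss_estimate(
    observed_losses: Dict[StateAction, float],
    upper_occupancy: Dict[StateAction, float],
    gamma: float,
) -> Dict[StateAction, float]:
    """Inverse-weighted loss estimate for one episode (illustrative sketch).

    observed_losses: losses seen on the pairs actually visited this episode.
    upper_occupancy: assumed-given upper bound on each pair's visit probability.
    gamma: assumed positive bias term that keeps the denominator away from zero.
    """
    estimates: Dict[StateAction, float] = {}
    for pair, bound in upper_occupancy.items():
        if pair in observed_losses:
            # Visited pair: scale the observed loss by the inverse of the bound.
            estimates[pair] = observed_losses[pair] / (bound + gamma)
        else:
            # Unvisited pair: no feedback, so the estimate is zero.
            estimates[pair] = 0.0
    return estimates


# Example with made-up numbers: only the pair (0, 1) was visited.
example = optimistic_loss_estimate(
    observed_losses={(0, 1): 0.5},
    upper_occupancy={(0, 0): 0.4, (0, 1): 0.5},
    gamma=0.5,
)
```

Because the denominator is an upper bound on the true visit probability rather than the probability itself, the estimate tends to underestimate the true loss in expectation, which is the sense in which the abstract calls it optimistic.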

Original language: English (US)
Journal: Proceedings of Machine Learning Research
Volume: 119
State: Published - 2020
Event: 37th International Conference on Machine Learning, ICML 2020 - Virtual, Online
Duration: Jul 13 2020 to Jul 18 2020

All Science Journal Classification (ASJC) codes

  • Software
  • Control and Systems Engineering
  • Statistics and Probability
  • Artificial Intelligence
