Breaking the sample size barrier in model-based reinforcement learning with a generative model

Gen Li, Yuting Wei, Yuejie Chi, Yuantao Gu, Yuxin Chen

Research output: Contribution to journal › Conference article › peer-review

Abstract

We investigate the sample efficiency of reinforcement learning in a $\gamma$-discounted infinite-horizon Markov decision process (MDP) with state space $\mathcal{S}$ and action space $\mathcal{A}$, assuming access to a generative model. Despite a number of prior works tackling this problem, a complete picture of the trade-offs between sample complexity and statistical accuracy is yet to be determined. In particular, prior results suffer from a sample size barrier, in the sense that their claimed statistical guarantees hold only when the sample size exceeds at least $\frac{|\mathcal{S}||\mathcal{A}|}{(1-\gamma)^2}$ (up to some log factor). The current paper overcomes this barrier by certifying the minimax optimality of model-based reinforcement learning as soon as the sample size exceeds the order of $\frac{|\mathcal{S}||\mathcal{A}|}{1-\gamma}$ (modulo some log factor). More specifically, a perturbed model-based planning algorithm provably finds an $\varepsilon$-optimal policy with an order of $\frac{|\mathcal{S}||\mathcal{A}|}{(1-\gamma)^3\varepsilon^2}\log\frac{|\mathcal{S}||\mathcal{A}|}{(1-\gamma)\varepsilon}$ samples for any $\varepsilon \in (0, \frac{1}{1-\gamma}]$. Along the way, we derive improved (instance-dependent) guarantees for model-based policy evaluation. To the best of our knowledge, this work provides the first minimax-optimal guarantee in a generative model that accommodates the entire range of sample sizes (beyond which finding a meaningful policy is information-theoretically impossible).
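
To make the phrase "perturbed model-based planning" concrete, the following is a minimal Python sketch, not the authors' implementation. It assumes the usual model-based recipe: draw a fixed number of next-state samples per state-action pair from the generative model, form the empirical transition kernel, add a small random perturbation to the rewards, and plan in the resulting empirical MDP. The function names, the uniform noise on [0, xi], the value of xi, the use of value iteration as the planner, and the two-state toy MDP are all illustrative assumptions that do not come from the abstract.

```python
import random


def estimate_model(sample_next_state, n_states, n_actions, n_samples):
    """Empirical transition kernel from n_samples draws per (state, action)."""
    p_hat = []
    for s in range(n_states):
        row = []
        for a in range(n_actions):
            counts = [0] * n_states
            for _ in range(n_samples):
                counts[sample_next_state(s, a)] += 1
            row.append([c / n_samples for c in counts])
        p_hat.append(row)
    return p_hat


def perturb_rewards(rewards, xi, rng):
    """Add independent uniform noise on [0, xi] to each reward (assumed scheme)."""
    return [[r + rng.uniform(0.0, xi) for r in row] for row in rewards]


def q_values(p, r, gamma, v):
    """One Bellman backup: Q(s, a) = r(s, a) + gamma * sum_t p(t | s, a) * v(t)."""
    n = len(v)
    return [[r[s][a] + gamma * sum(p[s][a][t] * v[t] for t in range(n))
             for a in range(len(r[s]))] for s in range(n)]


def plan(p, r, gamma, n_iters=500):
    """Value iteration on the given MDP, then the greedy policy."""
    v = [0.0] * len(r)
    for _ in range(n_iters):
        v = [max(q_s) for q_s in q_values(p, r, gamma, v)]
    q = q_values(p, r, gamma, v)
    policy = [max(range(len(q_s)), key=q_s.__getitem__) for q_s in q]
    return v, policy


# Hypothetical toy MDP: state 1 is absorbing with reward 1; from state 0,
# action 1 reaches state 1 with probability 0.9, action 0 stays in state 0.
rng = random.Random(0)
gamma = 0.9
rewards = [[0.0, 0.0], [1.0, 1.0]]


def sample_next_state(s, a):
    """Stand-in for the generative model: one next-state draw for (s, a)."""
    if s == 0 and a == 1:
        return 1 if rng.random() < 0.9 else 0
    return s


p_hat = estimate_model(sample_next_state, n_states=2, n_actions=2, n_samples=200)
r_pert = perturb_rewards(rewards, xi=1e-3, rng=rng)
v_hat, policy = plan(p_hat, r_pert, gamma)
print("greedy action in state 0:", policy[0])
```

In this toy example the only route to reward from state 0 is action 1, so the planner's choice in state 0 gives a quick sanity check. In this sketch the perturbation only breaks ties between near-equal actions; how it should be scaled, and how many samples per pair are needed, is what the stated sample complexity bound addresses.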

Original language: English (US)
Journal: Advances in Neural Information Processing Systems
Volume: 2020-December
State: Published - 2020
Event: 34th Conference on Neural Information Processing Systems, NeurIPS 2020 - Virtual, Online
Duration: Dec 6 2020 - Dec 12 2020

All Science Journal Classification (ASJC) codes

  • Computer Networks and Communications
  • Information Systems
  • Signal Processing
