TY - JOUR
T1 - Breaking the sample size barrier in model-based reinforcement learning with a generative model
AU - Li, Gen
AU - Wei, Yuting
AU - Chi, Yuejie
AU - Gu, Yuantao
AU - Chen, Yuxin
N1 - Funding Information:
G. Li and Y. Gu are supported in part by the grant NSFC-61971266. Y. Wei is supported in part by the grants NSF CCF-2007911 and DMS-2015447. Y. Chi is supported in part by the grants ONR N00014-18-1-2142 and N00014-19-1-2404, ARO W911NF-18-1-0303, and NSF CCF-1806154 and CCF-2007911. Y. Chen is supported in part by the grants AFOSR YIP award FA9550-19-1-0030, ONR N00014-19-1-2120, ARO YIP award W911NF-20-1-0097, ARO W911NF-18-1-0303, NSF CCF-1907661, DMS-2014279 and IIS-1900140. We thank Qiwen Cui for pointing out an issue in the proof of Lemma 4 in an early version of this paper, and thank Shicong Cen, Chen Cheng and Cong Ma for numerous discussions.
Publisher Copyright:
© 2020 Neural information processing systems foundation. All rights reserved.
PY - 2020
Y1 - 2020
N2 - We investigate the sample efficiency of reinforcement learning in a γ-discounted infinite-horizon Markov decision process (MDP) with state space S and action space A, assuming access to a generative model. Despite a number of prior works tackling this problem, a complete picture of the trade-offs between sample complexity and statistical accuracy is yet to be determined. In particular, prior results suffer from a sample size barrier, in the sense that their claimed statistical guarantees hold only when the sample size exceeds at least |S||A|/(1-γ)² (up to some log factor). The current paper overcomes this barrier by certifying the minimax optimality of model-based reinforcement learning as soon as the sample size exceeds the order of |S||A|/(1-γ) (modulo some log factor). More specifically, a perturbed model-based planning algorithm provably finds an ε-optimal policy with an order of |S||A|/((1-γ)³ε²) log(|S||A|/((1-γ)ε)) samples for any ε ∈ (0, 1/(1-γ)]. Along the way, we derive improved (instance-dependent) guarantees for model-based policy evaluation. To the best of our knowledge, this work provides the first minimax-optimal guarantee in a generative model that accommodates the entire range of sample sizes (beyond which finding a meaningful policy is information theoretically impossible).
AB - We investigate the sample efficiency of reinforcement learning in a γ-discounted infinite-horizon Markov decision process (MDP) with state space S and action space A, assuming access to a generative model. Despite a number of prior works tackling this problem, a complete picture of the trade-offs between sample complexity and statistical accuracy is yet to be determined. In particular, prior results suffer from a sample size barrier, in the sense that their claimed statistical guarantees hold only when the sample size exceeds at least |S||A|/(1-γ)² (up to some log factor). The current paper overcomes this barrier by certifying the minimax optimality of model-based reinforcement learning as soon as the sample size exceeds the order of |S||A|/(1-γ) (modulo some log factor). More specifically, a perturbed model-based planning algorithm provably finds an ε-optimal policy with an order of |S||A|/((1-γ)³ε²) log(|S||A|/((1-γ)ε)) samples for any ε ∈ (0, 1/(1-γ)]. Along the way, we derive improved (instance-dependent) guarantees for model-based policy evaluation. To the best of our knowledge, this work provides the first minimax-optimal guarantee in a generative model that accommodates the entire range of sample sizes (beyond which finding a meaningful policy is information theoretically impossible).
UR - http://www.scopus.com/inward/record.url?scp=85108401182&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85108401182&partnerID=8YFLogxK
M3 - Conference article
AN - SCOPUS:85108401182
SN - 1049-5258
VL - 2020-December
JO - Advances in Neural Information Processing Systems
JF - Advances in Neural Information Processing Systems
T2 - 34th Conference on Neural Information Processing Systems, NeurIPS 2020
Y2 - 6 December 2020 through 12 December 2020
ER -