TY - JOUR
T1 - Sample complexity of asynchronous Q-learning
T2 - 34th Conference on Neural Information Processing Systems, NeurIPS 2020
AU - Li, Gen
AU - Wei, Yuting
AU - Chi, Yuejie
AU - Gu, Yuantao
AU - Chen, Yuxin
N1 - Funding Information:
G. Li and Y. Gu are supported in part by the grant NSFC-61971266. Y. Wei is supported in part by the grants NSF CCF-2007911 and DMS-2015447. Y. Chi is supported in part by the grants ONR N00014-18-1-2142 and N00014-19-1-2404, ARO W911NF-18-1-0303, and NSF CCF-1806154 and CCF-2007911. Y. Chen is supported in part by the grants AFOSR YIP award FA9550-19-1-0030, ONR N00014-19-1-2120, ARO YIP award W911NF-20-1-0097, ARO W911NF-18-1-0303, NSF CCF-1907661, DMS-2014279 and IIS-1900140.
Publisher Copyright:
© 2020 Neural Information Processing Systems Foundation. All rights reserved.
PY - 2020
Y1 - 2020
N2 - Asynchronous Q-learning aims to learn the optimal action-value function (or Q-function) of a Markov decision process (MDP), based on a single trajectory of Markovian samples induced by a behavior policy. Focusing on a γ-discounted MDP with state space S and action space A, we demonstrate that the ℓ∞-based sample complexity of classical asynchronous Q-learning — namely, the number of samples needed to yield an entrywise ε-accurate estimate of the Q-function — is at most on the order of 1/(µmin(1 − γ)^5 ε^2) + tmix/(µmin(1 − γ)) up to some logarithmic factor, provided that a proper constant learning rate is adopted. Here, tmix and µmin denote respectively the mixing time and the minimum state-action occupancy probability of the sample trajectory. The first term of this bound matches the complexity in the case with independent samples drawn from the stationary distribution of the trajectory. The second term reflects the cost taken for the empirical distribution of the Markovian trajectory to reach a steady state, which is incurred at the very beginning and becomes amortized as the algorithm runs. Encouragingly, the above bound improves upon the state-of-the-art result by a factor of at least |S||A|. Further, the scaling on the discount complexity can be improved by means of variance reduction.
AB - Asynchronous Q-learning aims to learn the optimal action-value function (or Q-function) of a Markov decision process (MDP), based on a single trajectory of Markovian samples induced by a behavior policy. Focusing on a γ-discounted MDP with state space S and action space A, we demonstrate that the ℓ∞-based sample complexity of classical asynchronous Q-learning — namely, the number of samples needed to yield an entrywise ε-accurate estimate of the Q-function — is at most on the order of 1/(µmin(1 − γ)^5 ε^2) + tmix/(µmin(1 − γ)) up to some logarithmic factor, provided that a proper constant learning rate is adopted. Here, tmix and µmin denote respectively the mixing time and the minimum state-action occupancy probability of the sample trajectory. The first term of this bound matches the complexity in the case with independent samples drawn from the stationary distribution of the trajectory. The second term reflects the cost taken for the empirical distribution of the Markovian trajectory to reach a steady state, which is incurred at the very beginning and becomes amortized as the algorithm runs. Encouragingly, the above bound improves upon the state-of-the-art result by a factor of at least |S||A|. Further, the scaling on the discount complexity can be improved by means of variance reduction.
UR - http://www.scopus.com/inward/record.url?scp=85105080273&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85105080273&partnerID=8YFLogxK
M3 - Conference article
AN - SCOPUS:85105080273
SN - 1049-5258
VL - 2020-December
JO - Advances in Neural Information Processing Systems
JF - Advances in Neural Information Processing Systems
Y2 - 6 December 2020 through 12 December 2020
ER -