TY - JOUR

T1 - Sample complexity of asynchronous Q-learning

T2 - 34th Conference on Neural Information Processing Systems, NeurIPS 2020

AU - Li, Gen

AU - Wei, Yuting

AU - Chi, Yuejie

AU - Gu, Yuantao

AU - Chen, Yuxin

N1 - Funding Information:
G. Li and Y. Gu are supported in part by the grant NSFC-61971266. Y. Wei is supported in part by the grants NSF CCF-2007911 and DMS-2015447. Y. Chi is supported in part by the grants ONR N00014-18-1-2142 and N00014-19-1-2404, ARO W911NF-18-1-0303, and NSF CCF-1806154 and CCF-2007911. Y. Chen is supported in part by the grants AFOSR YIP award FA9550-19-1-0030, ONR N00014-19-1-2120, ARO YIP award W911NF-20-1-0097, ARO W911NF-18-1-0303, NSF CCF-1907661, DMS-2014279 and IIS-1900140.
Publisher Copyright:
© 2020 Neural information processing systems foundation. All rights reserved.

PY - 2020

Y1 - 2020

N2 - Asynchronous Q-learning aims to learn the optimal action-value function (or Q-function) of a Markov decision process (MDP), based on a single trajectory of Markovian samples induced by a behavior policy. Focusing on a ?-discounted MDP with state space S and action space A, we demonstrate that the l8-based sample complexity of classical asynchronous Q-learning — namely, the number of samples needed to yield an entrywise e-accurate estimate of the Q-function — is at most on the order of 1 tmix µmin(1 - ?)5e2 + µmin(1 - ?) up to some logarithmic factor, provided that a proper constant learning rate is adopted. Here, tmix and µmin denote respectively the mixing time and the minimum state-action occupancy probability of the sample trajectory. The first term of this bound matches the complexity in the case with independent samples drawn from the stationary distribution of the trajectory. The second term reflects the expense taken for the empirical distribution of the Markovian trajectory to reach a steady state, which is incurred at the very beginning and becomes amortized as the algorithm runs. Encouragingly, the above bound improves upon the state-of-the-art result by a factor of at least |S||A|. Further, the scaling on the discount complexity can be improved by means of variance reduction.

AB - Asynchronous Q-learning aims to learn the optimal action-value function (or Q-function) of a Markov decision process (MDP), based on a single trajectory of Markovian samples induced by a behavior policy. Focusing on a ?-discounted MDP with state space S and action space A, we demonstrate that the l8-based sample complexity of classical asynchronous Q-learning — namely, the number of samples needed to yield an entrywise e-accurate estimate of the Q-function — is at most on the order of 1 tmix µmin(1 - ?)5e2 + µmin(1 - ?) up to some logarithmic factor, provided that a proper constant learning rate is adopted. Here, tmix and µmin denote respectively the mixing time and the minimum state-action occupancy probability of the sample trajectory. The first term of this bound matches the complexity in the case with independent samples drawn from the stationary distribution of the trajectory. The second term reflects the expense taken for the empirical distribution of the Markovian trajectory to reach a steady state, which is incurred at the very beginning and becomes amortized as the algorithm runs. Encouragingly, the above bound improves upon the state-of-the-art result by a factor of at least |S||A|. Further, the scaling on the discount complexity can be improved by means of variance reduction.

UR - http://www.scopus.com/inward/record.url?scp=85105080273&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=85105080273&partnerID=8YFLogxK

M3 - Conference article

AN - SCOPUS:85105080273

SN - 1049-5258

VL - 2020-December

JO - Advances in Neural Information Processing Systems

JF - Advances in Neural Information Processing Systems

Y2 - 6 December 2020 through 12 December 2020

ER -