Bias-corrected Q-learning with multistate extension

Donghun Lee, Warren B. Powell

Research output: Contribution to journal › Article › peer-review

9 Scopus citations


Q-learning is a sample-based, model-free algorithm that solves Markov decision problems asymptotically, but in finite time it can perform poorly when random rewards and transitions result in large variance of the value estimates. We pinpoint the cause of this poor performance to be the estimation bias due to the maximum operator in the Q-learning algorithm, and present evidence of max-operator bias in its Q-value estimates. We then present an asymptotically optimal bias-correction strategy and construct an extension of the bias-corrected Q-learning algorithm to multistate Markov decision processes, with asymptotic convergence properties as strong as those of Q-learning. We report the empirical performance of the bias-corrected Q-learning algorithm with multistate extension in two model problems: a multiarmed bandit version of Roulette and an electricity storage control simulation. The bias-corrected Q-learning algorithm with multistate extension is shown to control max-operator bias effectively, and its bias resistance can be tuned predictably by adjusting a correction parameter.
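The max-operator bias named in the abstract can be illustrated with a small simulation. The sketch below is not the authors' algorithm or their correction strategy. It only shows, under assumed conditions (every action has a true mean reward of 0 with unit-variance Gaussian noise), that taking the maximum over noisy sample-mean estimates tends to overestimate the true maximum. The function names, parameters, and noise model are hypothetical choices made for illustration.

```python
import random
import statistics


def max_of_sample_means(num_actions, samples_per_action, rng):
    """Return the maximum over actions of the sample-mean reward estimate.

    Assumption for illustration: every action has true mean reward 0 and
    unit-variance Gaussian noise, so the true maximum expected reward is 0.
    """
    means = []
    for _ in range(num_actions):
        rewards = [rng.gauss(0.0, 1.0) for _ in range(samples_per_action)]
        means.append(statistics.fmean(rewards))
    return max(means)


def estimate_max_bias(num_actions=10, samples_per_action=5, trials=2000, seed=0):
    """Average the max-of-estimates over many independent trials.

    Because the true maximum is 0, this average is an empirical estimate
    of the bias introduced by the max operator.
    """
    rng = random.Random(seed)
    results = [
        max_of_sample_means(num_actions, samples_per_action, rng)
        for _ in range(trials)
    ]
    return statistics.fmean(results)


# With several actions, the max of noisy estimates is biased upward.
bias_many = estimate_max_bias(num_actions=10)
# With a single action there is nothing to maximize over, so no such bias.
bias_one = estimate_max_bias(num_actions=1)

print(f"estimated bias with 10 actions: {bias_many:.3f}")
print(f"estimated bias with 1 action:   {bias_one:.3f}")
```

In this toy setting the bias grows with the number of actions and with the noise in each estimate, which is consistent with the abstract's point that high-variance rewards and transitions hurt finite-time performance.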

Original language: English (US)
Article number: 8695133
Pages (from-to): 4011-4023
Number of pages: 13
Journal: IEEE Transactions on Automatic Control
Issue number: 10
State: Published - Oct 2019

All Science Journal Classification (ASJC) codes

  • Control and Systems Engineering
  • Computer Science Applications
  • Electrical and Electronic Engineering


Keywords

  • Bias correction
  • Electricity storage
  • Q-learning
  • Smart grid

