Neural Temporal Difference and Q Learning Provably Converge to Global Optima

Qi Cai, Zhuoran Yang, Jason D. Lee, Zhaoran Wang

Research output: Contribution to journal › Article › peer-review


Temporal difference learning (TD), coupled with neural networks, is among the most fundamental building blocks of deep reinforcement learning. However, because of the nonlinearity in value function approximation, such a coupling leads to nonconvexity and even divergence in optimization. As a result, the global convergence of neural TD remains unclear. In this paper, we prove for the first time that neural TD converges at a sublinear rate to the global optimum of the mean-squared projected Bellman error for policy evaluation. In particular, we show how such global convergence is enabled by the overparameterization of neural networks, which also plays a vital role in the empirical success of neural TD. We establish the theory for two-layer neural networks in the main paper and extend it to multilayer neural networks in the appendix. Beyond policy evaluation, we establish the global convergence of neural (soft) Q learning.
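As a rough illustration of the setting the abstract describes, the sketch below runs semi-gradient TD(0) with an overparameterized two-layer ReLU network on a toy two-state Markov reward process. The environment, hyperparameters, and the convention of training only the input-layer weights (with random fixed output weights) are illustrative assumptions, not the paper's exact construction or proof setup.

```python
import numpy as np

rng = np.random.default_rng(0)

# Two-layer network V(s) = (1/sqrt(m)) * sum_r b_r * relu(w_r . s).
# The output weights b are fixed at random signs, as is common in
# overparameterization analyses; only W is trained.
m, d = 64, 2                        # network width and feature dimension
W = rng.normal(size=(m, d))         # trainable input-layer weights
b = rng.choice([-1.0, 1.0], m)      # fixed random output weights

def value(s, W):
    h = np.maximum(W @ s, 0.0)      # ReLU hidden layer
    return b @ h / np.sqrt(m)

def grad_value(s, W):
    # Gradient of V(s) w.r.t. W: row r is b_r * 1{w_r . s > 0} * s.
    act = (W @ s > 0).astype(float)
    return (b * act)[:, None] * s[None, :] / np.sqrt(m)

# Toy Markov reward process (an assumption for this sketch): two one-hot
# states with a deterministic cycle 0 -> 1 -> 0 and rewards (1, 0).
states = [np.array([1.0, 0.0]), np.array([0.0, 1.0])]
rewards = [1.0, 0.0]
gamma, lr = 0.9, 0.1

s_idx = 0
for _ in range(2000):
    s, s_next = states[s_idx], states[1 - s_idx]
    # Semi-gradient TD(0): move V(s) toward the bootstrapped target
    # r + gamma * V(s'), differentiating only through V(s).
    delta = rewards[s_idx] + gamma * value(s_next, W) - value(s, W)
    W += lr * delta * grad_value(s, W)
    s_idx = 1 - s_idx
```

On this deterministic chain the TD fixed point coincides with the true values V(s0) = 1/(1 - gamma^2) and V(s1) = gamma/(1 - gamma^2), so the learned network should land close to them, consistent with the global-convergence result the abstract states for the projected Bellman error.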

Original language: English (US)
Pages (from-to): 619-651
Number of pages: 33
Journal: Mathematics of Operations Research
Issue number: 1
State: Published - Feb 2024
Externally published: Yes

All Science Journal Classification (ASJC) codes

  • General Mathematics
  • Computer Science Applications
  • Management Science and Operations Research


Keywords

  • overparameterized neural network
  • reinforcement learning
  • temporal difference learning


