Neural Temporal Difference and Q Learning Provably Converge to Global Optima

Qi Cai, Zhuoran Yang, Jason D. Lee, Zhaoran Wang

Research output: Contribution to journal › Article › peer-review

Abstract

Temporal difference learning (TD), coupled with neural networks, is among the most fundamental building blocks of deep reinforcement learning. However, because of the nonlinearity in value function approximation, such a coupling leads to nonconvexity and even divergence in optimization. As a result, the global convergence of neural TD remains unclear. In this paper, we prove for the first time that neural TD converges at a sublinear rate to the global optimum of the mean-squared projected Bellman error for policy evaluation. In particular, we show how such global convergence is enabled by the overparameterization of neural networks, which also plays a vital role in the empirical success of neural TD. We establish the theory for two-layer neural networks in the main paper and extend it to multilayer neural networks in the appendix. Beyond policy evaluation, we establish the global convergence of neural (soft) Q learning.
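To make the setting in the abstract concrete, below is a minimal sketch of semi-gradient TD(0) policy evaluation with an overparameterized two-layer ReLU network. It is not the paper's algorithm or analysis; the environment, width m, step size alpha, and discount gamma are illustrative placeholders, and training only the first-layer weights is a common convention in this line of work rather than a claim about the paper.

```python
# Hypothetical sketch: neural TD(0) with a two-layer ReLU value network.
# All constants and the toy dynamics below are assumptions for illustration.
import numpy as np

rng = np.random.default_rng(0)

d, m = 4, 1024          # state dimension, network width (overparameterized)
gamma, alpha = 0.9, 1e-3

# Two-layer network V(s) = (1/sqrt(m)) * sum_r b_r * relu(w_r . s).
# Only the first-layer weights W are trained in this sketch.
W = rng.normal(size=(m, d))
b = rng.choice([-1.0, 1.0], size=m)

def value(s, W):
    """Value estimate V(s) of the two-layer network."""
    return (b * np.maximum(W @ s, 0.0)).sum() / np.sqrt(m)

def grad_value(s, W):
    """Gradient of V(s) w.r.t. W (ReLU subgradient at 0 taken as 0)."""
    act = (W @ s > 0.0).astype(float)                     # (m,)
    return (b * act)[:, None] * s[None, :] / np.sqrt(m)   # (m, d)

def td0_step(s, r, s_next, W):
    """Semi-gradient TD(0) update: delta = r + gamma*V(s') - V(s)."""
    delta = r + gamma * value(s_next, W) - value(s, W)
    return W + alpha * delta * grad_value(s, W)           # no grad through target

# Toy usage on random transitions (placeholder dynamics and reward).
for _ in range(1000):
    s = rng.normal(size=d)
    s_next = 0.5 * s + 0.1 * rng.normal(size=d)
    r = float(s[0])
    W = td0_step(s, r, s_next, W)
```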

Original language: English (US)
Pages (from-to): 619-651
Number of pages: 33
Journal: Mathematics of Operations Research
Volume: 49
Issue number: 1
DOIs
State: Published - Feb 2024
Externally published: Yes

All Science Journal Classification (ASJC) codes

  • General Mathematics
  • Computer Science Applications
  • Management Science and Operations Research

Keywords

  • overparameterized neural network
  • reinforcement learning
  • temporal difference learning
