Abstract
Temporal difference learning (TD), coupled with neural networks, is among the most fundamental building blocks of deep reinforcement learning. However, because of the nonlinearity in value function approximation, such a coupling leads to nonconvexity and even divergence in optimization. As a result, the global convergence of neural TD remains unclear. In this paper, we prove for the first time that neural TD converges at a sublinear rate to the global optimum of the mean-squared projected Bellman error for policy evaluation. In particular, we show how such global convergence is enabled by the overparameterization of neural networks, which also plays a vital role in the empirical success of neural TD. We establish the theory for two-layer neural networks in the main paper and extend it to multilayer neural networks in the appendix. Beyond policy evaluation, we establish the global convergence of neural (soft) Q learning.
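For readers less familiar with the algorithm studied, the following is a minimal sketch of semi-gradient TD(0) with a two-layer ReLU value network, run on a toy random-walk MDP. The environment, feature map, network width, step size, and the choice to train only the first-layer weights (loosely mirroring the overparameterized regime analyzed in the paper) are illustrative assumptions, not the paper's setting or experiments.

```python
import numpy as np

# Sketch: semi-gradient TD(0) with a two-layer ReLU network on a toy
# random-walk MDP. All constants below are illustrative assumptions.

rng = np.random.default_rng(0)

n_states, d, m, gamma, lr = 5, 8, 256, 0.9, 0.05
features = rng.normal(size=(n_states, d)) / np.sqrt(d)  # fixed state features

# Two-layer network V(s) = (1/sqrt(m)) * sum_r b_r * relu(w_r . x_s).
# Only the first layer W is trained; the output signs b are fixed.
W = rng.normal(size=(m, d))
b = rng.choice([-1.0, 1.0], size=m)

def value(x):
    return (b * np.maximum(W @ x, 0.0)).sum() / np.sqrt(m)

def grad_value(x):
    # Gradient of V with respect to W (ReLU subgradient at 0 taken as 0).
    act = (W @ x > 0.0).astype(float)
    return (b * act)[:, None] * x[None, :] / np.sqrt(m)

for step in range(5000):
    s = rng.integers(n_states)                     # sample a state
    s_next = (s + rng.choice([-1, 1])) % n_states  # random-walk transition
    r = 1.0 if s_next == 0 else 0.0                # illustrative reward
    x, x_next = features[s], features[s_next]
    # Semi-gradient TD(0): bootstrap target is not differentiated through.
    delta = r + gamma * value(x_next) - value(x)
    W += lr * delta * grad_value(x)

print("learned values:", [round(value(features[s]), 3) for s in range(n_states)])
```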
| Field | Value |
|---|---|
| Original language | English (US) |
| Pages (from-to) | 619-651 |
| Number of pages | 33 |
| Journal | Mathematics of Operations Research |
| Volume | 49 |
| Issue number | 1 |
| DOIs | |
| State | Published - Feb 2024 |
All Science Journal Classification (ASJC) codes
- General Mathematics
- Computer Science Applications
- Management Science and Operations Research
Keywords
- overparameterized neural network
- reinforcement learning
- temporal difference learning