TY - JOUR

T1 - On function approximation in reinforcement learning

T2 - 34th Conference on Neural Information Processing Systems, NeurIPS 2020

AU - Yang, Zhuoran

AU - Jin, Chi

AU - Wang, Zhaoran

AU - Wang, Mengdi

AU - Jordan, Michael I.

N1 - Funding Information:
We would like to thank the Simons Institute for the Theory of Computing in Berkeley, where this project was initiated. Zhuoran Yang would like to thank Jianqing Fan, Csaba Szepsvári, Tuo Zhao, Simon Shaolei Du, Ruosong Wang, and Yiping Lu for valuable discussions. Mengdi Wang gratefully acknowledges funding from the U.S. National Science Foundation (NSF) grant CMMI1653435, Air Force Office of Scientific Research (AFOSR) grant FA9550-19-1-020, and C3.ai DTI. Michael Jordan gratefully acknowledges funding from the Mathematical Data Science program of the Office of Naval Research under grant number N00014-18-1-2764.
Funding Information:
We would like to thank the Simons Institute for the Theory of Computing in Berkeley, where this project was initiated. Zhuoran Yang would like to thank Jianqing Fan, Csaba Szepsv?ri, Tuo Zhao, Simon Shaolei Du, Ruosong Wang, and Yiping Lu for valuable discussions. Mengdi Wang gratefully acknowledges funding from the U.S. National Science Foundation (NSF) grant CMMI1653435, Air Force Office of Scientific Research (AFOSR) grant FA9550-19-1-020, and C3.ai DTI. Michael Jordan gratefully acknowledges funding from the Mathematical Data Science program of the Office of Naval Research under grant number N00014-18-1-2764.
Publisher Copyright:
© 2020 Neural information processing systems foundation. All rights reserved.

PY - 2020

Y1 - 2020

N2 - The classical theory of reinforcement learning (RL) has focused on tabular and linear representations of value functions. Further progress hinges on combining RL with modern function approximators such as kernel functions and deep neural networks, and indeed there have been many empirical successes that have exploited such combinations in large-scale applications. There are profound challenges, however, in developing a theory to support this enterprise, most notably the need to take into consideration the exploration-exploitation tradeoff at the core of RL in conjunction with the computational and statistical tradeoffs that arise in modern function-approximation-based learning systems. We approach these challenges by studying an optimistic modification of the least-squares value iteration algorithm, in the context of the action-value function represented by a kernel function or an overparameterized neural network. We establish both polynomial runtime complexity and polynomial sample complexity for this algorithm, without additional assumptions on the data-generating model. In particular, we prove that the algorithm incurs an Oe(dFH2pT) regret, where dF characterizes the intrinsic complexity of the function class F, H is the length of each episode, and T is the total number of episodes. Our regret bounds are independent of the number of states, a result which exhibits clearly the benefit of function approximation in RL.

AB - The classical theory of reinforcement learning (RL) has focused on tabular and linear representations of value functions. Further progress hinges on combining RL with modern function approximators such as kernel functions and deep neural networks, and indeed there have been many empirical successes that have exploited such combinations in large-scale applications. There are profound challenges, however, in developing a theory to support this enterprise, most notably the need to take into consideration the exploration-exploitation tradeoff at the core of RL in conjunction with the computational and statistical tradeoffs that arise in modern function-approximation-based learning systems. We approach these challenges by studying an optimistic modification of the least-squares value iteration algorithm, in the context of the action-value function represented by a kernel function or an overparameterized neural network. We establish both polynomial runtime complexity and polynomial sample complexity for this algorithm, without additional assumptions on the data-generating model. In particular, we prove that the algorithm incurs an Oe(dFH2pT) regret, where dF characterizes the intrinsic complexity of the function class F, H is the length of each episode, and T is the total number of episodes. Our regret bounds are independent of the number of states, a result which exhibits clearly the benefit of function approximation in RL.

UR - http://www.scopus.com/inward/record.url?scp=85108445437&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=85108445437&partnerID=8YFLogxK

M3 - Conference article

AN - SCOPUS:85108445437

SN - 1049-5258

VL - 2020-December

JO - Advances in Neural Information Processing Systems

JF - Advances in Neural Information Processing Systems

Y2 - 6 December 2020 through 12 December 2020

ER -