TY - JOUR
T1 - Provably Efficient Reinforcement Learning with Linear Function Approximation
AU - Jin, Chi
AU - Yang, Zhuoran
AU - Wang, Zhaoran
AU - Jordan, Michael I.
N1 - Funding Information:
Funding: This work was supported by the Defense Advanced Research Projects Agency program on Lifelong Learning Machines.
Publisher Copyright:
© 2023 INFORMS.
PY - 2023/8
Y1 - 2023/8
N2 - Modern reinforcement learning (RL) is commonly applied to practical problems with an enormous number of states, where function approximation must be deployed to approximate either the value function or the policy. The introduction of function approximation raises a fundamental set of challenges involving computational and statistical efficiency, especially given the need to manage the exploration/exploitation trade-off. As a result, a core RL question remains open: how can we design provably efficient RL algorithms that incorporate function approximation? This question persists even in a basic settingwith linear dynamics and linear rewards, forwhich only linear function approximation is needed. This paper presents the first provable RL algorithm with both polynomial run time and polynomial sample complexity in this linear setting, without requiring a "simulator"or additional assumptions. Concretely, we prove that an optimistic modification of least-squares value iteration-a classical algorithm frequently studied in the linear setting-achieves O (√ d3H3T ) regret, where d is the ambient dimension of feature space, H is the length of each episode, and T is the total number of steps. Importantly, such regret is independent of the number of states and actions.
AB - Modern reinforcement learning (RL) is commonly applied to practical problems with an enormous number of states, where function approximation must be deployed to approximate either the value function or the policy. The introduction of function approximation raises a fundamental set of challenges involving computational and statistical efficiency, especially given the need to manage the exploration/exploitation trade-off. As a result, a core RL question remains open: how can we design provably efficient RL algorithms that incorporate function approximation? This question persists even in a basic settingwith linear dynamics and linear rewards, forwhich only linear function approximation is needed. This paper presents the first provable RL algorithm with both polynomial run time and polynomial sample complexity in this linear setting, without requiring a "simulator"or additional assumptions. Concretely, we prove that an optimistic modification of least-squares value iteration-a classical algorithm frequently studied in the linear setting-achieves O (√ d3H3T ) regret, where d is the ambient dimension of feature space, H is the length of each episode, and T is the total number of steps. Importantly, such regret is independent of the number of states and actions.
KW - episodic MDP
KW - exploration
KW - linear function approximation
KW - reinforcement learning
UR - http://www.scopus.com/inward/record.url?scp=85168834905&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85168834905&partnerID=8YFLogxK
U2 - 10.1287/moor.2022.1309
DO - 10.1287/moor.2022.1309
M3 - Article
AN - SCOPUS:85168834905
SN - 0364-765X
VL - 48
SP - 1496
EP - 1521
JO - Mathematics of Operations Research
JF - Mathematics of Operations Research
IS - 3
ER -