TY - GEN

T1 - Reinforcement learning in feature space

T2 - 37th International Conference on Machine Learning, ICML 2020

AU - Yang, Lin F.

AU - Wang, Mengdi

N1 - Funding Information:
Mengdi Wang gratefully acknowledges funding from the U.S. National Science Foundation (NSF) grant CMMI-1653435, Air Force Office of Scientific Research (AFOSR) grant FA9550-19-1-020, and C3.ai DTI.
Publisher Copyright:
Copyright 2020 by the author(s).

PY - 2020

Y1 - 2020

N2 - Exploration in reinforcement learning (RL) suffers from the curse of dimensionality when the state-action space is large. A common practice is to parameterize the high-dimensional value and policy functions using given features. However, existing methods either have no theoretical guarantee or suffer a regret that is exponential in the planning horizon H. In this paper, we propose an online RL algorithm, MatrixRL, that leverages ideas from linear bandits to learn a low-dimensional representation of the probability transition model while carefully balancing the exploitation-exploration tradeoff. We show that MatrixRL achieves a regret bound of $O(H^2 d \log T \sqrt{T})$, where d is the number of features, independent of the number of state-action pairs. MatrixRL has an equivalent kernelized version, which can work with an arbitrary kernel Hilbert space without using explicit features. In this case, the kernelized MatrixRL satisfies a regret bound of $O(H^2 \tilde{d} \log T \sqrt{T})$, where $\tilde{d}$ is the effective dimension of the kernel space.

AB - Exploration in reinforcement learning (RL) suffers from the curse of dimensionality when the state-action space is large. A common practice is to parameterize the high-dimensional value and policy functions using given features. However, existing methods either have no theoretical guarantee or suffer a regret that is exponential in the planning horizon H. In this paper, we propose an online RL algorithm, MatrixRL, that leverages ideas from linear bandits to learn a low-dimensional representation of the probability transition model while carefully balancing the exploitation-exploration tradeoff. We show that MatrixRL achieves a regret bound of $O(H^2 d \log T \sqrt{T})$, where d is the number of features, independent of the number of state-action pairs. MatrixRL has an equivalent kernelized version, which can work with an arbitrary kernel Hilbert space without using explicit features. In this case, the kernelized MatrixRL satisfies a regret bound of $O(H^2 \tilde{d} \log T \sqrt{T})$, where $\tilde{d}$ is the effective dimension of the kernel space.

UR - http://www.scopus.com/inward/record.url?scp=85105391101&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=85105391101&partnerID=8YFLogxK

M3 - Conference contribution

AN - SCOPUS:85105391101

T3 - 37th International Conference on Machine Learning, ICML 2020

SP - 10677

EP - 10687

BT - 37th International Conference on Machine Learning, ICML 2020

A2 - Daumé, Hal, III

A2 - Singh, Aarti

PB - International Machine Learning Society (IMLS)

Y2 - 13 July 2020 through 18 July 2020

ER -