Reinforcement learning in feature space: Matrix bandit, kernels, and regret bound

Lin F. Yang, Mengdi Wang

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

Abstract

Exploration in reinforcement learning (RL) suffers from the curse of dimensionality when the state-action space is large. A common practice is to parameterize the high-dimensional value and policy functions using given features. However, existing methods either have no theoretical guarantee or suffer a regret that is exponential in the planning horizon $H$. In this paper, we propose an online RL algorithm, MatrixRL, that leverages ideas from linear bandits to learn a low-dimensional representation of the probability transition model while carefully balancing the exploitation-exploration tradeoff. We show that MatrixRL achieves a regret bound of $O(H^2 d \log T \sqrt{T})$, where $d$ is the number of features, independent of the number of state-action pairs. MatrixRL has an equivalent kernelized version, which is able to work with an arbitrary kernel Hilbert space without using explicit features. In this case, the kernelized MatrixRL satisfies a regret bound of $O(H^2 d_e \log T \sqrt{T})$, where $d_e$ is the effective dimension of the kernel space.
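
To make the scaling in the stated bound concrete, the following minimal sketch evaluates the growth term H^2 · d · log T · √T for a few values of T. It is only an illustration of the formula quoted in the abstract, not part of MatrixRL itself: the constants hidden by the O(·) notation are omitted, the function names are hypothetical, and the values chosen for H and d are arbitrary assumptions. Note that the number of state-action pairs does not appear anywhere in the computation.

```python
import math


def regret_growth(H: int, d: float, T: int) -> float:
    """Growth term H^2 * d * log(T) * sqrt(T) from the abstract's bound.

    Constants hidden by the O(.) notation are omitted, so the value is
    meaningful only as a growth rate, not as an actual regret.
    """
    return H**2 * d * math.log(T) * math.sqrt(T)


def average_regret_growth(H: int, d: float, T: int) -> float:
    """Growth term divided by T; it shrinks as T grows (for T > e^2)."""
    return regret_growth(H, d, T) / T


if __name__ == "__main__":
    H, d = 10, 20  # hypothetical horizon and feature count
    for T in (10**4, 10**6, 10**8):
        total = regret_growth(H, d, T)
        average = average_regret_growth(H, d, T)
        print(f"T={T:>9}  growth={total:.3e}  growth/T={average:.3e}")
```

Because the growth term is sublinear in T, dividing it by T gives a quantity that decreases toward zero. For the kernelized bound, the same sketch applies with d replaced by the effective dimension of the kernel space.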

Original language: English (US)
Title of host publication: 37th International Conference on Machine Learning, ICML 2020
Editors: Hal Daume, Aarti Singh
Publisher: International Machine Learning Society (IMLS)
Pages: 10677-10687
Number of pages: 11
ISBN (Electronic): 9781713821120
State: Published - 2020
Event: 37th International Conference on Machine Learning, ICML 2020 - Virtual, Online
Duration: Jul 13 2020 to Jul 18 2020

Publication series

Name: 37th International Conference on Machine Learning, ICML 2020
Volume: PartF168147-14

Conference

Conference: 37th International Conference on Machine Learning, ICML 2020
City: Virtual, Online
Period: 7/13/20 to 7/18/20

All Science Journal Classification (ASJC) codes

  • Computational Theory and Mathematics
  • Human-Computer Interaction
  • Software
