Minimax-optimal off-policy evaluation with linear function approximation

Yaqi Duan, Zeyu Jia, Mengdi Wang

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

28 Scopus citations

Abstract

This paper studies the statistical theory of off-policy policy evaluation with function approximation in batch-data reinforcement learning problems. We consider a regression-based fitted Q iteration method, and show that it is equivalent to a model-based method that estimates a conditional mean embedding of the transition operator. We prove that this method is information-theoretically optimal and has nearly minimal estimation error. In particular, by leveraging the contraction property of Markov processes and martingale concentration, we establish a finite-sample instance-dependent error upper bound and a nearly-matching minimax lower bound. The policy evaluation error depends sharply on a restricted χ²-divergence over the function class between the long-term distribution of the target policy and the distribution of past data. This restricted χ²-divergence characterizes the statistical limit of off-policy evaluation, and is both instance-dependent and function-class-dependent. Further, we provide an easily computable confidence bound for the policy evaluator, which may be useful for optimistic planning and safe policy improvement.
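
The equivalence claimed in the abstract, between regression-based fitted Q iteration and a model-based estimate, can be illustrated with a minimal NumPy sketch. This is not the authors' code. It assumes a discounted setting with discount factor `gamma`, a small ridge term for numerical stability, and a toy tabular problem with one-hot features; the function names `fitted_q_evaluation` and `model_based_evaluation` and all data below are hypothetical and chosen only for illustration.

```python
import numpy as np


def fitted_q_evaluation(phi, rewards, phi_next_pi, gamma, n_iters=500, ridge=1e-8):
    """Iterative regression: repeatedly fit Q(s, a) ~ phi(s, a) @ w to Bellman targets.

    phi          (n, d) features of the logged pairs (s_i, a_i)
    rewards      (n,)   observed rewards
    phi_next_pi  (n, d) next-state features averaged over the target policy,
                        i.e. sum_a pi(a | s'_i) * phi(s'_i, a)
    """
    d = phi.shape[1]
    cov = phi.T @ phi + ridge * np.eye(d)  # regularized empirical covariance
    w = np.zeros(d)
    for _ in range(n_iters):
        targets = rewards + gamma * (phi_next_pi @ w)  # Bellman regression targets
        w = np.linalg.solve(cov, phi.T @ targets)      # least-squares refit
    return w


def model_based_evaluation(phi, rewards, phi_next_pi, gamma, ridge=1e-8):
    """Estimate reward weights and a feature-transition matrix, then solve once."""
    d = phi.shape[1]
    cov = phi.T @ phi + ridge * np.eye(d)
    r_hat = np.linalg.solve(cov, phi.T @ rewards)      # estimated reward weights
    m_hat = np.linalg.solve(cov, phi.T @ phi_next_pi)  # (d, d) estimated transition in feature space
    return np.linalg.solve(np.eye(d) - gamma * m_hat, r_hat)


# Toy check on random tabular data with one-hot features (illustrative only).
rng = np.random.default_rng(0)
n_states, n_actions, n, gamma = 3, 2, 400, 0.9
d = n_states * n_actions
eye = np.eye(d)
s = rng.integers(n_states, size=n)
a = rng.integers(n_actions, size=n)      # behavior policy: uniform over actions
s_next = rng.integers(n_states, size=n)  # arbitrary transitions for the demo
rewards = rng.random(n)                  # rewards in [0, 1)
phi = eye[s * n_actions + a]
phi_next_pi = eye[s_next * n_actions + 0]  # target policy: always take action 0

w_fqe = fitted_q_evaluation(phi, rewards, phi_next_pi, gamma)
w_mb = model_based_evaluation(phi, rewards, phi_next_pi, gamma)
print("max |w_fqe - w_mb| =", np.max(np.abs(w_fqe - w_mb)))
```

Under these assumptions the iteration is the fixed-point recursion w ← r̂ + γ M̂ w, so its limit coincides with the direct solve (I − γ M̂)⁻¹ r̂, and the two weight vectors should agree up to numerical error once the iteration has converged.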

Original language: English (US)
Title of host publication: 37th International Conference on Machine Learning, ICML 2020
Editors: Hal Daume, Aarti Singh
Publisher: International Machine Learning Society (IMLS)
Pages: 2681-2689
Number of pages: 9
ISBN (Electronic): 9781713821120
State: Published - 2020
Event: 37th International Conference on Machine Learning, ICML 2020 - Virtual, Online
Duration: Jul 13 2020 → Jul 18 2020

Publication series

Name: 37th International Conference on Machine Learning, ICML 2020
Volume: PartF168147-4

Conference

Conference: 37th International Conference on Machine Learning, ICML 2020
City: Virtual, Online
Period: 7/13/20 → 7/18/20

All Science Journal Classification (ASJC) codes

  • Computational Theory and Mathematics
  • Human-Computer Interaction
  • Software
