TY - JOUR
T1 - Offline Reinforcement Learning with Realizability and Single-policy Concentrability
AU - Zhan, Wenhao
AU - Huang, Baihe
AU - Huang, Audrey
AU - Jiang, Nan
AU - Lee, Jason D.
N1 - Funding Information:
NJ acknowledges funding support from ARL Cooperative Agreement W911NF-17-2-0196, NSF IIS-2112471, NSF CAREER IIS-2141781, and an Adobe Data Science Research Award. JDL and WZ acknowledge support from the ARO under MURI Award W911NF-11-1-0304, the Sloan Research Fellowship, NSF CCF-2002272, NSF IIS-2107304, an ONR Young Investigator Award, and an NSF CAREER Award.
Publisher Copyright:
© 2022 W. Zhan, B. Huang, A. Huang, N. Jiang & J.D. Lee.
PY - 2022
Y1 - 2022
AB - Sample-efficiency guarantees for offline reinforcement learning (RL) often rely on strong assumptions on both the function classes (e.g., Bellman-completeness) and the data coverage (e.g., all-policy concentrability). Despite recent efforts to relax these assumptions, existing works are only able to relax one of the two factors, leaving the strong assumption on the other factor intact. This raises an important open problem: can we achieve sample-efficient offline RL with weak assumptions on both factors? In this paper we answer the question in the affirmative. We analyze a simple algorithm based on the primal-dual formulation of MDPs, where the dual variables (discounted occupancies) are modeled using a density-ratio function against the offline data. With proper regularization, the algorithm enjoys polynomial sample complexity under only realizability and single-policy concentrability. We also provide alternative analyses based on different assumptions to shed light on the nature of primal-dual algorithms for offline RL.
KW - offline RL
KW - primal-dual
KW - reinforcement learning theory
UR - http://www.scopus.com/inward/record.url?scp=85164736209&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85164736209&partnerID=8YFLogxK
M3 - Conference article
AN - SCOPUS:85164736209
SN - 2640-3498
VL - 178
SP - 2730
EP - 2775
JO - Proceedings of Machine Learning Research
JF - Proceedings of Machine Learning Research
T2 - 35th Conference on Learning Theory, COLT 2022
Y2 - 2 July 2022 through 5 July 2022
ER -