Offline Minimax Soft-Q-learning Under Realizability and Partial Coverage

Masatoshi Uehara, Jason D. Lee, Nathan Kallus, Wen Sun

Research output: Contribution to journal › Conference article › peer-review

2 Scopus citations

Abstract

In offline RL, there is no opportunity to explore, so we must assume that the data are sufficient to guide the choice of a good policy, and we want these assumptions to be as mild as possible. In this work, we propose value-based algorithms for offline RL with PAC guarantees under only partial coverage, namely coverage of a single comparator policy, together with realizability of the soft (entropy-regularized) Q-function of that policy and of a related function defined as the saddle point of a certain minimax optimization problem. This offers refined and generally more relaxed conditions for offline RL. We further show an analogous result for vanilla Q-functions under a soft margin condition. To attain these guarantees, we leverage novel minimax learning algorithms and analyses that accurately estimate either soft or vanilla Q-functions with strong L2-convergence guarantees. Our algorithms' loss functions arise from casting the estimation problems as nonlinear convex optimization problems and Lagrangifying. Surprisingly, we handle partial coverage even without explicitly enforcing pessimism.
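
As context for the entropy-regularized objects the abstract refers to, the following is a minimal sketch of the soft Bellman equation defining the soft Q-function of a fixed policy; the notation (including the regularization weight \lambda) is illustrative and not taken verbatim from the paper.

% Soft (entropy-regularized) Q-function of a fixed policy \pi,
% written as the fixed point of the soft Bellman evaluation operator.
% \lambda is the entropy-regularization weight (illustrative notation).
\begin{equation*}
  Q^{\pi}_{\mathrm{soft}}(s,a)
  = r(s,a)
  + \gamma\, \mathbb{E}_{s' \sim P(\cdot \mid s,a),\; a' \sim \pi(\cdot \mid s')}
    \Big[ Q^{\pi}_{\mathrm{soft}}(s',a') - \lambda \log \pi(a' \mid s') \Big].
\end{equation*}

The paper's realizability assumption asks only that this soft Q-function of the single comparator policy (and a related saddle-point function) lie in the function class used for estimation.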

Original language: English (US)
Journal: Advances in Neural Information Processing Systems
Volume: 36
State: Published - 2023
Event: 37th Conference on Neural Information Processing Systems, NeurIPS 2023 - New Orleans, United States
Duration: Dec 10, 2023 – Dec 16, 2023

All Science Journal Classification (ASJC) codes

  • Computer Networks and Communications
  • Information Systems
  • Signal Processing
