Abstract
In offline RL, there is no opportunity to explore, so we must assume that the data are sufficient to guide the selection of a good policy, and we want such assumptions to be as mild as possible. In this work, we propose value-based algorithms for offline RL with PAC guarantees under only partial coverage, namely, coverage of a single comparator policy, and under realizability of the soft (entropy-regularized) Q-function of that policy together with a related function defined as the saddle point of a certain minimax optimization problem. This offers refined and generally weaker conditions for offline RL. We further show an analogous result for vanilla Q-functions under a soft margin condition. To attain these guarantees, we leverage novel minimax learning algorithms and analyses that accurately estimate either soft or vanilla Q-functions with strong L2-convergence guarantees. Our algorithms' loss functions arise from casting the estimation problems as nonlinear convex optimization problems and Lagrangifying. Surprisingly, we handle partial coverage even without explicitly enforcing pessimism.
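As general background (not reproduced from the paper), the soft Q-function of a policy π under entropy regularization is typically characterized by the soft Bellman equation sketched below; the temperature λ and discount γ are standard notation introduced here for illustration, and the paper's saddle-point formulation of the related function is not shown.

```latex
% Standard entropy-regularized (soft) Bellman equations for a fixed policy \pi.
% Temperature \lambda > 0 and discount \gamma \in [0,1) are illustrative notation,
% not necessarily the paper's.
\begin{align}
Q^{\pi}_{\lambda}(s,a) &= r(s,a)
  + \gamma\, \mathbb{E}_{s' \sim P(\cdot \mid s,a)}\!\bigl[ V^{\pi}_{\lambda}(s') \bigr], \\
V^{\pi}_{\lambda}(s) &= \mathbb{E}_{a \sim \pi(\cdot \mid s)}\!\bigl[
  Q^{\pi}_{\lambda}(s,a) - \lambda \log \pi(a \mid s) \bigr].
\end{align}
```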
| Original language | English (US) |
| --- | --- |
| Journal | Advances in Neural Information Processing Systems |
| Volume | 36 |
| State | Published - 2023 |
| Event | 37th Conference on Neural Information Processing Systems, NeurIPS 2023 - New Orleans, United States. Duration: Dec 10 2023 → Dec 16 2023 |
All Science Journal Classification (ASJC) codes
- Computer Networks and Communications
- Information Systems
- Signal Processing