Abstract
This paper makes progress toward learning Nash equilibria in two-player, zero-sum Markov games from offline data. Specifically, consider a γ-discounted, infinite-horizon Markov game with S states, in which the max-player has A actions and the min-player has B actions. We propose a pessimistic model-based algorithm with Bernstein-style lower confidence bounds, called value iteration with lower confidence bounds for zero-sum Markov games, that provably finds an ε-approximate Nash equilibrium with a sample complexity no larger than $\frac{C^{\star}_{\mathsf{clipped}}\, S(A+B)}{(1-\gamma)^{3}\varepsilon^{2}}$ (up to some log factor). Here, $C^{\star}_{\mathsf{clipped}}$ is some unilateral clipped concentrability coefficient that reflects the coverage and distribution shift of the available data (vis-à-vis the target data), and the target accuracy ε can be any value within $\big(0, \frac{1}{1-\gamma}\big]$. Our sample complexity bound strengthens prior art by a factor of min{A, B}, achieving minimax optimality for a broad regime of interest. An appealing feature of our result lies in its algorithmic simplicity, which reveals that neither variance reduction nor sample splitting is necessary for achieving sample optimality.
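To make the algorithmic idea concrete, below is a minimal sketch of the kind of pessimistic value-iteration update the abstract describes, assuming a tabular empirical model estimated from the offline dataset and rewards normalized to [0, 1]. The function names (`vi_lcb_zero_sum`, `matrix_game_value`), the bonus constant `c_b`, and the confidence parameter `delta` are illustrative assumptions, not the paper's exact specification.

```python
import numpy as np
from scipy.optimize import linprog

def matrix_game_value(M):
    """Value of the zero-sum matrix game max_x min_y x^T M y, solved as an LP."""
    A, B = M.shape
    # Variables: (v, x_1..x_A); linprog minimizes, so minimize -v.
    c = np.zeros(A + 1)
    c[0] = -1.0
    # For every pure column b of the min-player: v - x^T M[:, b] <= 0.
    A_ub = np.hstack([np.ones((B, 1)), -M.T])
    b_ub = np.zeros(B)
    # x must lie on the probability simplex.
    A_eq = np.zeros((1, A + 1)); A_eq[0, 1:] = 1.0
    b_eq = np.array([1.0])
    bounds = [(None, None)] + [(0.0, 1.0)] * A
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq, bounds=bounds)
    return -res.fun

def vi_lcb_zero_sum(P_hat, r, N, gamma, T, c_b=1.0, delta=0.01):
    """
    Pessimistic value iteration with a Bernstein-style lower confidence bound.
    P_hat: empirical transition kernel, shape (S, A, B, S);
    r: rewards in [0, 1], shape (S, A, B);
    N: offline visitation counts of each (s, a, b), shape (S, A, B).
    """
    S, A, B, _ = P_hat.shape
    V = np.zeros(S)
    log_term = np.log(S * A * B * T / delta)
    for _ in range(T):
        # Empirical variance of V under P_hat (clipped at 0 for numerical safety).
        var = np.maximum(P_hat @ (V ** 2) - (P_hat @ V) ** 2, 0.0)
        # Bernstein-style penalty: variance-dependent term plus a lower-order term.
        bonus = c_b * (np.sqrt(var * log_term / np.maximum(N, 1))
                       + log_term / ((1 - gamma) * np.maximum(N, 1)))
        # Pessimistic Q-estimate, kept within the valid value range.
        Q_lcb = np.clip(r + gamma * (P_hat @ V) - bonus, 0.0, 1.0 / (1 - gamma))
        Q_lcb[N == 0] = 0.0  # fully pessimistic on pairs never seen in the data
        # Each state's value is the Nash value of its pessimistic stage game.
        V = np.array([matrix_game_value(Q_lcb[s]) for s in range(S)])
    return V
```

The matrix-game step is what distinguishes the two-player setting from single-agent pessimistic value iteration: instead of a max over actions, each backup solves a small zero-sum game over the pessimistic Q-estimates.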
| Original language | English (US) |
|---|---|
| Pages (from-to) | 2430-2445 |
| Number of pages | 16 |
| Journal | Operations Research |
| Volume | 72 |
| Issue number | 6 |
| DOIs | |
| State | Published - Nov 2024 |
| Externally published | Yes |
All Science Journal Classification (ASJC) codes
- Computer Science Applications
- Management Science and Operations Research
Keywords
- Nash equilibrium
- curse of multiple agents
- minimax optimality
- model-based approach
- offline RL
- unilateral coverage
- zero-sum Markov games