Near-optimal reinforcement learning with self-play

Yu Bai, Chi Jin, Tiancheng Yu

Research output: Contribution to journalConference articlepeer-review

59 Scopus citations


This paper considers the problem of designing optimal algorithms for reinforcement learning in two-player zero-sum games. We focus on self-play algorithms which learn the optimal policy by playing against itself without any direct supervision. In a tabular episodic Markov game with S states, A max-player actions and B min-player actions, the best existing algorithm for finding an approximate Nash equilibrium requires Õ(S2AB) steps of game playing, when only highlighting the dependency on (S, A, B). In contrast, the best existing lower bound scales as Ω(S(A + B)) and has a significant gap from the upper bound. This paper closes this gap for the first time: we propose an optimistic variant of the Nash Q-learning algorithm with sample complexity Õ(SAB), and a new Nash V-learning algorithm with sample complexity Õ(S(A + B)). The latter result matches the information-theoretic lower bound in all problem-dependent parameters except for a polynomial factor of the length of each episode. In addition, we present a computational hardness result for learning the best responses against a fixed opponent in Markov games—a learning objective different from finding the Nash equilibrium.

Original languageEnglish (US)
JournalAdvances in Neural Information Processing Systems
StatePublished - 2020
Event34th Conference on Neural Information Processing Systems, NeurIPS 2020 - Virtual, Online
Duration: Dec 6 2020Dec 12 2020

All Science Journal Classification (ASJC) codes

  • Computer Networks and Communications
  • Information Systems
  • Signal Processing


Dive into the research topics of 'Near-optimal reinforcement learning with self-play'. Together they form a unique fingerprint.

Cite this