TY - GEN

T1 - A Sharp Analysis of Model-based Reinforcement Learning with Self-Play

AU - Liu, Qinghua

AU - Yu, Tiancheng

AU - Bai, Yu

AU - Jin, Chi

N1 - Publisher Copyright:
Copyright © 2021 by the author(s)

PY - 2021

Y1 - 2021

N2 - Model-based algorithms-algorithms that explore the environment through building and utilizing an estimated model-are widely used in reinforcement learning practice and theoretically shown to achieve optimal sample efficiency for single-agent reinforcement learning in Markov Decision Processes (MDPs). However, for multi-agent reinforcement learning in Markov games, the current best known sample complexity for model-based algorithms is rather suboptimal and compares unfavorably against recent model-free approaches. In this paper, we present a sharp analysis of model-based self-play algorithms for multi-agent Markov games. We design an algorithm Optimistic Nash Value Iteration (Nash-VI) for two-player zero-sum Markov games that is able to output an ∊-approximate Nash policy in Õ(H3SAB/∊2) episodes of game playing, where S is the number of states, A, B are the number of actions for the two players respectively, and H is the horizon length. This significantly improves over the best known model-based guarantee of Õ(H4S2AB/∊2), and is the first that matches the information-theoretic lower bound Ω(H3S(A + B)/∊2) except for a min {A, B} factor. In addition, our guarantee compares favorably against the best known model-free algorithm if min {A, B} = o(H3), and outputs a single Markov policy while existing sample-efficient model-free algorithms output a nested mixture of Markov policies that is in general non-Markov and rather inconvenient to store and execute. We further adapt our analysis to designing a provably efficient task-agnostic algorithm for zero-sum Markov games, and designing the first line of provably sample-efficient algorithms for multi-player general-sum Markov games.

AB - Model-based algorithms-algorithms that explore the environment through building and utilizing an estimated model-are widely used in reinforcement learning practice and theoretically shown to achieve optimal sample efficiency for single-agent reinforcement learning in Markov Decision Processes (MDPs). However, for multi-agent reinforcement learning in Markov games, the current best known sample complexity for model-based algorithms is rather suboptimal and compares unfavorably against recent model-free approaches. In this paper, we present a sharp analysis of model-based self-play algorithms for multi-agent Markov games. We design an algorithm Optimistic Nash Value Iteration (Nash-VI) for two-player zero-sum Markov games that is able to output an ∊-approximate Nash policy in Õ(H3SAB/∊2) episodes of game playing, where S is the number of states, A, B are the number of actions for the two players respectively, and H is the horizon length. This significantly improves over the best known model-based guarantee of Õ(H4S2AB/∊2), and is the first that matches the information-theoretic lower bound Ω(H3S(A + B)/∊2) except for a min {A, B} factor. In addition, our guarantee compares favorably against the best known model-free algorithm if min {A, B} = o(H3), and outputs a single Markov policy while existing sample-efficient model-free algorithms output a nested mixture of Markov policies that is in general non-Markov and rather inconvenient to store and execute. We further adapt our analysis to designing a provably efficient task-agnostic algorithm for zero-sum Markov games, and designing the first line of provably sample-efficient algorithms for multi-player general-sum Markov games.

UR - http://www.scopus.com/inward/record.url?scp=85161270685&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=85161270685&partnerID=8YFLogxK

M3 - Conference contribution

AN - SCOPUS:85161270685

T3 - Proceedings of Machine Learning Research

SP - 7001

EP - 7010

BT - Proceedings of the 38th International Conference on Machine Learning, ICML 2021

PB - ML Research Press

T2 - 38th International Conference on Machine Learning, ICML 2021

Y2 - 18 July 2021 through 24 July 2021

ER -