TY - GEN
T1 - Provable self-play algorithms for competitive reinforcement learning
AU - Bai, Yu
AU - Jin, Chi
N1 - Publisher Copyright:
© ICML 2020. All rights reserved.
PY - 2020
Y1 - 2020
N2 - Self-play, where the algorithm learns by playing against itself without requiring any direct supervision, has become the new weapon in modern Reinforcement Learning (RL) for achieving superhuman performance in practice. However, the majority of existing theory in reinforcement learning only applies to the setting where the agent plays against a fixed environment; it remains largely open whether self-play algorithms can be provably effective, especially when it is necessary to manage the exploration/exploitation tradeoff. We study self-play in competitive reinforcement learning under the setting of Markov games, a generalization of Markov decision processes to the two-player case. We introduce a self-play algorithm, Value Iteration with Upper/Lower Confidence Bound (VI-ULCB), and show that it achieves regret O(√T) after playing T steps of the game, where the regret is measured by the agent's performance against a fully adversarial opponent who can exploit the agent's strategy at any step. We also introduce an explore-then-exploit style algorithm, which achieves a slightly worse regret of O(T^{2/3}), but is guaranteed to run in polynomial time even in the worst case. To the best of our knowledge, our work presents the first line of provably sample-efficient self-play algorithms for competitive reinforcement learning.
AB - Self-play, where the algorithm learns by playing against itself without requiring any direct supervision, has become the new weapon in modern Reinforcement Learning (RL) for achieving superhuman performance in practice. However, the majority of existing theory in reinforcement learning only applies to the setting where the agent plays against a fixed environment; it remains largely open whether self-play algorithms can be provably effective, especially when it is necessary to manage the exploration/exploitation tradeoff. We study self-play in competitive reinforcement learning under the setting of Markov games, a generalization of Markov decision processes to the two-player case. We introduce a self-play algorithm, Value Iteration with Upper/Lower Confidence Bound (VI-ULCB), and show that it achieves regret O(√T) after playing T steps of the game, where the regret is measured by the agent's performance against a fully adversarial opponent who can exploit the agent's strategy at any step. We also introduce an explore-then-exploit style algorithm, which achieves a slightly worse regret of O(T^{2/3}), but is guaranteed to run in polynomial time even in the worst case. To the best of our knowledge, our work presents the first line of provably sample-efficient self-play algorithms for competitive reinforcement learning.
UR - http://www.scopus.com/inward/record.url?scp=85105133035&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85105133035&partnerID=8YFLogxK
M3 - Conference contribution
AN - SCOPUS:85105133035
T3 - 37th International Conference on Machine Learning, ICML 2020
SP - 528
EP - 537
BT - 37th International Conference on Machine Learning, ICML 2020
A2 - Daume, Hal
A2 - Singh, Aarti
PB - International Machine Learning Society (IMLS)
T2 - 37th International Conference on Machine Learning, ICML 2020
Y2 - 13 July 2020 through 18 July 2020
ER -