TY - GEN
T1 - Optimistic MLE
T2 - 55th Annual ACM Symposium on Theory of Computing, STOC 2023
AU - Liu, Qinghua
AU - Netrapalli, Praneeth
AU - Szepesvari, Csaba
AU - Jin, Chi
N1 - Publisher Copyright:
© 2023 Owner/Author.
PY - 2023/6/2
Y1 - 2023/6/2
N2 - This paper introduces a simple efficient learning algorithms for general sequential decision making. The algorithm combines Optimism for exploration with Maximum Likelihood Estimation for model estimation, which is thus named OMLE. We prove that OMLE learns the near-optimal policies of an enormously rich class of sequential decision making problems in a polynomial number of samples. This rich class includes not only a majority of known tractable model-based Reinforcement Learning (RL) problems (such as tabular MDPs, factored MDPs, low witness rank problems, tabular weakly-revealing/observable POMDPs and multi-step decodable POMDPs ), but also many new challenging RL problems especially in the partially observable setting that were not previously known to be tractable. Notably, the new problems addressed by this paper include (1) observable POMDPs with continuous observation and function approximation, where we achieve the first sample complexity that is completely independent of the size of observation space; (2) well-conditioned low-rank sequential decision making problems (also known as Predictive State Representations (PSRs)), which include and generalize all known tractable POMDP examples under a more intrinsic representation; (3) general sequential decision making problems under SAIL condition, which unifies our existing understandings of model-based RL in both fully observable and partially observable settings. SAIL condition is identified by this paper, which can be viewed as a natural generalization of Bellman/witness rank to address partial observability. This paper also presents a reward-free variant of OMLE algorithm, which learns approximate dynamic models that enable the computation of near-optimal policies for all reward functions simultaneously.
AB - This paper introduces a simple efficient learning algorithms for general sequential decision making. The algorithm combines Optimism for exploration with Maximum Likelihood Estimation for model estimation, which is thus named OMLE. We prove that OMLE learns the near-optimal policies of an enormously rich class of sequential decision making problems in a polynomial number of samples. This rich class includes not only a majority of known tractable model-based Reinforcement Learning (RL) problems (such as tabular MDPs, factored MDPs, low witness rank problems, tabular weakly-revealing/observable POMDPs and multi-step decodable POMDPs ), but also many new challenging RL problems especially in the partially observable setting that were not previously known to be tractable. Notably, the new problems addressed by this paper include (1) observable POMDPs with continuous observation and function approximation, where we achieve the first sample complexity that is completely independent of the size of observation space; (2) well-conditioned low-rank sequential decision making problems (also known as Predictive State Representations (PSRs)), which include and generalize all known tractable POMDP examples under a more intrinsic representation; (3) general sequential decision making problems under SAIL condition, which unifies our existing understandings of model-based RL in both fully observable and partially observable settings. SAIL condition is identified by this paper, which can be viewed as a natural generalization of Bellman/witness rank to address partial observability. This paper also presents a reward-free variant of OMLE algorithm, which learns approximate dynamic models that enable the computation of near-optimal policies for all reward functions simultaneously.
KW - Optimistic MLE
KW - POMDPs
KW - PSRs
KW - Reinforcement Learning
UR - http://www.scopus.com/inward/record.url?scp=85163122227&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85163122227&partnerID=8YFLogxK
U2 - 10.1145/3564246.3585161
DO - 10.1145/3564246.3585161
M3 - Conference contribution
AN - SCOPUS:85163122227
T3 - Proceedings of the Annual ACM Symposium on Theory of Computing
SP - 363
EP - 376
BT - STOC 2023 - Proceedings of the 55th Annual ACM Symposium on Theory of Computing
A2 - Saha, Barna
A2 - Servedio, Rocco A.
PB - Association for Computing Machinery
Y2 - 20 June 2023 through 23 June 2023
ER -