TY - GEN
T1 - Reward-Free Exploration for Reinforcement Learning
AU - Jin, Chi
AU - Krishnamurthy, Akshay
AU - Simchowitz, Max
AU - Yu, Tiancheng
N1 - Publisher Copyright:
© 2020 by the Authors.
PY - 2020
Y1 - 2020
N2 - Exploration is widely regarded as one of the most challenging aspects of reinforcement learning (RL), with many naive approaches succumbing to exponential sample complexity. To isolate the challenges of exploration, we propose a new "reward-free RL" framework. In the exploration phase, the agent first collects trajectories from an MDPMwithout a pre-specified reward function. After exploration, it is tasked with computing near-optimal policies under for M for a collection of given reward functions. This framework is particularly suitable when there are many reward functions of interest, or when the reward function is shaped by an external agent to elicit desired behavior. We give an efficient algorithm that conducts O (S2Apoly(H)= 2) episodes of exploration and returns-suboptimal policies for an arbitrary number of reward functions. We achieve this by finding exploratory policies that visit each "significant" state with probability proportional to its maximum visitation probability under any possible policy. Moreover, our planning procedure can be instantiated by any black-box approximate planner, such as value iteration or natural policy gradient. We also give a nearly-matching (S2AH2=2) lower bound, demonstrating the near-optimality of our algorithm in this setting.
AB - Exploration is widely regarded as one of the most challenging aspects of reinforcement learning (RL), with many naive approaches succumbing to exponential sample complexity. To isolate the challenges of exploration, we propose a new "reward-free RL" framework. In the exploration phase, the agent first collects trajectories from an MDPMwithout a pre-specified reward function. After exploration, it is tasked with computing near-optimal policies under for M for a collection of given reward functions. This framework is particularly suitable when there are many reward functions of interest, or when the reward function is shaped by an external agent to elicit desired behavior. We give an efficient algorithm that conducts O (S2Apoly(H)= 2) episodes of exploration and returns-suboptimal policies for an arbitrary number of reward functions. We achieve this by finding exploratory policies that visit each "significant" state with probability proportional to its maximum visitation probability under any possible policy. Moreover, our planning procedure can be instantiated by any black-box approximate planner, such as value iteration or natural policy gradient. We also give a nearly-matching (S2AH2=2) lower bound, demonstrating the near-optimality of our algorithm in this setting.
UR - http://www.scopus.com/inward/record.url?scp=85105124866&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85105124866&partnerID=8YFLogxK
M3 - Conference contribution
AN - SCOPUS:85105124866
T3 - 37th International Conference on Machine Learning, ICML 2020
SP - 4820
EP - 4829
BT - 37th International Conference on Machine Learning, ICML 2020
A2 - Daume, Hal
A2 - Singh, Aarti
PB - International Machine Learning Society (IMLS)
T2 - 37th International Conference on Machine Learning, ICML 2020
Y2 - 13 July 2020 through 18 July 2020
ER -