Reward-Free Exploration for Reinforcement Learning

  • Chi Jin
  • Akshay Krishnamurthy
  • Max Simchowitz
  • Tiancheng Yu

Research output: Contribution to journal › Conference article › peer-review

Abstract

Exploration is widely regarded as one of the most challenging aspects of reinforcement learning (RL), with many naive approaches succumbing to exponential sample complexity. To isolate the challenges of exploration, we propose a new “reward-free RL” framework. In the exploration phase, the agent first collects trajectories from an MDP M without a pre-specified reward function. After exploration, it is tasked with computing near-optimal policies under M for a collection of given reward functions. This framework is particularly suitable when there are many reward functions of interest, or when the reward function is shaped by an external agent to elicit desired behavior. We give an efficient algorithm that conducts Õ(S²A poly(H)/ϵ²) episodes of exploration and returns ϵ-suboptimal policies for an arbitrary number of reward functions. We achieve this by finding exploratory policies that visit each “significant” state with probability proportional to its maximum visitation probability under any possible policy. Moreover, our planning procedure can be instantiated by any black-box approximate planner, such as value iteration or natural policy gradient. We also give a nearly-matching Ω(S²AH²/ϵ²) lower bound, demonstrating the near-optimality of our algorithm in this setting.
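
The two-phase structure described in the abstract can be illustrated with a small tabular sketch. This is not the paper's algorithm: the exploration policy here is plain uniform-random action selection (a stand-in for the exploratory policies the abstract refers to), the three-state chain environment is a made-up toy, and the function names `collect_random_trajectories`, `estimate_model`, and `value_iteration` are hypothetical. It only shows the idea that trajectories gathered without any reward can later be reused to plan for several reward functions with a black-box planner such as value iteration.

```python
import numpy as np

def collect_random_trajectories(step, A, H, episodes, rng):
    """Exploration phase (assumption: uniform-random actions, start state 0)."""
    trajectories = []
    for _ in range(episodes):
        s, traj = 0, []
        for _ in range(H):
            a = int(rng.integers(A))
            s_next = step(s, a)
            traj.append((s, a, s_next))  # no reward is recorded
            s = s_next
        trajectories.append(traj)
    return trajectories

def estimate_model(trajectories, S, A, H):
    """Empirical transition model P_hat[h, s, a, s'] from the trajectories."""
    counts = np.zeros((H, S, A, S))
    for traj in trajectories:
        for h, (s, a, s_next) in enumerate(traj):
            counts[h, s, a, s_next] += 1
    totals = counts.sum(axis=-1, keepdims=True)
    # Unvisited (h, s, a) triples fall back to a uniform next-state guess.
    return np.where(totals > 0, counts / np.maximum(totals, 1), 1.0 / S)

def value_iteration(P_hat, reward):
    """Planning phase: finite-horizon backward induction.

    reward has shape (H, S, A); returns (policy[h, s], V at step 0).
    """
    H, S, A, _ = P_hat.shape
    V = np.zeros(S)
    policy = np.zeros((H, S), dtype=int)
    for h in reversed(range(H)):
        Q = reward[h] + P_hat[h] @ V  # shape (S, A)
        policy[h] = Q.argmax(axis=1)
        V = Q.max(axis=1)
    return policy, V

# Toy chain MDP (hypothetical): action 0 stays, action 1 moves one state right.
S, A, H = 3, 2, 3
step = lambda s, a: min(s + a, S - 1)

rng = np.random.default_rng(0)
trajectories = collect_random_trajectories(step, A, H, episodes=200, rng=rng)
P_hat = estimate_model(trajectories, S, A, H)

# Two reward functions supplied only after exploration has finished.
reach_right = np.zeros((H, S, A)); reach_right[:, 2, :] = 1.0
stay_left = np.zeros((H, S, A)); stay_left[:, 0, :] = 1.0

results = {
    "reach_right": value_iteration(P_hat, reach_right),
    "stay_left": value_iteration(P_hat, stay_left),
}
for name, (policy, V) in results.items():
    print(name, "first action from state 0:", policy[0, 0], "value:", V[0])
```

The same `P_hat` is planned against twice, once per reward function, without collecting any new data. Uniform-random exploration is enough in this tiny chain but would not give the guarantees stated in the abstract in general.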

Original language: English (US)
Journal: Proceedings of Machine Learning Research
Volume: 119
State: Published - 2020
Event: 37th International Conference on Machine Learning, ICML 2020 - Virtual, Online
Duration: Jul 13 2020 → Jul 18 2020

All Science Journal Classification (ASJC) codes

  • Software
  • Control and Systems Engineering
  • Statistics and Probability
  • Artificial Intelligence
