TY - GEN

T1 - Imitation learning with a value-based prior

AU - Syed, Umar

AU - Schapire, Robert E.

PY - 2007/12/1

Y1 - 2007/12/1

N2 - The goal of imitation learning is for an apprentice to learn how to behave in a stochastic environment by observing a mentor demonstrating the correct behavior. Accurate prior knowledge about the correct behavior can reduce the need for demonstrations from the mentor. We present a novel approach to encoding prior knowledge about the correct behavior, where we assume that this prior knowledge takes the form of a Markov Decision Process (MDP) that is used by the apprentice as a rough and imperfect model of the mentor's behavior. Specifically, taking a Bayesian approach, we treat the value of a policy in this modeling MDP as the log prior probability of the policy. In other words, we assume a priori that the mentor's behavior is likely to be a high-value policy in the modeling MDP, though quite possibly different from the optimal policy. We describe an efficient algorithm that, given a modeling MDP and a set of demonstrations by a mentor, provably converges to a stationary point of the log posterior of the mentor's policy, where the posterior is computed with respect to the "value-based" prior. We also present empirical evidence that this prior does in fact speed learning of the mentor's policy, and is an improvement in our experiments over similar previous methods.

UR - http://www.scopus.com/inward/record.url?scp=80053188613&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=80053188613&partnerID=8YFLogxK

M3 - Conference contribution

AN - SCOPUS:80053188613

SN - 0974903930

SN - 9780974903934

T3 - Proceedings of the 23rd Conference on Uncertainty in Artificial Intelligence, UAI 2007

SP - 384

EP - 391

BT - Proceedings of the 23rd Conference on Uncertainty in Artificial Intelligence, UAI 2007

T2 - 23rd Conference on Uncertainty in Artificial Intelligence, UAI 2007

Y2 - 19 July 2007 through 22 July 2007

ER -