TY - GEN
T1 - Imitation learning with a value-based prior
AU - Syed, Umar
AU - Schapire, Robert E.
PY - 2007
Y1 - 2007
AB - The goal of imitation learning is for an apprentice to learn how to behave in a stochastic environment by observing a mentor demonstrating the correct behavior. Accurate prior knowledge about the correct behavior can reduce the need for demonstrations from the mentor. We present a novel approach to encoding prior knowledge about the correct behavior, where we assume that this prior knowledge takes the form of a Markov Decision Process (MDP) that is used by the apprentice as a rough and imperfect model of the mentor's behavior. Specifically, taking a Bayesian approach, we treat the value of a policy in this modeling MDP as the log prior probability of the policy. In other words, we assume a priori that the mentor's behavior is likely to be a high-value policy in the modeling MDP, though quite possibly different from the optimal policy. We describe an efficient algorithm that, given a modeling MDP and a set of demonstrations by a mentor, provably converges to a stationary point of the log posterior of the mentor's policy, where the posterior is computed with respect to the "value-based" prior. We also present empirical evidence that this prior does in fact speed learning of the mentor's policy, and is an improvement in our experiments over similar previous methods.
UR - http://www.scopus.com/inward/record.url?scp=80053188613&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=80053188613&partnerID=8YFLogxK
M3 - Conference contribution
AN - SCOPUS:80053188613
SN - 0974903930
SN - 9780974903934
T3 - Proceedings of the 23rd Conference on Uncertainty in Artificial Intelligence, UAI 2007
SP - 384
EP - 391
BT - Proceedings of the 23rd Conference on Uncertainty in Artificial Intelligence, UAI 2007
T2 - 23rd Conference on Uncertainty in Artificial Intelligence, UAI 2007
Y2 - 19 July 2007 through 22 July 2007
ER -