TY - GEN
T1 - A reduction from apprenticeship learning to classification
AU - Syed, Umar
AU - Schapire, Robert E.
PY - 2010
Y1 - 2010
N2 - We provide new theoretical results for apprenticeship learning, a variant of reinforcement learning in which the true reward function is unknown, and the goal is to perform well relative to an observed expert. We study a common approach to learning from expert demonstrations: using a classification algorithm to learn to imitate the expert's behavior. Although this straightforward learning strategy is widely-used in practice, it has been subject to very little formal analysis. We prove that, if the learned classifier has error rate ε, the difference between the value of the apprentice's policy and the expert's policy is O(√ε). Further, we prove that this difference is only O(ε) when the expert's policy is close to optimal. This latter result has an important practical consequence: Not only does imitating a near-optimal expert result in a better policy, but far fewer demonstrations are required to successfully imitate such an expert. This suggests an opportunity for substantial savings whenever the expert is known to be good, but demonstrations are expensive or difficult to obtain.
AB - We provide new theoretical results for apprenticeship learning, a variant of reinforcement learning in which the true reward function is unknown, and the goal is to perform well relative to an observed expert. We study a common approach to learning from expert demonstrations: using a classification algorithm to learn to imitate the expert's behavior. Although this straightforward learning strategy is widely-used in practice, it has been subject to very little formal analysis. We prove that, if the learned classifier has error rate ε, the difference between the value of the apprentice's policy and the expert's policy is O(√ε). Further, we prove that this difference is only O(ε) when the expert's policy is close to optimal. This latter result has an important practical consequence: Not only does imitating a near-optimal expert result in a better policy, but far fewer demonstrations are required to successfully imitate such an expert. This suggests an opportunity for substantial savings whenever the expert is known to be good, but demonstrations are expensive or difficult to obtain.
UR - http://www.scopus.com/inward/record.url?scp=85162059109&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85162059109&partnerID=8YFLogxK
M3 - Conference contribution
AN - SCOPUS:85162059109
SN - 9781617823800
T3 - Advances in Neural Information Processing Systems 23: 24th Annual Conference on Neural Information Processing Systems 2010, NIPS 2010
BT - Advances in Neural Information Processing Systems 23
PB - Neural Information Processing Systems
T2 - 24th Annual Conference on Neural Information Processing Systems 2010, NIPS 2010
Y2 - 6 December 2010 through 9 December 2010
ER -