Our goal is for AI systems to correctly identify and act according to their human user's objectives. Cooperative Inverse Reinforcement Learning (CIRL) formalizes this value alignment problem as a two-player game between a human and a robot in which only the human knows the parameters of the reward function; the robot must learn them as the interaction unfolds. Previous work showed that CIRL can be solved as a POMDP, but with an action space exponential in the size of the reward parameter space. In this work, we exploit a specific property of CIRL, namely that the human is a full-information agent, to derive an optimality-preserving modification to the standard Bellman update; this reduces the complexity of the problem by an exponential factor and allows us to relax CIRL's assumption of human rationality. We apply this update to a variety of POMDP solvers and find that it enables us to scale CIRL to non-trivial problems, with larger reward parameter spaces and larger action spaces for both robot and human. In solutions to these larger problems, the human exhibits pedagogic behavior, while the robot interprets it as such and attains higher value for the human.
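
To make the source of the exponential saving concrete, here is a minimal one-step sketch, assuming a toy stateless setting; all names (Theta, A_H, A_R, R) are illustrative, not the paper's implementation. In the standard reduction, the robot must consider every human response policy h: Θ → A^H, a set of size |A^H|^|Θ|. Because the human observes θ, the maximization over her action distributes inside the expectation over θ, reducing the per-update cost to |Θ|·|A^H|; the paper's result is that this modification also preserves optimality in the full recursive Bellman update, which this myopic sketch does not attempt to show.

```python
import itertools
import numpy as np

# Illustrative toy problem; all quantities here are assumptions for
# this sketch, not the paper's code. The state s is omitted for brevity.
Theta = [0, 1, 2]          # candidate reward parameters (human knows the true one)
A_H   = ["left", "right"]  # human actions
A_R   = ["wait", "go"]     # robot actions
b     = np.ones(len(Theta)) / len(Theta)  # robot's belief over Theta

def R(a_h, a_r, theta):
    """Toy stand-in for the parameterized reward R(s, a^H, a^R; theta)."""
    return (theta + 1) * (a_h == "right") - 0.5 * (a_r == "go")

def q_naive(a_r):
    """Standard reduction: score every human response policy
    h: Theta -> A_H, a max over |A_H|**|Theta| candidates."""
    best = -np.inf
    for h in itertools.product(A_H, repeat=len(Theta)):  # exponential in |Theta|
        val = sum(b[i] * R(h[i], a_r, th) for i, th in enumerate(Theta))
        best = max(best, val)
    return best

def q_modified(a_r):
    """Modified update: since the human observes theta, her best response
    is computed independently for each theta, moving the max inside the
    expectation. Cost drops to |Theta| * |A_H| evaluations."""
    return sum(b[i] * max(R(a_h, a_r, th) for a_h in A_H)
               for i, th in enumerate(Theta))

for a_r in A_R:
    assert np.isclose(q_naive(a_r), q_modified(a_r))  # same value, one step
print({a_r: q_modified(a_r) for a_r in A_R})
```

The assertion passes because each term of the expectation depends on the human policy only through its value at that θ, so maximizing termwise is exact; this is the structural property the modified Bellman update exploits.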