TY - GEN
T1 - Shaping Model-Free Habits with Model-Based Goals
AU - Krueger, Paul M.
AU - Griffiths, Thomas L.
N1 - Publisher Copyright:
© 2018 Proceedings of the 40th Annual Meeting of the Cognitive Science Society, CogSci 2018. All rights reserved.
PY - 2018
Y1 - 2018
N2 - Model-free (MF) and model-based (MB) reinforcement learning (RL) have provided a successful framework for understanding both human behavior and neural data. These two systems are usually thought to compete for control of behavior. However, it has also been proposed that they can be integrated in a cooperative manner. For example, the Dyna algorithm uses MB replay of past experience to train the MF system, and has inspired research examining whether human learners do something similar. Here we introduce an approach that links MF and MB learning in a new way: via the reward function. Given a model of the learning environment, dynamic programming is used to iteratively approximate state values that monotonically converge to the state values under the optimal decision policy. Pseudorewards are calculated from these values and used to shape the reward function of an MF learner in a way that is guaranteed not to change the optimal policy. We show that this method offers computational advantages over Dyna in two classic problems. It also offers a new way to think about integrating MF and MB RL: that our knowledge of the world doesn't just provide a source of simulated experience for training our instincts, but that it shapes the rewards that those instincts latch onto. We discuss psychological phenomena that this theory could apply to, including moral emotions.
UR - http://www.scopus.com/inward/record.url?scp=85082051215&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85082051215&partnerID=8YFLogxK
M3 - Conference contribution
AN - SCOPUS:85082051215
T3 - Proceedings of the 40th Annual Meeting of the Cognitive Science Society, CogSci 2018
SP - 1975
EP - 1980
BT - Proceedings of the 40th Annual Meeting of the Cognitive Science Society, CogSci 2018
PB - The Cognitive Science Society
T2 - 40th Annual Meeting of the Cognitive Science Society: Changing Minds, CogSci 2018
Y2 - 25 July 2018 through 28 July 2018
ER -