TY - JOUR
T1 - Would I have gotten that reward? Long-term credit assignment by counterfactual contribution analysis
AU - Meulemans, Alexander
AU - Schug, Simon
AU - Kobayashi, Seijin
AU - Daw, Nathaniel D.
AU - Wayne, Gregory
N1 - Publisher Copyright:
© 2023 Neural information processing systems foundation. All rights reserved.
PY - 2023
Y1 - 2023
N2 - To make reinforcement learning more sample efficient, we need better credit assignment methods that measure an action's influence on future rewards. Building upon Hindsight Credit Assignment (HCA) [1], we introduce Counterfactual Contribution Analysis (COCOA), a new family of model-based credit assignment algorithms. Our algorithms achieve precise credit assignment by measuring the contribution of actions upon obtaining subsequent rewards, by quantifying a counterfactual query: 'Would the agent still have reached this reward if it had taken another action?'. We show that measuring contributions w.r.t. rewarding states, as is done in HCA, results in spurious estimates of contributions, causing HCA to degrade towards the high-variance REINFORCE estimator in many relevant environments. Instead, we measure contributions w.r.t. rewards or learned representations of the rewarding objects, resulting in gradient estimates with lower variance. We run experiments on a suite of problems specifically designed to evaluate long-term credit assignment capabilities. By using dynamic programming, we measure ground-truth policy gradients and show that the improved performance of our new model-based credit assignment methods is due to lower bias and variance compared to HCA and common baselines. Our results demonstrate how modeling action contributions towards rewarding outcomes can be leveraged for credit assignment, opening a new path towards sample-efficient reinforcement learning.
AB - To make reinforcement learning more sample efficient, we need better credit assignment methods that measure an action's influence on future rewards. Building upon Hindsight Credit Assignment (HCA) [1], we introduce Counterfactual Contribution Analysis (COCOA), a new family of model-based credit assignment algorithms. Our algorithms achieve precise credit assignment by measuring the contribution of actions upon obtaining subsequent rewards, by quantifying a counterfactual query: 'Would the agent still have reached this reward if it had taken another action?'. We show that measuring contributions w.r.t. rewarding states, as is done in HCA, results in spurious estimates of contributions, causing HCA to degrade towards the high-variance REINFORCE estimator in many relevant environments. Instead, we measure contributions w.r.t. rewards or learned representations of the rewarding objects, resulting in gradient estimates with lower variance. We run experiments on a suite of problems specifically designed to evaluate long-term credit assignment capabilities. By using dynamic programming, we measure ground-truth policy gradients and show that the improved performance of our new model-based credit assignment methods is due to lower bias and variance compared to HCA and common baselines. Our results demonstrate how modeling action contributions towards rewarding outcomes can be leveraged for credit assignment, opening a new path towards sample-efficient reinforcement learning.
UR - http://www.scopus.com/inward/record.url?scp=85191197484&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85191197484&partnerID=8YFLogxK
M3 - Conference article
AN - SCOPUS:85191197484
SN - 1049-5258
VL - 36
JO - Advances in Neural Information Processing Systems
JF - Advances in Neural Information Processing Systems
T2 - 37th Conference on Neural Information Processing Systems, NeurIPS 2023
Y2 - 10 December 2023 through 16 December 2023
ER -