TY - JOUR
T1 - Distributed cooperative decision making in multi-agent multi-armed bandits
AU - Landgren, Peter
AU - Srivastava, Vaibhav
AU - Leonard, Naomi Ehrich
N1 - Funding Information:
This research has been supported in part by ONR, USA grants N00014-14-1-0635, N00014-19-1-2556, ARO, USA grants W911NF-14-1-0431, W911NF-18-1-0325, and by the Department of Defense (DoD), USA through the National Defense Science & Engineering Graduate Fellowship (NDSEG) Program, USA. The material in this paper was partially presented at the 15th annual European Control Conference, June 29–July 1, 2016, Aalborg, Denmark, and the 55th IEEE Conference on Decision and Control, December 12–14, 2016, Las Vegas, NV, USA. This paper was recommended for publication in revised form by Associate Editor Luca Schenato under the direction of Editor Christos G. Cassandras.
Publisher Copyright:
© 2020 Elsevier Ltd
PY - 2021/3
Y1 - 2021/3
N2 - We study a distributed decision-making problem in which multiple agents face the same multi-armed bandit (MAB), and each agent makes sequential choices among arms to maximize its own individual reward. The agents cooperate by sharing their estimates over a fixed communication graph. We consider an unconstrained reward model in which two or more agents can choose the same arm and collect independent rewards, and a constrained reward model in which agents that choose the same arm at the same time receive no reward. We design a dynamic, consensus-based, distributed estimation algorithm for cooperative estimation of mean rewards at each arm. We leverage the estimates from this algorithm to develop two distributed algorithms: coop-UCB2 and coop-UCB2-selective-learning, for the unconstrained and constrained reward models, respectively. We show that both algorithms achieve group performance close to the performance of a centralized fusion center. Further, we investigate the influence of the communication graph structure on performance. We propose a novel graph explore–exploit index that predicts the relative performance of groups in terms of the communication graph, and we propose a novel nodal explore–exploit centrality index that predicts the relative performance of agents in terms of the agent locations in the communication graph.
AB - We study a distributed decision-making problem in which multiple agents face the same multi-armed bandit (MAB), and each agent makes sequential choices among arms to maximize its own individual reward. The agents cooperate by sharing their estimates over a fixed communication graph. We consider an unconstrained reward model in which two or more agents can choose the same arm and collect independent rewards, and a constrained reward model in which agents that choose the same arm at the same time receive no reward. We design a dynamic, consensus-based, distributed estimation algorithm for cooperative estimation of mean rewards at each arm. We leverage the estimates from this algorithm to develop two distributed algorithms: coop-UCB2 and coop-UCB2-selective-learning, for the unconstrained and constrained reward models, respectively. We show that both algorithms achieve group performance close to the performance of a centralized fusion center. Further, we investigate the influence of the communication graph structure on performance. We propose a novel graph explore–exploit index that predicts the relative performance of groups in terms of the communication graph, and we propose a novel nodal explore–exploit centrality index that predicts the relative performance of agents in terms of the agent locations in the communication graph.
KW - Distributed decision making
KW - Explore–exploit dilemma
KW - Multi-agent systems
KW - Multi-armed bandits
UR - http://www.scopus.com/inward/record.url?scp=85098978495&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85098978495&partnerID=8YFLogxK
U2 - 10.1016/j.automatica.2020.109445
DO - 10.1016/j.automatica.2020.109445
M3 - Article
AN - SCOPUS:85098978495
SN - 0005-1098
VL - 125
JO - Automatica
JF - Automatica
M1 - 109445
ER -