## Abstract

We study a decentralized cooperative multi-agent multi-armed bandit (MAB) problem with K arms and N agents connected over a network. In this model, each arm's reward distribution is the same for every agent, and rewards are drawn independently across agents and over time steps. At each iteration, agents independently choose an arm to play and exchange at most mathsf {poly}(K) real-valued messages with their neighbors. Existing lower bound on the average per-agent regret over the network shows that cooperation with other agents can achieve a reduction of O\left({\frac {1}{N}}right) over when playing in isolation. Motivated by this, we study a message-passing algorithm that can be combined with existing Bayesian MAB algorithms. Using this, we propose a decentralized Thompson Sampling (TS) algorithm and a decentralized Bayes-UCB algorithm. Under decentralized TS for bounded rewards, we establish a problem-dependent upper bound on the average per-agent regret in terms of the number of agents in the network and the network topology. For Bernoulli rewards, our upper bound asymptotically matches the lower bound. We empirically show that the proposed decentralized TS algorithm incurs significantly lesser per-agent regret than previously proposed algorithms. Furthermore, we show that the proposed decentralized TS can be extended to general bandit problems, where posterior distribution cannot be computed in closed form. We combine our decentralized TS algorithm with Variational Inference to apply our proposed algorithm to complex realistic reward distributions in a computationally efficient manner. We implement our proposed decentralized TS under gossip protocol and over time-varying networks, where each communication link has a fixed probability of failure and show that it incurs logarithmic regret.

Original language | English (US) |
---|---|

Article number | 9434413 |

Pages (from-to) | 564-583 |

Number of pages | 20 |

Journal | IEEE Journal on Selected Areas in Information Theory |

Volume | 2 |

Issue number | 2 |

DOIs | |

State | Published - Jun 2021 |

Externally published | Yes |

## All Science Journal Classification (ASJC) codes

- Computer Networks and Communications
- Media Technology
- Artificial Intelligence
- Applied Mathematics

## Keywords

- Multi-armed bandits
- Thompson sampling
- cooperative stochastic bandits
- decentralized Bayesian learning
- sequential decision-making