TY - GEN

T1 - Intermittent Communications in Decentralized Shadow Reward Actor-Critic

AU - Bedi, Amrit Singh

AU - Koppel, Alec

AU - Wang, Mengdi

AU - Zhang, Junyu

N1 - Publisher Copyright:
© 2021 IEEE.

PY - 2021

Y1 - 2021

N2 - Broader decision-making goals such as risk-sensitivity, exploration, and incorporating prior experience motivate the study of cooperative multi-agent reinforcement learning (MARL) problems where the objective is any nonlinear function of the team's long-term state-action occupancy measure, i.e., a general utility, which subsumes the aforementioned goals. Existing decentralized actor-critic algorithms to solve this problem require extensive message passing per policy update, which may be impractical. Thus, we put forth Communication-Efficient Decentralized Shadow Reward Actor-Critic (CE-DSAC), which may operate with time-varying or event-triggered network connectivities. This scheme operates by having agents alternate between policy evaluation (critic), weighted averaging with neighbors (information mixing), and local gradient updates for their policy parameters (actor). CE-DSAC differs from the usual critic update in its local occupancy measure estimation step, which is needed to estimate the derivative of the local utility with respect to the occupancy measure, i.e., the "shadow reward," and the number of local weighted averaging steps executed by agents. This scheme improves existing tradeoffs between communications and convergence: to obtain ϵ-stationarity, we require O(1/ϵ^2.5) (Theorem IV.6) or faster O(1/ϵ^2) (Corollary IV.8) steps with high probability. Experiments demonstrate the merits of this approach for multiple RL agents solving cooperative navigation tasks with intermittent communications.

AB - Broader decision-making goals such as risk-sensitivity, exploration, and incorporating prior experience motivate the study of cooperative multi-agent reinforcement learning (MARL) problems where the objective is any nonlinear function of the team's long-term state-action occupancy measure, i.e., a general utility, which subsumes the aforementioned goals. Existing decentralized actor-critic algorithms to solve this problem require extensive message passing per policy update, which may be impractical. Thus, we put forth Communication-Efficient Decentralized Shadow Reward Actor-Critic (CE-DSAC), which may operate with time-varying or event-triggered network connectivities. This scheme operates by having agents alternate between policy evaluation (critic), weighted averaging with neighbors (information mixing), and local gradient updates for their policy parameters (actor). CE-DSAC differs from the usual critic update in its local occupancy measure estimation step, which is needed to estimate the derivative of the local utility with respect to the occupancy measure, i.e., the "shadow reward," and the number of local weighted averaging steps executed by agents. This scheme improves existing tradeoffs between communications and convergence: to obtain ϵ-stationarity, we require O(1/ϵ^2.5) (Theorem IV.6) or faster O(1/ϵ^2) (Corollary IV.8) steps with high probability. Experiments demonstrate the merits of this approach for multiple RL agents solving cooperative navigation tasks with intermittent communications.

UR - http://www.scopus.com/inward/record.url?scp=85126013337&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=85126013337&partnerID=8YFLogxK

U2 - 10.1109/CDC45484.2021.9682939

DO - 10.1109/CDC45484.2021.9682939

M3 - Conference contribution

AN - SCOPUS:85126013337

T3 - Proceedings of the IEEE Conference on Decision and Control

SP - 2613

EP - 2620

BT - 60th IEEE Conference on Decision and Control, CDC 2021

PB - Institute of Electrical and Electronics Engineers Inc.

T2 - 60th IEEE Conference on Decision and Control, CDC 2021

Y2 - 13 December 2021 through 17 December 2021

ER -