TY - GEN
T1 - Intermittent Communications in Decentralized Shadow Reward Actor-Critic
AU - Bedi, Amrit Singh
AU - Koppel, Alec
AU - Wang, Mengdi
AU - Zhang, Junyu
N1 - Publisher Copyright:
© 2021 IEEE.
PY - 2021
Y1 - 2021
N2 - Broader decision-making goals such as risk-sensitivity, exploration, and incorporating prior experience motivate the study of cooperative multi-agent reinforcement learning (MARL) problems where the objective is any nonlinear function of the team's long-term state-action occupancy measure, i.e., a general utility, which subsumes the aforementioned goals. Existing decentralized actor-critic algorithms for solving this problem require extensive message passing per policy update, which may be impractical. Thus, we put forth Communication-Efficient Decentralized Shadow Reward Actor-Critic (CE-DSAC), which may operate with time-varying or event-triggered network connectivities. This scheme operates by having agents alternate between policy evaluation (critic), weighted averaging with neighbors (information mixing), and local gradient updates for their policy parameters (actor). CE-DSAC differs from the usual critic update in its local occupancy measure estimation step, which is needed to estimate the derivative of the local utility with respect to their occupancy measure, i.e., the "shadow reward," and in the number of local weighted averaging steps executed by agents. This scheme improves existing tradeoffs between communications and convergence: to obtain ϵ-stationarity, we require O(1/ϵ^2.5) (Theorem IV.6) or faster O(1/ϵ^2) (Corollary IV.8) steps with high probability. Experiments demonstrate the merits of this approach for multiple RL agents solving cooperative navigation tasks with intermittent communications.
AB - Broader decision-making goals such as risk-sensitivity, exploration, and incorporating prior experience motivate the study of cooperative multi-agent reinforcement learning (MARL) problems where the objective is any nonlinear function of the team's long-term state-action occupancy measure, i.e., a general utility, which subsumes the aforementioned goals. Existing decentralized actor-critic algorithms for solving this problem require extensive message passing per policy update, which may be impractical. Thus, we put forth Communication-Efficient Decentralized Shadow Reward Actor-Critic (CE-DSAC), which may operate with time-varying or event-triggered network connectivities. This scheme operates by having agents alternate between policy evaluation (critic), weighted averaging with neighbors (information mixing), and local gradient updates for their policy parameters (actor). CE-DSAC differs from the usual critic update in its local occupancy measure estimation step, which is needed to estimate the derivative of the local utility with respect to their occupancy measure, i.e., the "shadow reward," and in the number of local weighted averaging steps executed by agents. This scheme improves existing tradeoffs between communications and convergence: to obtain ϵ-stationarity, we require O(1/ϵ^2.5) (Theorem IV.6) or faster O(1/ϵ^2) (Corollary IV.8) steps with high probability. Experiments demonstrate the merits of this approach for multiple RL agents solving cooperative navigation tasks with intermittent communications.
UR - http://www.scopus.com/inward/record.url?scp=85126013337&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85126013337&partnerID=8YFLogxK
U2 - 10.1109/CDC45484.2021.9682939
DO - 10.1109/CDC45484.2021.9682939
M3 - Conference contribution
AN - SCOPUS:85126013337
T3 - Proceedings of the IEEE Conference on Decision and Control
SP - 2613
EP - 2620
BT - 60th IEEE Conference on Decision and Control, CDC 2021
PB - Institute of Electrical and Electronics Engineers Inc.
T2 - 60th IEEE Conference on Decision and Control, CDC 2021
Y2 - 13 December 2021 through 17 December 2021
ER -