TY - GEN

T1 - Communication lower bounds for statistical estimation problems via a distributed data processing inequality

AU - Braverman, Mark

AU - Garg, Ankit

AU - Ma, Tengyu

AU - Nguyen, Huy L.

AU - Woodruff, David P.

N1 - Publisher Copyright:
© 2016 ACM. 978-1-4503-4132-5/16/06...$15.00.

PY - 2016/6/19

Y1 - 2016/6/19

N2 - We study the tradeoff between the statistical error and communication cost of distributed statistical estimation problems in high dimensions. In the distributed sparse Gaussian mean estimation problem, each of the m machines receives n data points from a d-dimensional Gaussian distribution with unknown mean θ, which is promised to be k-sparse. The machines communicate by message passing and aim to estimate the mean θ. We provide a tight (up to logarithmic factors) tradeoff between the estimation error and the number of bits communicated between the machines. This directly leads to a lower bound for the distributed sparse linear regression problem: to achieve the statistical minimax error, the total communication is at least Ω(min{n,d}m), where n is the number of observations that each machine receives and d is the ambient dimension. These lower bound results improve upon Shamir (NIPS'14) and Steinhardt, Duchi (COLT'15) by allowing a multi-round interactive communication model. We also give the first optimal simultaneous protocol in the dense case for mean estimation. As our main technique, we prove a distributed data processing inequality, a generalization of the usual data processing inequality, which might be of independent interest and useful for other problems.

AB - We study the tradeoff between the statistical error and communication cost of distributed statistical estimation problems in high dimensions. In the distributed sparse Gaussian mean estimation problem, each of the m machines receives n data points from a d-dimensional Gaussian distribution with unknown mean θ, which is promised to be k-sparse. The machines communicate by message passing and aim to estimate the mean θ. We provide a tight (up to logarithmic factors) tradeoff between the estimation error and the number of bits communicated between the machines. This directly leads to a lower bound for the distributed sparse linear regression problem: to achieve the statistical minimax error, the total communication is at least Ω(min{n,d}m), where n is the number of observations that each machine receives and d is the ambient dimension. These lower bound results improve upon Shamir (NIPS'14) and Steinhardt, Duchi (COLT'15) by allowing a multi-round interactive communication model. We also give the first optimal simultaneous protocol in the dense case for mean estimation. As our main technique, we prove a distributed data processing inequality, a generalization of the usual data processing inequality, which might be of independent interest and useful for other problems.

KW - Communication complexity

KW - Information complexity

KW - Statistical estimation

UR - http://www.scopus.com/inward/record.url?scp=84979258003&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=84979258003&partnerID=8YFLogxK

U2 - 10.1145/2897518.2897582

DO - 10.1145/2897518.2897582

M3 - Conference contribution

AN - SCOPUS:84979258003

T3 - Proceedings of the Annual ACM Symposium on Theory of Computing

SP - 1011

EP - 1020

BT - STOC 2016 - Proceedings of the 48th Annual ACM SIGACT Symposium on Theory of Computing

A2 - Mansour, Yishay

A2 - Wichs, Daniel

PB - Association for Computing Machinery

T2 - 48th Annual ACM SIGACT Symposium on Theory of Computing, STOC 2016

Y2 - 19 June 2016 through 21 June 2016

ER -