TY - JOUR

T1 - Distributed Stochastic Gradient Descent

T2 - Nonconvexity, Nonsmoothness, and Convergence to Local Minima

AU - Swenson, Brian

AU - Murray, Ryan

AU - Poor, H. Vincent

AU - Kar, Soummya

N1 - Funding Information:
B.S. and V.P. wish to acknowledge the support of the U.S. National Science Foundation under Grant CCF-1908308 and a grant from the C3.ai Digital Transformation Institute.
Publisher Copyright:
© 2022 Swenson, Murray, Poor, Kar.

PY - 2022/10/1

Y1 - 2022/10/1

AB - Gradient-descent (GD) based algorithms are an indispensable tool for optimizing modern machine learning models. The paper considers distributed stochastic GD (D-SGD), a network-based variant of GD. Distributed algorithms play an important role in large-scale machine learning problems as well as the Internet of Things (IoT) and related applications. The paper addresses two main issues. First, we study convergence of D-SGD to critical points when the loss function is nonconvex and nonsmooth. We consider a broad range of nonsmooth loss functions, including those of practical interest in modern deep learning. It is shown that, for each fixed initialization, D-SGD converges to critical points of the loss with probability one. Next, we consider the problem of avoiding saddle points. It is well known that classical GD avoids saddle points; however, analogous results have been absent for distributed variants of GD. For this problem, we again assume that loss functions may be nonconvex and nonsmooth, but are smooth in a neighborhood of a saddle point. It is shown that, for any fixed initialization, D-SGD avoids such saddle points with probability one. Results are proved by studying the underlying (distributed) gradient flow, using the ordinary differential equation (ODE) method of stochastic approximation.

KW - distributed optimization

KW - gradient descent

KW - nonconvex optimization

KW - saddle point

KW - stochastic optimization

UR - http://www.scopus.com/inward/record.url?scp=85148054886&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=85148054886&partnerID=8YFLogxK

M3 - Article

AN - SCOPUS:85148054886

SN - 1532-4435

VL - 23

JO - Journal of Machine Learning Research

JF - Journal of Machine Learning Research

M1 - 328

ER -