Distributed Stochastic Gradient Descent: Nonconvexity, Nonsmoothness, and Convergence to Local Minima

Brian Swenson, Ryan Murray, H. Vincent Poor, Soummya Kar

Research output: Contribution to journal › Article › peer-review

12 Scopus citations

Abstract

Gradient-descent (GD) based algorithms are an indispensable tool for optimizing modern machine learning models. This paper considers distributed stochastic GD (D-SGD), a network-based variant of GD. Distributed algorithms play an important role in large-scale machine learning problems as well as in the Internet of Things (IoT) and related applications. The paper addresses two main issues. First, we study convergence of D-SGD to critical points when the loss function is nonconvex and nonsmooth. We consider a broad range of nonsmooth loss functions, including those of practical interest in modern deep learning. It is shown that, for each fixed initialization, D-SGD converges to critical points of the loss with probability one. Next, we consider the problem of avoiding saddle points. It is well known that classical GD avoids saddle points; however, analogous results have been absent for distributed variants of GD. For this problem, we again allow the loss functions to be nonconvex and nonsmooth, but assume they are smooth in a neighborhood of a saddle point. It is shown that, for any fixed initialization, D-SGD avoids such saddle points with probability one. The results are proved by studying the underlying (distributed) gradient flow, using the ordinary differential equation (ODE) method of stochastic approximation.
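
To make the setting concrete, the following is a minimal sketch of a generic distributed SGD loop of the kind the abstract describes: each agent in a network alternates a consensus (mixing) step with its neighbors and a local noisy gradient step. This is an illustrative reconstruction, not the paper's exact algorithm or analysis; the function name `d_sgd`, the doubly stochastic mixing matrix `W`, the step-size schedule, and the additive Gaussian noise model are all assumptions made for the example.

```python
import numpy as np

def d_sgd(grad_fns, W, x0, n_steps=10_000, seed=0,
          step=lambda t: 1.0 / (t + 1), noise_std=0.1):
    """Illustrative distributed SGD sketch (not the paper's exact method).

    grad_fns : list of per-agent gradient callables, grad_fns[i](x) -> ndarray
    W        : doubly stochastic mixing matrix encoding the network
    x0       : shared initial point, ndarray of shape (d,)
    """
    n = len(grad_fns)
    rng = np.random.default_rng(seed)
    X = np.tile(x0, (n, 1))            # row i holds agent i's current iterate
    for t in range(n_steps):
        # 1) consensus step: average with neighbors via the mixing matrix
        X = W @ X
        # 2) local stochastic gradient step; additive noise stands in for
        #    sampling a data point (an assumption for this sketch)
        G = np.stack([g(X[i]) for i, g in enumerate(grad_fns)])
        G += noise_std * rng.standard_normal(G.shape)
        X -= step(t) * G
    return X.mean(axis=0)              # network average as the final estimate


if __name__ == "__main__":
    # Toy example: 3 agents with nonconvex scalar losses f_i(x) = (x^2 - a_i)^2
    a = [1.0, 2.0, 3.0]
    grads = [lambda x, a_i=a_i: 4 * x * (x * x - a_i) for a_i in a]
    # Symmetric, doubly stochastic mixing matrix with self-loops
    W = np.array([[0.50, 0.25, 0.25],
                  [0.25, 0.50, 0.25],
                  [0.25, 0.25, 0.50]])
    x_star = d_sgd(grads, W, np.array([0.5]))
    print("Approximate critical point of the average loss:", x_star)
```

With a diminishing step size and a connected network, such iterates are driven toward the critical points of the average loss, which is the behavior the paper analyzes rigorously via the associated gradient flow.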

Original language: English (US)
Article number: 328
Journal: Journal of Machine Learning Research
Volume: 23
State: Published - Oct 1 2022
Externally published: Yes

All Science Journal Classification (ASJC) codes

  • Software
  • Artificial Intelligence
  • Control and Systems Engineering
  • Statistics and Probability

Keywords

  • Nonconvex optimization
  • distributed optimization
  • gradient descent
  • saddle point
  • stochastic optimization
