Accelerated Gradient Descent Escapes Saddle Points Faster than Gradient Descent

Chi Jin, Praneeth Netrapalli, Michael I. Jordan

Research output: Contribution to journal › Conference article › peer-review

98 Scopus citations

Abstract

Nesterov’s accelerated gradient descent (AGD), an instance of the general family of “momentum methods,” provably achieves a faster convergence rate than gradient descent (GD) in the convex setting. While these methods are widely used in modern nonconvex applications, including the training of deep neural networks, whether they are provably superior to GD in the nonconvex setting remains open. This paper studies a simple variant of Nesterov’s AGD and shows that it escapes saddle points and finds a second-order stationary point in Õ(1/ε^{7/4}) iterations, matching the best known convergence rate, which is faster than the Õ(1/ε^2) iterations required by GD. To the best of our knowledge, this is the first direct acceleration (single-loop) algorithm that is provably faster than GD in the general nonconvex setting: all previous nonconvex accelerated algorithms rely on more complex mechanisms such as nested loops and proximal terms. Our analysis is based on two key ideas: (1) the use of a simple Hamiltonian function, inspired by a continuous-time perspective, which AGD monotonically decreases on each step even for nonconvex functions, and (2) a novel framework called improve or localize, which is useful for tracking the long-term behavior of gradient-based optimization algorithms. We believe that these techniques may deepen our understanding of both acceleration algorithms and nonconvex optimization.
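To make the single-loop idea in the abstract concrete, here is a minimal, illustrative sketch of a Nesterov-style momentum update combined with a small random perturbation applied whenever the gradient is small, which is the kind of mechanism used to escape saddle points. The step size, momentum parameter, perturbation radius, and the `grad` callback are placeholder assumptions, not the paper's tuned choices; the paper's actual algorithm also uses a negative-curvature exploitation step and tracks the Hamiltonian f(x_t) + ||v_t||^2/(2η), neither of which is reproduced here.

```python
import numpy as np

def perturbed_agd(grad, x0, eta=1e-3, theta=0.1, grad_tol=1e-4,
                  perturb_radius=1e-2, max_iter=10_000, seed=0):
    """Illustrative single-loop momentum method with random perturbations.

    All parameter values are placeholders; this is a sketch, not the
    paper's algorithm (which additionally exploits negative curvature).
    """
    rng = np.random.default_rng(seed)
    x = np.asarray(x0, dtype=float)
    v = np.zeros_like(x)  # momentum ("velocity") term
    for _ in range(max_iter):
        if np.linalg.norm(grad(x)) <= grad_tol:
            # Near a first-order stationary point: add a small random
            # perturbation so the iterates can move away from a saddle.
            x = x + rng.uniform(-1.0, 1.0, size=x.shape) * perturb_radius
            v = np.zeros_like(x)
        # Nesterov-style update: look-ahead point, gradient step, new momentum.
        y = x + (1.0 - theta) * v
        x_next = y - eta * grad(y)
        v = x_next - x
        x = x_next
    return x
```

As a toy usage example, for the quadratic saddle f(x) = x[0]^2 - x[1]^2, calling `perturbed_agd(lambda x: np.array([2*x[0], -2*x[1]]), np.zeros(2))` moves the iterates away from the saddle at the origin, whereas plain GD initialized exactly at the origin would stay put.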

Original language: English (US)
Pages (from-to): 1042-1085
Number of pages: 44
Journal: Proceedings of Machine Learning Research
Volume: 75
State: Published - 2018
Externally published: Yes
Event: 31st Annual Conference on Learning Theory, COLT 2018 - Stockholm, Sweden
Duration: Jul 6 2018 - Jul 9 2018

All Science Journal Classification (ASJC) codes

  • Artificial Intelligence
  • Software
  • Control and Systems Engineering
  • Statistics and Probability
