TY - JOUR
T1  - Gradient descent can take exponential time to escape saddle points
AU - Du, Simon S.
AU - Jin, Chi
AU - Lee, Jason D.
AU - Jordan, Michael I.
AU - Póczos, Barnabás
AU - Singh, Aarti
N1 - Funding Information:
S.S.D. and B.P. were supported by NSF grant IIS1563887 and ARPA-E Terra program. C.J. and M.I.J. were supported by the Mathematical Data Science program of the Office of Naval Research under grant number N00014-15-1-2670. J.D.L. was supported by ARO W911NF-17-1-0304. A.S. was supported by DARPA grant D17AP00001, AFRL grant FA8750-17-2-0212 and a CMU ProSEED/BrainHub Seed Grant. The authors thank Rong Ge, Qing Qu, John Wright, Elad Hazan, Sham Kakade, Benjamin Recht, Nathan Srebro, and Lin Xiao for useful discussions. The authors thank Stephen Wright and Michael O'Neill for pointing out calculation errors in the older version.
Publisher Copyright:
© 2017 Neural Information Processing Systems Foundation. All rights reserved.
PY - 2017
Y1 - 2017
N2  - Although gradient descent (GD) almost always escapes saddle points asymptotically [Lee et al., 2016], this paper shows that even with fairly natural random initialization schemes and non-pathological functions, GD can be significantly slowed down by saddle points, taking exponential time to escape. On the other hand, gradient descent with perturbations [Ge et al., 2015, Jin et al., 2017] is not slowed down by saddle points; it can find an approximate local minimizer in polynomial time. This result implies that GD is inherently slower than perturbed GD, and justifies the importance of adding perturbations for efficient non-convex optimization. While our focus is theoretical, we also present experiments that illustrate our theoretical findings.
AB  - Although gradient descent (GD) almost always escapes saddle points asymptotically [Lee et al., 2016], this paper shows that even with fairly natural random initialization schemes and non-pathological functions, GD can be significantly slowed down by saddle points, taking exponential time to escape. On the other hand, gradient descent with perturbations [Ge et al., 2015, Jin et al., 2017] is not slowed down by saddle points; it can find an approximate local minimizer in polynomial time. This result implies that GD is inherently slower than perturbed GD, and justifies the importance of adding perturbations for efficient non-convex optimization. While our focus is theoretical, we also present experiments that illustrate our theoretical findings.
UR - http://www.scopus.com/inward/record.url?scp=85047008217&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85047008217&partnerID=8YFLogxK
M3 - Conference article
AN - SCOPUS:85047008217
SN - 1049-5258
VL - 2017-December
SP - 1068
EP - 1078
JO - Advances in Neural Information Processing Systems
JF - Advances in Neural Information Processing Systems
T2 - 31st Annual Conference on Neural Information Processing Systems, NIPS 2017
Y2 - 4 December 2017 through 9 December 2017
ER -