Abstract
Although gradient descent (GD) almost always escapes saddle points asymptotically [Lee et al., 2016], this paper shows that even with fairly natural random initialization schemes and non-pathological functions, GD can be significantly slowed down by saddle points, taking exponential time to escape. On the other hand, gradient descent with perturbations [Ge et al., 2015, Jin et al., 2017] is not slowed down by saddle points: it can find an approximate local minimizer in polynomial time. This result implies that GD is inherently slower than perturbed GD, and justifies the importance of adding perturbations for efficient non-convex optimization. While our focus is theoretical, we also present experiments that illustrate our theoretical findings.
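For readers unfamiliar with the perturbed variant contrasted in the abstract, the sketch below illustrates the idea behind gradient descent with perturbations [Ge et al., 2015, Jin et al., 2017]: take ordinary gradient steps, and inject a small random perturbation whenever the gradient is small, so the iterate can drift off a saddle along a negative-curvature direction. This is a minimal illustrative sketch, not the authors' analyzed algorithm; the function name `perturbed_gd` and all hyperparameter values are assumptions chosen for readability.

```python
import numpy as np

def perturbed_gd(grad, x0, eta=0.01, g_thresh=1e-3, radius=0.1,
                 cooldown=50, max_iters=10_000, rng=None):
    """Illustrative sketch of gradient descent with perturbations.

    Takes ordinary gradient steps; when the gradient norm is small
    (a point possibly near a saddle), adds a perturbation sampled
    uniformly from a small ball so the iterate can escape along a
    negative-curvature direction. Hyperparameters are illustrative,
    not the theoretically prescribed values from the papers.
    """
    rng = np.random.default_rng() if rng is None else rng
    x = np.asarray(x0, dtype=float)
    last_perturb = -cooldown
    for t in range(max_iters):
        g = grad(x)
        if np.linalg.norm(g) <= g_thresh and t - last_perturb >= cooldown:
            # Near a stationary point: perturb uniformly within a ball
            # of the given radius.
            xi = rng.normal(size=x.shape)
            xi *= radius * rng.uniform() ** (1.0 / x.size) / np.linalg.norm(xi)
            x = x + xi
            last_perturb = t
        else:
            x = x - eta * g  # ordinary gradient step
    return x
```

On a toy saddle such as f(x, y) = x² − y², plain gradient descent started exactly on the x-axis never acquires a y-component, whereas the occasional perturbation above knocks the iterate off that unstable manifold and lets it escape.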
| Original language | English (US) |
| --- | --- |
| Pages (from-to) | 1068-1078 |
| Number of pages | 11 |
| Journal | Advances in Neural Information Processing Systems |
| Volume | 2017-December |
| State | Published - 2017 |
| Externally published | Yes |
| Event | 31st Annual Conference on Neural Information Processing Systems, NIPS 2017 - Long Beach, United States; Duration: Dec 4 2017 → Dec 9 2017 |
All Science Journal Classification (ASJC) codes
- Computer Networks and Communications
- Information Systems
- Signal Processing