On the local minima of the empirical risk

Chi Jin, Rong Ge, Lydia T. Liu, Michael I. Jordan

Research output: Contribution to journal › Conference article › peer-review

27 Scopus citations

Abstract

Population risk is always of primary interest in machine learning; however, learning algorithms only have access to the empirical risk. Even for applications with nonconvex, nonsmooth losses (such as modern deep networks), the population risk is generally significantly better behaved from an optimization point of view than the empirical risk. In particular, sampling can create many spurious local minima. We consider a general framework that aims to optimize a smooth nonconvex function F (population risk) given only access to an approximation f (empirical risk) that is pointwise close to F (i.e., ‖F − f‖_∞ ≤ ν). Our objective is to find the ε-approximate local minima of the underlying function F, while avoiding the shallow local minima, arising because of the tolerance ν, that exist only in f. We propose a simple algorithm based on stochastic gradient descent (SGD) on a smoothed version of f that is guaranteed to achieve our goal as long as ν ≤ O(ε^1.5/d). We also provide an almost matching lower bound showing that our algorithm achieves optimal error tolerance ν among all algorithms making a polynomial number of queries of f. As a concrete example, we show that our results can be directly used to give sample complexities for learning a ReLU unit.
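
The core algorithmic idea in the abstract is to run SGD on a Gaussian-smoothed surrogate of the empirical risk f, using only function values of f. Below is a minimal, illustrative Python sketch of that idea: it forms a standard two-point zeroth-order estimate of the gradient of the smoothed function f_σ(x) = E_{z~N(0, σ²I)}[f(x + z)] and takes SGD steps on it. The function names, step size, smoothing radius, and the toy F and f are assumptions for illustration, not the paper's exact procedure or constants.

```python
import numpy as np

def smoothed_sgd(f, x0, sigma=0.1, eta=0.01, n_iters=2000, seed=0):
    """Sketch of SGD on a Gaussian-smoothed surrogate of f.

    The smoothed objective is f_sigma(x) = E_{z ~ N(0, sigma^2 I)}[f(x + z)].
    Its gradient admits an unbiased estimate from function values alone:
        grad f_sigma(x) = E[(f(x + z) - f(x - z)) z] / (2 sigma^2),
    so only zeroth-order queries of f are needed. Hyperparameters here are
    illustrative placeholders, not the paper's recommended choices.
    """
    rng = np.random.default_rng(seed)
    x = np.array(x0, dtype=float)
    for _ in range(n_iters):
        z = rng.normal(scale=sigma, size=x.shape)
        # Two-point zeroth-order estimate of the smoothed gradient at x.
        g = (f(x + z) - f(x - z)) / (2.0 * sigma**2) * z
        x -= eta * g
    return x

if __name__ == "__main__":
    # Toy example: F is a smooth population risk; f is a pointwise-close
    # empirical risk with small high-frequency wiggles (spurious minima).
    F = lambda x: np.sum(x**2)
    f = lambda x: F(x) + 0.01 * np.sin(50.0 * np.sum(x))
    print(smoothed_sgd(f, x0=np.ones(5)))
```

Intuitively, smoothing at scale σ averages out wiggles of magnitude ν, so the shallow minima that exist only in f no longer trap the iterates, while approximate local minima of F are preserved.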

Original language: English (US)
Pages (from-to): 4896-4905
Number of pages: 10
Journal: Advances in Neural Information Processing Systems
Volume: 2018-December
State: Published - 2018
Event: 32nd Conference on Neural Information Processing Systems, NeurIPS 2018 - Montreal, Canada
Duration: Dec 2, 2018 - Dec 8, 2018

All Science Journal Classification (ASJC) codes

  • Computer Networks and Communications
  • Information Systems
  • Signal Processing
