TY - JOUR
T1 - Theoretical insights into the optimization landscape of over-parameterized shallow neural networks
AU - Soltanolkotabi, Mahdi
AU - Javanmard, Adel
AU - Lee, Jason D.
N1 - Funding Information:
Manuscript received August 7, 2017; revised March 25, 2018 and June 18, 2018; accepted June 20, 2018. Date of publication July 10, 2018; date of current version January 18, 2019. M. Soltanolkotabi was supported in part by the Simons Institute for the Theory of Computing, in part by the Air Force Office of Scientific Research Young Investigator Program (AFOSR-YIP) under Award FA9550-18-1-0078 and NSF Award #1813877, in part by the Google Faculty Research Award, and in part by the Northrop Grumman Cybersecurity Research Consortium. A. Javanmard was supported in part by the Google Faculty Research Award and in part by the Office of the Provost at the University of Southern California through the Zumberge Fund Individual Grant Program.
Publisher Copyright:
© 2018 IEEE.
PY - 2019/2
Y1 - 2019/2
N2 - In this paper, we study the problem of learning a shallow artificial neural network that best fits a training data set. We study this problem in the over-parameterized regime where the number of observations is fewer than the number of parameters in the model. We show that with quadratic activations, the optimization landscape of training such shallow neural networks has certain favorable characteristics that allow globally optimal models to be found efficiently using a variety of local search heuristics. This result holds for arbitrary training data of input/output pairs. For differentiable activation functions, we also show that gradient descent, when suitably initialized, converges at a linear rate to a globally optimal model. This result focuses on a realizable model where the inputs are chosen i.i.d. from a Gaussian distribution and the labels are generated according to planted weight coefficients.
AB - In this paper, we study the problem of learning a shallow artificial neural network that best fits a training data set. We study this problem in the over-parameterized regime where the number of observations is fewer than the number of parameters in the model. We show that with quadratic activations, the optimization landscape of training such shallow neural networks has certain favorable characteristics that allow globally optimal models to be found efficiently using a variety of local search heuristics. This result holds for arbitrary training data of input/output pairs. For differentiable activation functions, we also show that gradient descent, when suitably initialized, converges at a linear rate to a globally optimal model. This result focuses on a realizable model where the inputs are chosen i.i.d. from a Gaussian distribution and the labels are generated according to planted weight coefficients.
KW - Nonconvex optimization
KW - over-parametrized neural networks
KW - random matrix theory
UR - http://www.scopus.com/inward/record.url?scp=85049796568&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85049796568&partnerID=8YFLogxK
U2 - 10.1109/TIT.2018.2854560
DO - 10.1109/TIT.2018.2854560
M3 - Article
AN - SCOPUS:85049796568
SN - 0018-9448
VL - 65
SP - 742
EP - 769
JO - IEEE Transactions on Information Theory
JF - IEEE Transactions on Information Theory
IS - 2
M1 - 8409482
ER -