TY - JOUR
T1 - Implicit Regularization in Nonconvex Statistical Estimation
T2 - Gradient Descent Converges Linearly for Phase Retrieval, Matrix Completion, and Blind Deconvolution
AU - Ma, Cong
AU - Wang, Kaizheng
AU - Chi, Yuejie
AU - Chen, Yuxin
N1 - Funding Information:
Y. Chen is supported in part by the AFOSR YIP Award FA9550-19-1-0030, ONR Grant N00014-19-1-2120, ARO Grant W911NF-18-1-0303, NSF Grant CCF-1907661, and the Princeton SEAS innovation award. Y. Chi is supported in part by AFOSR Grant FA9550-15-1-0205, ONR Grants N00014-18-1-2142 and N00014-19-1-2404, ARO Grant W911NF-18-1-0303, and NSF Grants CCF-1826519, ECCS-1818571, and CCF-1806154. Y. Chen thanks Yudong Chen for inspiring discussions about matrix completion.
Publisher Copyright:
© 2019, The Author(s).
PY - 2020/6/1
Y1 - 2020/6/1
N2 - Recent years have seen a flurry of activity in designing provably efficient nonconvex procedures for solving statistical estimation problems. Due to the highly nonconvex nature of the empirical loss, state-of-the-art procedures often require proper regularization (e.g., trimming, regularized cost, projection) in order to guarantee fast convergence. For vanilla procedures such as gradient descent, however, prior theory either recommends highly conservative learning rates to avoid overshooting, or completely lacks performance guarantees. This paper uncovers a striking phenomenon in nonconvex optimization: even in the absence of explicit regularization, gradient descent enforces proper regularization implicitly under various statistical models. In fact, gradient descent follows a trajectory staying within a basin that enjoys nice geometry, consisting of points incoherent with the sampling mechanism. This “implicit regularization” feature allows gradient descent to proceed in a far more aggressive fashion without overshooting, which in turn results in substantial computational savings. Focusing on three fundamental statistical estimation problems, namely, phase retrieval, low-rank matrix completion, and blind deconvolution, we establish that gradient descent achieves near-optimal statistical and computational guarantees without explicit regularization. In particular, by marrying statistical modeling with generic optimization theory, we develop a general recipe for analyzing the trajectories of iterative algorithms via a leave-one-out perturbation argument. As a by-product, for noisy matrix completion, we demonstrate that gradient descent achieves near-optimal error control—measured entrywise and by the spectral norm—which might be of independent interest.
KW - Blind deconvolution
KW - Gradient descent
KW - Leave-one-out analysis
KW - Matrix completion
KW - Nonconvex optimization
KW - Phase retrieval
UR - http://www.scopus.com/inward/record.url?scp=85070199712&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85070199712&partnerID=8YFLogxK
DO - 10.1007/s10208-019-09429-9
M3 - Article
AN - SCOPUS:85070199712
SN - 1615-3375
VL - 20
SP - 451
EP - 632
JO - Foundations of Computational Mathematics
JF - Foundations of Computational Mathematics
IS - 3
ER -