TY - JOUR
T1 - Shape Matters: Understanding the Implicit Bias of the Noise Covariance
T2 - 34th Conference on Learning Theory, COLT 2021
AU - HaoChen, Jeff Z.
AU - Wei, Colin
AU - Lee, Jason D.
AU - Ma, Tengyu
N1 - Funding Information:
JZH acknowledges support from the Enlight Foundation Graduate Fellowship. CW acknowledges support from an NSF Graduate Research Fellowship. JDL acknowledges support of the ARO under MURI Award W911NF-11-1-0303, the Sloan Research Fellowship, and NSF CCF 2002272. TM acknowledges support of a Google Faculty Award. The work is also partially supported by SDSI and SAIL at Stanford.
Publisher Copyright:
© 2021 J.Z. HaoChen, C. Wei, J.D. Lee & T. Ma.
PY - 2021
Y1 - 2021
N2 - The noise in stochastic gradient descent (SGD) provides a crucial implicit regularization effect for training overparameterized models. Prior theoretical work largely focuses on spherical Gaussian noise, whereas empirical studies demonstrate the phenomenon that parameter-dependent noise — induced by mini-batches or label perturbation — is far more effective than Gaussian noise. This paper theoretically characterizes this phenomenon on a quadratically-parameterized model introduced by Vaskevicius et al. and Woodworth et al. We show that in an over-parameterized setting, SGD with label noise recovers the sparse ground-truth with an arbitrary initialization, whereas SGD with Gaussian noise or gradient descent overfits to dense solutions with large norms. Our analysis reveals that parameter-dependent noise introduces a bias towards local minima with smaller noise variance, whereas spherical Gaussian noise does not.
AB - The noise in stochastic gradient descent (SGD) provides a crucial implicit regularization effect for training overparameterized models. Prior theoretical work largely focuses on spherical Gaussian noise, whereas empirical studies demonstrate the phenomenon that parameter-dependent noise — induced by mini-batches or label perturbation — is far more effective than Gaussian noise. This paper theoretically characterizes this phenomenon on a quadratically-parameterized model introduced by Vaskevicius et al. and Woodworth et al. We show that in an over-parameterized setting, SGD with label noise recovers the sparse ground-truth with an arbitrary initialization, whereas SGD with Gaussian noise or gradient descent overfits to dense solutions with large norms. Our analysis reveals that parameter-dependent noise introduces a bias towards local minima with smaller noise variance, whereas spherical Gaussian noise does not.
KW - Implicit regularization
KW - implicit bias
KW - over-parameterization
UR - http://www.scopus.com/inward/record.url?scp=85113903359&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85113903359&partnerID=8YFLogxK
M3 - Conference article
AN - SCOPUS:85113903359
SN - 2640-3498
VL - 134
SP - 2315
EP - 2357
JO - Proceedings of Machine Learning Research
JF - Proceedings of Machine Learning Research
Y2 - 15 August 2021 through 19 August 2021
ER -