Abstract
The noise in stochastic gradient descent (SGD) provides a crucial implicit regularization effect for training over-parameterized models. Prior theoretical work largely focuses on spherical Gaussian noise, whereas empirical studies show that parameter-dependent noise, such as that induced by mini-batches or label perturbation, is far more effective than Gaussian noise. This paper theoretically characterizes this phenomenon on a quadratically-parameterized model introduced by Vaskevicius et al. and Woodworth et al. We show that in an over-parameterized setting, SGD with label noise recovers the sparse ground truth from an arbitrary initialization, whereas SGD with Gaussian noise or gradient descent overfits to dense solutions with large norms. Our analysis reveals that parameter-dependent noise introduces a bias towards local minima with smaller noise variance, whereas spherical Gaussian noise does not.
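As a rough illustration of the setting described in the abstract, the following hypothetical NumPy sketch trains the quadratically-parameterized linear model f(x; u, v) = <u*u - v*v, x> (elementwise squares) by SGD with either label noise or spherical Gaussian noise added to the gradient. The dimensions, step size, noise level, and initialization below are illustrative assumptions and do not reproduce the paper's analysis or experiments.

```python
# Hypothetical sketch of the quadratically-parameterized sparse-recovery setting.
# Assumed values throughout; not the paper's experimental setup.
import numpy as np

rng = np.random.default_rng(0)
d, n, k = 100, 50, 3               # dimension, samples, sparsity (assumed)
theta_star = np.zeros(d)
theta_star[:k] = 1.0               # sparse ground truth
X = rng.standard_normal((n, d))
y = X @ theta_star                 # noiseless labels; over-parameterized since d > n

def train(noise_type, lr=1e-2, sigma=0.3, steps=10000, init=1.0):
    # Model: f(x) = <u*u - v*v, x>, trained with single-sample SGD.
    u = init * np.ones(d)          # arbitrary (non-vanishing) initialization
    v = init * np.ones(d)
    for _ in range(steps):
        i = rng.integers(n)
        x, target = X[i], y[i]
        if noise_type == "label":
            # label perturbation: flip the target by +/- sigma
            target = target + sigma * rng.choice([-1.0, 1.0])
        resid = (u * u - v * v) @ x - target
        grad_u = 2 * resid * x * u     # chain rule through the quadratic map
        grad_v = -2 * resid * x * v
        if noise_type == "gaussian":
            # parameter-independent spherical Gaussian noise on the gradient
            grad_u += sigma * rng.standard_normal(d)
            grad_v += sigma * rng.standard_normal(d)
        u -= lr * grad_u
        v -= lr * grad_v
    return u * u - v * v

for kind in ["label", "gaussian"]:
    theta_hat = train(kind)
    err = np.linalg.norm(theta_hat - theta_star)
    print(f"{kind:8s} noise: recovery error {err:.3f}, "
          f"ell_1 norm {np.abs(theta_hat).sum():.2f}")
```

Under the abstract's claim one would expect the label-noise run to drift towards the sparse, small-norm solution, while the Gaussian-noise run fits the training data with a denser, larger-norm estimate; the sketch only illustrates the two noise models side by side and makes no quantitative claim.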
| Original language | English (US) |
|---|---|
| Pages (from-to) | 2315-2357 |
| Number of pages | 43 |
| Journal | Proceedings of Machine Learning Research |
| Volume | 134 |
| State | Published - 2021 |
| Event | 34th Conference on Learning Theory, COLT 2021, Boulder, United States, Aug 15 2021 → Aug 19 2021 |
All Science Journal Classification (ASJC) codes
- Artificial Intelligence
- Software
- Control and Systems Engineering
- Statistics and Probability
Keywords
- Implicit regularization
- implicit bias
- over-parameterization