TY - GEN
T1 - Label Noise SGD Provably Prefers Flat Global Minimizers
AU - Damian, Alex
AU - Ma, Tengyu
AU - Lee, Jason D.
N1 - Publisher Copyright:
© 2021 Neural Information Processing Systems Foundation. All rights reserved.
PY - 2021
Y1 - 2021
N2 - In overparametrized models, the noise in stochastic gradient descent (SGD) implicitly regularizes the optimization trajectory and determines which local minimum SGD converges to. Motivated by empirical studies that demonstrate that training with noisy labels improves generalization, we study the implicit regularization effect of SGD with label noise. We show that SGD with label noise converges to a stationary point of a regularized loss L(θ) + λR(θ), where L(θ) is the training loss, λ is an effective regularization parameter depending on the step size, strength of the label noise, and the batch size, and R(θ) is an explicit regularizer that penalizes sharp minimizers. Our analysis uncovers an additional regularization effect of large learning rates beyond the linear scaling rule that penalizes large eigenvalues of the Hessian more than small ones. We also prove extensions to classification with general loss functions, significantly strengthening the prior work of Blanc et al. [3] to global convergence and large learning rates and of HaoChen et al. [12] to general models.
UR - http://www.scopus.com/inward/record.url?scp=85131890838&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85131890838&partnerID=8YFLogxK
M3 - Conference contribution
AN - SCOPUS:85131890838
T3 - Advances in Neural Information Processing Systems
SP - 27449
EP - 27461
BT - Advances in Neural Information Processing Systems 34 - 35th Conference on Neural Information Processing Systems, NeurIPS 2021
A2 - Ranzato, Marc'Aurelio
A2 - Beygelzimer, Alina
A2 - Dauphin, Yann
A2 - Liang, Percy S.
A2 - Wortman Vaughan, Jenn
PB - Neural Information Processing Systems Foundation
T2 - 35th Conference on Neural Information Processing Systems, NeurIPS 2021
Y2 - 6 December 2021 through 14 December 2021
ER -