What Happens after SGD Reaches Zero Loss? A Mathematical Framework

Zhiyuan Li, Tianhao Wang, Sanjeev Arora

Research output: Contribution to conference › Paper › peer-review



Understanding the implicit bias of Stochastic Gradient Descent (SGD) is one of the key challenges in deep learning, especially for overparametrized models, where the local minimizers of the loss function L can form a manifold. Intuitively, with a sufficiently small learning rate η, SGD tracks Gradient Descent (GD) until it gets close to such a manifold, where the gradient noise prevents further convergence. In this regime, Blanc et al. (2020) proved that SGD with label noise locally decreases a regularizer-like term, the sharpness of the loss, tr[∇²L]. The current paper gives a general framework for such analysis by adapting ideas from Katzenberger (1991). It allows, in principle, a complete characterization of the regularization effect of SGD around such a manifold, i.e., the "implicit bias", using a stochastic differential equation (SDE) describing the limiting dynamics of the parameters, which is determined jointly by the loss function and the noise covariance. This yields some new results: (1) a global analysis of the implicit bias valid for η^{-2} steps, in contrast to the local analysis of Blanc et al. (2020) that is only valid for η^{-1.6} steps, and (2) allowing arbitrary noise covariance. As an application, we show that with arbitrarily large initialization, label noise SGD can always escape the kernel regime and only requires O(κ ln d) samples for learning a κ-sparse overparametrized linear model in ℝ^d (Woodworth et al., 2020), while GD initialized in the kernel regime requires Ω(d) samples. This upper bound is minimax optimal and improves the previous Õ(κ²) upper bound (HaoChen et al., 2020).
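To make the sparse-recovery application concrete, here is a minimal numerical sketch (not from the paper) of label-noise SGD on the quadratically overparametrized linear model of Woodworth et al. (2020), where the prediction is ⟨u⊙u − v⊙v, x⟩ and zero-loss interpolants form a manifold. All hyperparameters (d, n, k, eta, sigma, n_steps) are illustrative assumptions, not values from the paper.

```python
# Illustrative sketch: label-noise SGD on the overparametrized linear model
# f(x) = <u*u - v*v, x> of Woodworth et al. (2020). Fresh label noise at each
# step keeps the iterate drifting along the zero-loss manifold; the paper's
# framework predicts this drift acts like an implicit sharpness regularizer
# and recovers the sparse ground truth. Hyperparameters are arbitrary choices.
import numpy as np

rng = np.random.default_rng(0)

d, n = 100, 20            # ambient dimension, number of samples
k = 2                     # sparsity of the ground-truth signal
eta, sigma = 1e-2, 0.5    # learning rate, label-noise standard deviation
n_steps = 200_000

# k-sparse ground truth and Gaussian data
w_star = np.zeros(d)
w_star[:k] = 1.0
X = rng.standard_normal((n, d)) / np.sqrt(d)
y = X @ w_star

# large ("kernel-regime") initialization, as in the abstract's application
alpha = 1.0
u = alpha * np.ones(d)
v = alpha * np.ones(d)

for step in range(n_steps):
    i = rng.integers(n)                              # pick one data point
    w = u * u - v * v                                # effective predictor
    noisy_y = y[i] + sigma * rng.standard_normal()   # fresh label noise
    r = X[i] @ w - noisy_y                           # residual on noisy label
    grad_w = r * X[i]                                # gradient w.r.t. w
    u -= eta * 2 * u * grad_w                        # chain rule: dw/du = 2u
    v += eta * 2 * v * grad_w                        # chain rule: dw/dv = -2v

w = u * u - v * v
print("train loss (clean labels):", np.mean((X @ w - y) ** 2))
print("largest-|w| coordinates:", np.argsort(-np.abs(w))[:k])
```

In line with the abstract's claim, a run of this sketch with n on the order of κ ln d samples typically identifies the support of w_star, whereas plain GD from the same large initialization stays in the kernel regime; the exact constants here are arbitrary and chosen only for illustration.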

Original language: English (US)
State: Published - 2022
Externally published: Yes
Event: 10th International Conference on Learning Representations, ICLR 2022 - Virtual, Online
Duration: Apr 25 2022 – Apr 29 2022


Conference: 10th International Conference on Learning Representations, ICLR 2022
City: Virtual, Online

All Science Journal Classification (ASJC) codes

  • Language and Linguistics
  • Computer Science Applications
  • Education
  • Linguistics and Language


