What Happens after SGD Reaches Zero Loss? A Mathematical Framework

Zhiyuan Li, Tianhao Wang, Sanjeev Arora

Research output: Contribution to conference › Paper › peer-review



Understanding the implicit bias of Stochastic Gradient Descent (SGD) is one of the key challenges in deep learning, especially for overparametrized models, where the local minimizers of the loss function L can form a manifold. Intuitively, with a sufficiently small learning rate η, SGD tracks Gradient Descent (GD) until it gets close to such a manifold, where the gradient noise prevents further convergence. In this regime, Blanc et al. (2020) proved that SGD with label noise locally decreases a regularizer-like term, the sharpness of the loss, tr[∇²L]. The current paper gives a general framework for such analysis by adapting ideas from Katzenberger (1991). It allows, in principle, a complete characterization of the regularization effect of SGD around such a manifold, i.e., the "implicit bias", using a stochastic differential equation (SDE) describing the limiting dynamics of the parameters, which is determined jointly by the loss function and the noise covariance. This yields some new results: (1) a global analysis of the implicit bias valid for η^{-2} steps, in contrast to the local analysis of Blanc et al. (2020) that is only valid for η^{-1.6} steps, and (2) allowing arbitrary noise covariance. As an application, we show that with arbitrarily large initialization, label noise SGD can always escape the kernel regime and only requires O(κ ln d) samples for learning a κ-sparse overparametrized linear model in ℝ^d (Woodworth et al., 2020), while GD initialized in the kernel regime requires Ω(d) samples. This upper bound is minimax optimal and improves the previous Õ(κ²) upper bound (HaoChen et al., 2020).
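To make the sparse-recovery application concrete, here is a minimal numerical sketch (not from the paper) of label-noise SGD on the quadratically overparametrized linear model of Woodworth et al. (2020), where the prediction is ⟨u⊙u − v⊙v, x⟩ and zero-loss interpolants form a manifold. All hyperparameters (d, n, k, eta, sigma, n_steps) are illustrative assumptions, not values from the paper.

```python
# Illustrative sketch: label-noise SGD on the overparametrized linear model
# f(x) = <u*u - v*v, x> of Woodworth et al. (2020). Fresh label noise at each
# step keeps the iterate drifting along the zero-loss manifold; the paper's
# framework predicts this drift acts like an implicit sharpness regularizer
# and recovers the sparse ground truth. Hyperparameters are arbitrary choices.
import numpy as np

rng = np.random.default_rng(0)

d, n = 100, 20            # ambient dimension, number of samples
k = 2                     # sparsity of the ground-truth signal
eta, sigma = 1e-2, 0.5    # learning rate, label-noise standard deviation
n_steps = 200_000

# k-sparse ground truth and Gaussian data
w_star = np.zeros(d)
w_star[:k] = 1.0
X = rng.standard_normal((n, d)) / np.sqrt(d)
y = X @ w_star

# large ("kernel-regime") initialization, as in the abstract's application
alpha = 1.0
u = alpha * np.ones(d)
v = alpha * np.ones(d)

for step in range(n_steps):
    i = rng.integers(n)                              # pick one data point
    w = u * u - v * v                                # effective predictor
    noisy_y = y[i] + sigma * rng.standard_normal()   # fresh label noise
    r = X[i] @ w - noisy_y                           # residual on noisy label
    grad_w = r * X[i]                                # gradient w.r.t. w
    u -= eta * 2 * u * grad_w                        # chain rule: dw/du = 2u
    v += eta * 2 * v * grad_w                        # chain rule: dw/dv = -2v

w = u * u - v * v
print("train loss (clean labels):", np.mean((X @ w - y) ** 2))
print("largest-|w| coordinates:", np.argsort(-np.abs(w))[:k])
```

In line with the abstract's claim, a run of this sketch with n on the order of κ ln d samples typically identifies the support of w_star, whereas plain GD from the same large initialization stays in the kernel regime; the exact constants here are arbitrary and chosen only for illustration.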

Original language: English (US)
State: Published - 2022
Externally published: Yes
Event: 10th International Conference on Learning Representations, ICLR 2022 - Virtual, Online
Duration: Apr 25 2022 – Apr 29 2022


Conference: 10th International Conference on Learning Representations, ICLR 2022
City: Virtual, Online

All Science Journal Classification (ASJC) codes

  • Language and Linguistics
  • Computer Science Applications
  • Education
  • Linguistics and Language


