On the Validity of Modeling SGD with Stochastic Differential Equations (SDEs)

Zhiyuan Li, Sadhika Malladi, Sanjeev Arora

Research output: Chapter in Book/Report/Conference proceeding (Conference contribution)

Abstract

It is generally recognized that finite learning rate (LR), in contrast to infinitesimal LR, is important for good generalization in real-life deep nets. Most attempted explanations propose approximating finite-LR SGD with Itô Stochastic Differential Equations (SDEs), but formal justification for this approximation (e.g., (Li et al., 2019a)) only applies to SGD with tiny LR. Experimental verification of the approximation appears computationally infeasible. The current paper clarifies the picture with the following contributions: (a) An efficient simulation algorithm SVAG that provably converges to the conventionally used Itô SDE approximation. (b) A theoretically motivated testable necessary condition for the SDE approximation and its most famous implication, the linear scaling rule (Goyal et al., 2017), to hold. (c) Experiments using this simulation to demonstrate that the previously proposed SDE approximation can meaningfully capture the training and generalization properties of common deep nets.
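
For orientation, the approximation the abstract refers to can be sketched as follows. This is a minimal illustration in commonly used notation, not text from the record: the symbols $\eta$ (learning rate), $\gamma_k$ (minibatch at step $k$), $\Sigma$ (covariance of the minibatch gradient noise), $W_t$ (a Wiener process), and $\kappa$ (a common scaling factor) are assumptions introduced here, and the last line assumes the noise covariance scales inversely with batch size.

```latex
\begin{align*}
x_{k+1} &= x_k - \eta\,\nabla L_{\gamma_k}(x_k)
  && \text{(SGD step on minibatch } \gamma_k\text{)} \\
dX_t &= -\nabla L(X_t)\,dt + \bigl(\eta\,\Sigma(X_t)\bigr)^{1/2}\,dW_t
  && \text{(It\^o SDE approximation, } t \approx k\eta\text{)} \\
(\kappa\eta)\cdot\frac{\Sigma(X_t)}{\kappa} &= \eta\,\Sigma(X_t)
  && \text{(batch size and LR both scaled by } \kappa\text{)}
\end{align*}
```

Under these assumptions, the diffusion term depends on the learning rate and batch size only through the product $\eta\,\Sigma$, which is one way to read the linear scaling rule: multiplying both by the same factor $\kappa$ leaves the SDE unchanged.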

Original language: English (US)
Title of host publication: Advances in Neural Information Processing Systems 34 - 35th Conference on Neural Information Processing Systems, NeurIPS 2021
Editors: Marc'Aurelio Ranzato, Alina Beygelzimer, Yann Dauphin, Percy S. Liang, Jenn Wortman Vaughan
Publisher: Neural information processing systems foundation
Pages: 12712-12725
Number of pages: 14
ISBN (Electronic): 9781713845393
State: Published - 2021
Event: 35th Conference on Neural Information Processing Systems, NeurIPS 2021 - Virtual, Online
Duration: Dec 6 2021 to Dec 14 2021

Publication series

Name: Advances in Neural Information Processing Systems
Volume: 16
ISSN (Print): 1049-5258

Conference

Conference: 35th Conference on Neural Information Processing Systems, NeurIPS 2021
City: Virtual, Online
Period: 12/6/21 to 12/14/21

All Science Journal Classification (ASJC) codes

  • Computer Networks and Communications
  • Information Systems
  • Signal Processing
