Theoretical analysis of auto rate-tuning by batch normalization

Sanjeev Arora, Kaifeng Lyu, Zhiyuan Li

Research output: Contribution to conference > Paper > peer-review

Abstract

Batch Normalization (BN) has become a cornerstone of deep learning across diverse architectures, appearing to help optimization as well as generalization. While the idea makes intuitive sense, theoretical analysis of its effectiveness has been lacking. Here theoretical support is provided for one of its conjectured properties, namely, the ability to allow gradient descent to succeed with less tuning of learning rates. It is shown that even if we fix the learning rate of scale-invariant parameters (e.g., weights of each layer with BN) to a constant (say, 0.3), gradient descent still approaches a stationary point (i.e., a solution where gradient is zero) in the rate of T^{-1/2} in T iterations, asymptotically matching the best bound for gradient descent with well-tuned learning rates. A similar result with convergence rate T^{-1/4} is also shown for stochastic gradient descent.
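
The mechanism behind the abstract's claim can be illustrated with a small sketch. It is not the paper's experiment: the loss, the `TARGET` vector, the starting point, and the helper names are all hypothetical choices made for illustration. For a scale-invariant loss, the gradient is orthogonal to the weights, so each plain gradient step with a fixed learning rate can only increase the weight norm, and the step taken on the direction of the weights then behaves roughly like lr / ||w||^2, which shrinks without any manual tuning.

```python
import math

# Hypothetical unit vector defining the toy objective (not from the paper).
TARGET = [0.6, 0.8]

def dot(u, v):
    return sum(x * y for x, y in zip(u, v))

def norm(u):
    return math.sqrt(dot(u, u))

def loss(w):
    """Toy scale-invariant loss: depends only on the direction w / ||w||."""
    return 1.0 - dot(w, TARGET) / norm(w)

def grad(w):
    """Analytic gradient of `loss`; orthogonal to w by scale invariance."""
    n = norm(w)
    c = dot(w, TARGET)
    return [-(a / n - c * x / n ** 3) for x, a in zip(w, TARGET)]

def run_gd(w0, lr=0.3, steps=200):
    """Gradient descent with a fixed learning rate; records (||w||, loss)."""
    w = list(w0)
    history = [(norm(w), loss(w))]
    for _ in range(steps):
        g = grad(w)
        w = [x - lr * gx for x, gx in zip(w, g)]
        history.append((norm(w), loss(w)))
    return history

# Assumed starting point, chosen only for illustration.
history = run_gd([1.0, -0.5])
print("initial ||w||, loss:", history[0])
print("final   ||w||, loss:", history[-1])
```

In this toy setting the weight norm never decreases while the loss falls toward zero, which is the qualitative behavior the abstract attributes to scale-invariant parameters. The sketch says nothing about the T^{-1/2} and T^{-1/4} rates themselves, which are established in the paper.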

Original language: English (US)
State: Published - Jan 1 2019
Event: 7th International Conference on Learning Representations, ICLR 2019 - New Orleans, United States
Duration: May 6 2019 to May 9 2019

Conference

Conference: 7th International Conference on Learning Representations, ICLR 2019
Country/Territory: United States
City: New Orleans
Period: 5/6/19 to 5/9/19

All Science Journal Classification (ASJC) codes

  • Education
  • Computer Science Applications
  • Linguistics and Language
  • Language and Linguistics
