Abstract
We show that unconverged stochastic gradient descent can be interpreted as sampling from a nonparametric approximate posterior distribution. This distribution is implicitly defined by the transformation of an initial distribution through a sequence of optimization steps. By tracking the change in entropy of this distribution during optimization, we give a scalable, unbiased estimate of a variational lower bound on the log marginal likelihood. This bound can be used to optimize hyperparameters in place of cross-validation. This Bayesian interpretation of SGD also suggests new overfitting-resistant optimization procedures, and gives a theoretical foundation for early stopping and ensembling. We investigate the properties of this marginal likelihood estimator on neural network models.
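To make the entropy-tracking idea concrete, the following is a minimal sketch of a single-sample estimate of the bound on a toy Bayesian linear regression model. The model, step size, sample counts, and the use of an exact (constant) Hessian to update the entropy are illustrative assumptions for this sketch, not the paper's implementation; the paper's estimator is scalable and does not form the Hessian explicitly.

```python
# Sketch (assumed setup): track the entropy of the SGD-induced distribution
# by accumulating log|det(Jacobian)| of each update, then form a one-sample
# estimate of the lower bound  log p(D) >= E_q[log p(theta, D)] + H[q].
import numpy as np

rng = np.random.default_rng(0)

# Toy Bayesian linear regression: y = X w + noise, Gaussian prior on w.
N, D = 50, 5
X = rng.normal(size=(N, D))
w_true = rng.normal(size=D)
y = X @ w_true + 0.5 * rng.normal(size=N)
noise_var, prior_var = 0.25, 1.0

def log_joint(w):
    """log p(y, w) = Gaussian log-likelihood + Gaussian log-prior."""
    ll = -0.5 * np.sum((y - X @ w) ** 2) / noise_var \
         - 0.5 * N * np.log(2 * np.pi * noise_var)
    lp = -0.5 * np.sum(w ** 2) / prior_var \
         - 0.5 * D * np.log(2 * np.pi * prior_var)
    return ll + lp

def grad_neg_log_joint(w):
    """Gradient of the negative log joint (the SGD objective here)."""
    return -(X.T @ (y - X @ w)) / noise_var + w / prior_var

# For this model the Hessian of the negative log joint is constant.
H = X.T @ X / noise_var + np.eye(D) / prior_var

# Initial distribution q_0 = N(0, s0^2 I); its entropy is known in closed form.
s0 = 1.0
w = s0 * rng.normal(size=D)          # one sample from q_0
entropy = 0.5 * D * np.log(2 * np.pi * np.e * s0 ** 2)

eta, T = 1e-3, 200                   # step size and number of (early-stopped) steps
for _ in range(T):
    # Each gradient step w -> w - eta * grad(w) is a deterministic map whose
    # Jacobian is I - eta * H, so the entropy changes by log|det(I - eta H)|.
    entropy += np.linalg.slogdet(np.eye(D) - eta * H)[1]
    w = w - eta * grad_neg_log_joint(w)

# One-sample Monte Carlo estimate of the variational lower bound:
#   log p(D) >= E_q[log p(y, w)] + H[q]  ~  log_joint(w_T) + entropy.
bound = log_joint(w) + entropy
print("lower-bound estimate of log p(D):", bound)
```

Averaging this estimate over many independent initializations (and, in the paper's setting, differentiating it with respect to hyperparameters) is what allows the bound to be used for model selection without cross-validation.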
| Original language | English (US) |
|---|---|
| Pages | 1070-1077 |
| Number of pages | 8 |
| State | Published - 2016 |
| Externally published | Yes |
| Event | 19th International Conference on Artificial Intelligence and Statistics, AISTATS 2016 |
| Location | Cadiz, Spain |
| Duration | May 9 2016 → May 11 2016 |
Conference
| Conference | 19th International Conference on Artificial Intelligence and Statistics, AISTATS 2016 |
|---|---|
| Country/Territory | Spain |
| City | Cadiz |
| Period | 5/9/16 → 5/11/16 |
All Science Journal Classification (ASJC) codes
- Artificial Intelligence
- Statistics and Probability