TY - JOUR

T1 - Bayesian interpolation with deep linear networks

AU - Hanin, Boris

AU - Zlokapa, Alexander

N1 - Funding Information:
ACKNOWLEDGMENTS. This work started at the 2022 Summer School on the Statistical Physics of Machine Learning held at École de Physique des Houches. We are grateful for the wonderful atmosphere at the school and would like to express our appreciation to the session organizers Florent Krzakala and Lenka Zdeborová as well as to Haim Sompolinsky for his series of lectures on Bayesian analysis of deep linear networks. We further thank Edward George for pointing out the connection between our work and the deep Gaussian process literature. Finally, we thank Matias Cattaneo, Isaac Chuang, David Dunson, Jianqing Fan, Aram Harrow, Jason Klusowski, Cengiz Pehlevan, Veronika Rockova, and Jacob Zavatone-Veth for their feedback and suggestions. B.H. is supported by NSF grants DMS-2143754, DMS-1855684, and DMS-2133806. A.Z. is supported by the Hertz Foundation, and by the DoD NDSEG. We also thank two anonymous reviewers for improving aspects of the exposition and for pointing out a range of typos in the original manuscript.
Publisher Copyright:
Copyright © 2023 the Author(s).

PY - 2023/6

Y1 - 2023/6

AB - Characterizing how neural network depth, width, and dataset size jointly impact model quality is a central problem in deep learning theory. We give here a complete solution in the special case of linear networks with output dimension one trained using zero noise Bayesian inference with Gaussian weight priors and mean squared error as a negative log-likelihood. For any training dataset, network depth, and hidden layer widths, we find nonasymptotic expressions for the predictive posterior and Bayesian model evidence in terms of Meijer-G functions, a class of meromorphic special functions of a single complex variable. Through asymptotic expansions of these Meijer-G functions, a rich new picture of the joint role of depth, width, and dataset size emerges. We show that linear networks make provably optimal predictions at infinite depth: the posterior of infinitely deep linear networks with data-agnostic priors is the same as that of shallow networks with evidence-maximizing data-dependent priors. This yields a principled reason to prefer deeper networks when priors are forced to be data-agnostic. Moreover, we show that with data-agnostic priors, Bayesian model evidence in wide linear networks is maximized at infinite depth, elucidating the salutary role of increased depth for model selection. Underpinning our results is an emergent notion of effective depth, given by the number of hidden layers times the number of data points divided by the network width; this determines the structure of the posterior in the large-data limit.

KW - Bayesian inference

KW - deep learning

KW - linear networks

KW - neural networks

UR - http://www.scopus.com/inward/record.url?scp=85160637654&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=85160637654&partnerID=8YFLogxK

U2 - 10.1073/pnas.2301345120

DO - 10.1073/pnas.2301345120

M3 - Article

C2 - 37252994

AN - SCOPUS:85160637654

SN - 0027-8424

VL - 120

JO - Proceedings of the National Academy of Sciences of the United States of America

JF - Proceedings of the National Academy of Sciences of the United States of America

IS - 23

M1 - e2301345120

ER -