TY - GEN
T1 - Shampoo: Preconditioned Stochastic Tensor Optimization
T2 - 35th International Conference on Machine Learning, ICML 2018
AU - Gupta, Vineet
AU - Koren, Tomer
AU - Singer, Yoram
N1 - Publisher Copyright:
© 2018 35th International Conference on Machine Learning, ICML 2018. All rights reserved.
PY - 2018
Y1 - 2018
AB - Preconditioned gradient methods are among the most general and powerful tools in optimization. However, preconditioning requires storing and manipulating prohibitively large matrices. We describe and analyze a new structure-aware preconditioning algorithm, called Shampoo, for stochastic optimization over tensor spaces. Shampoo maintains a set of preconditioning matrices, each of which operates on a single dimension, contracting over the remaining dimensions. We establish convergence guarantees in the stochastic convex setting, the proof of which builds upon matrix trace inequalities. Our experiments with state-of-the-art deep learning models show that Shampoo is capable of converging considerably faster than commonly used optimizers. Surprisingly, although it involves a more complex update rule, Shampoo's runtime per step is comparable in practice to that of simple gradient methods such as SGD, AdaGrad, and Adam.
UR - http://www.scopus.com/inward/record.url?scp=85057324679&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85057324679&partnerID=8YFLogxK
M3 - Conference contribution
AN - SCOPUS:85057324679
T3 - 35th International Conference on Machine Learning, ICML 2018
SP - 2956
EP - 2964
BT - 35th International Conference on Machine Learning, ICML 2018
A2 - Dy, Jennifer
A2 - Krause, Andreas
PB - International Machine Learning Society (IMLS)
Y2 - 10 July 2018 through 15 July 2018
ER -
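
Note: the abstract above describes per-dimension preconditioners that contract over the remaining dimensions. Below is a minimal NumPy sketch of the matrix (order-2) case of that update, in which left and right gradient statistics are accumulated and their -1/4 powers precondition the gradient. The helper names, epsilon initialization, and learning rate are illustrative assumptions, not the paper's reference implementation.

    import numpy as np

    def sym_matrix_power(M, p):
        # Power of a symmetric PSD matrix via eigendecomposition (eigenvalues floored for stability).
        w, V = np.linalg.eigh(M)
        return (V * np.maximum(w, 1e-12) ** p) @ V.T

    def shampoo_matrix_step(W, G, L, R, lr=1e-3):
        # Accumulate left/right gradient statistics, then apply the preconditioned update
        # W <- W - lr * L^{-1/4} @ G @ R^{-1/4}  (matrix case of the tensor update).
        L = L + G @ G.T
        R = R + G.T @ G
        W = W - lr * sym_matrix_power(L, -0.25) @ G @ sym_matrix_power(R, -0.25)
        return W, L, R

    # Illustrative usage; dimensions, epsilon, and learning rate are arbitrary choices.
    m, n, eps = 4, 3, 1e-4
    rng = np.random.default_rng(0)
    W = rng.standard_normal((m, n))
    L, R = eps * np.eye(m), eps * np.eye(n)    # epsilon * I initialization of the statistics
    G = rng.standard_normal((m, n))            # stand-in for a stochastic gradient at W
    W, L, R = shampoo_matrix_step(W, G, L, R)

As the abstract notes, each preconditioner acts on a single dimension of the parameter tensor, so the stored matrices are m-by-m and n-by-n rather than mn-by-mn, which is what keeps the per-step cost comparable to simple gradient methods.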