TY - GEN
T1 - Optimizing N-dimensional, winograd-based convolution for manycore CPUs
AU - Jia, Zhen
AU - Zlateski, Aleksandar
AU - Durand, Fredo
AU - Li, Kai
N1 - Publisher Copyright:
© 2018 Copyright held by the owner/author(s).
PY - 2018/2/10
Y1 - 2018/2/10
N2 - Recent work on Winograd-based convolution allows for a great reduction of computational complexity, but existing implementations are limited to 2D data and a single kernel size of 3 by 3. They can achieve only slightly better, and often worse, performance than well-optimized direct convolution implementations. We propose and implement an algorithm for N-dimensional Winograd-based convolution that allows arbitrary kernel sizes and is optimized for manycore CPUs. Our algorithm achieves high hardware utilization through a series of optimizations. Our experiments show that on modern ConvNets, our optimized implementation is on average more than 3×, and sometimes 8×, faster than other state-of-the-art CPU implementations on an Intel Xeon Phi manycore processor. Moreover, our implementation on the Xeon Phi achieves competitive performance for 2D ConvNets and superior performance for 3D ConvNets, compared with the best GPU implementations.
AB - Recent work on Winograd-based convolution allows for a great reduction of computational complexity, but existing implementations are limited to 2D data and a single kernel size of 3 by 3. They can achieve only slightly better, and often worse, performance than well-optimized direct convolution implementations. We propose and implement an algorithm for N-dimensional Winograd-based convolution that allows arbitrary kernel sizes and is optimized for manycore CPUs. Our algorithm achieves high hardware utilization through a series of optimizations. Our experiments show that on modern ConvNets, our optimized implementation is on average more than 3×, and sometimes 8×, faster than other state-of-the-art CPU implementations on an Intel Xeon Phi manycore processor. Moreover, our implementation on the Xeon Phi achieves competitive performance for 2D ConvNets and superior performance for 3D ConvNets, compared with the best GPU implementations.
KW - Convolution
KW - Parallelization
KW - Vectorization
KW - Winograd
UR - http://www.scopus.com/inward/record.url?scp=85044266925&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85044266925&partnerID=8YFLogxK
U2 - 10.1145/3178487.3178496
DO - 10.1145/3178487.3178496
M3 - Conference contribution
AN - SCOPUS:85044266925
T3 - Proceedings of the ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, PPOPP
SP - 109
EP - 123
BT - PPoPP 2018 - Proceedings of the 23rd Principles and Practice of Parallel Programming
PB - Association for Computing Machinery
T2 - 23rd ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, PPoPP 2018
Y2 - 24 February 2018 through 28 February 2018
ER -