TY - GEN
T1 - Compile-time optimized and statically scheduled N-D ConvNet primitives for multi-core and many-core (Xeon Phi) CPUs
AU - Zlateski, Aleksandar
AU - Seung, Hyunjune Sebastian
N1 - Publisher Copyright:
© 2017 Association for Computing Machinery.
PY - 2017/6/14
Y1 - 2017/6/14
AB - Convolutional networks (ConvNets), largely running on GPUs, have become the most popular approach to computer vision. Now that CPUs are closing the FLOPS gap with GPUs, efficient CPU algorithms are becoming more important. We propose a novel parallel and vectorized algorithm for N-D convolutional layers. Our goal is to achieve high utilization of available FLOPS, independent of ConvNet architecture and CPU properties (e.g. vector units, number of cores, cache sizes). Our approach is to rely on the compiler to optimize code, thereby removing the need for hand-tuning. We assume that the network architecture is known at compile-time. Our serial algorithm divides the computation into small sub-tasks designed to be easily optimized by the compiler for a specific CPU. Sub-tasks are executed in an order that maximizes cache reuse. We parallelize the algorithm by statically scheduling tasks to be executed by each core. Our novel compile-time recursive scheduling algorithm is capable of dividing the computation evenly between an arbitrary number of cores, regardless of ConvNet architecture. It introduces zero runtime overhead and minimal synchronization overhead. We demonstrate that our serial primitives efficiently utilize available FLOPS (75-95%), while our parallel algorithm attains 50-90% utilization on 64+ core machines. Our algorithm is competitive with the fastest CPU implementation to date (MKL2017) for 2D object recognition, and performs much better for image segmentation. For 3D ConvNets we demonstrate comparable performance to the latest GPU hardware and software even though the CPU is only capable of half the FLOPS of the GPU.
UR - http://www.scopus.com/inward/record.url?scp=85023750833&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85023750833&partnerID=8YFLogxK
U2 - 10.1145/3079079.3079081
DO - 10.1145/3079079.3079081
M3 - Conference contribution
AN - SCOPUS:85023750833
T3 - Proceedings of the International Conference on Supercomputing
BT - ICS 2017
PB - Association for Computing Machinery
T2 - 31st ACM International Conference on Supercomputing, ICS 2017
Y2 - 13 June 2017 through 16 June 2017
ER -