Compile-time optimized and statically scheduled N-D ConvNet primitives for multi-core and many-core (Xeon Phi) CPUs

Aleksandar Zlateski, Hyunjune Sebastian Seung

Research output: Chapter in Book/Report/Conference proceeding > Conference contribution

4 Scopus citations

Abstract

Convolutional networks (ConvNets), largely running on GPUs, have become the most popular approach to computer vision. Now that CPUs are closing the FLOPS gap with GPUs, efficient CPU algorithms are becoming more important. We propose a novel parallel and vectorized algorithm for N-D convolutional layers. Our goal is to achieve high utilization of available FLOPS, independent of ConvNet architecture and CPU properties (e.g. vector units, number of cores, cache sizes). Our approach is to rely on the compiler to optimize code, thereby removing the need for hand-tuning. We assume that the network architecture is known at compile-time. Our serial algorithm divides the computation into small sub-tasks designed to be easily optimized by the compiler for a specific CPU. Sub-tasks are executed in an order that maximizes cache reuse. We parallelize the algorithm by statically scheduling tasks to be executed by each core. Our novel compile-time recursive scheduling algorithm is capable of dividing the computation evenly between an arbitrary number of cores, regardless of ConvNet architecture. It introduces zero runtime overhead and minimal synchronization overhead. We demonstrate that our serial primitives efficiently utilize available FLOPS (75-95%), while our parallel algorithm attains 50-90% utilization on 64+ core machines. Our algorithm is competitive with the fastest CPU implementation to date (MKL2017) for 2D object recognition, and performs much better for image segmentation. For 3D ConvNets we demonstrate comparable performance to the latest GPU hardware and software even though the CPU is only capable of half the FLOPS of the GPU.
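
The idea of fixing the work division at compile time can be illustrated with a minimal C++ sketch. This is not the authors' scheduling algorithm, which the abstract does not specify; it only assumes that the amount of work and the number of cores are compile-time constants, and it recursively halves the set of cores while splitting the work in proportion. The names `Range`, `Schedule`, `split`, `make_schedule`, and `kSchedule` are hypothetical and chosen for illustration.

```cpp
#include <cassert>
#include <cstddef>

// Half-open range [begin, end) of work items.
// Hypothetical unit of work: e.g. one output image of a convolutional layer.
struct Range {
    std::size_t begin;
    std::size_t end;
    constexpr std::size_t size() const { return end - begin; }
};

// One range per core, stored in a plain array so the whole
// schedule can be a compile-time constant.
template <std::size_t Cores>
struct Schedule {
    Range per_core[Cores];
};

// Recursively assign `work` to cores [first_core, first_core + n_cores).
// The cores are split in half and the work is split in proportion
// to the number of cores on each side.
template <std::size_t Cores>
constexpr void split(Schedule<Cores>& out, Range work,
                     std::size_t first_core, std::size_t n_cores) {
    if (n_cores == 1) {
        out.per_core[first_core] = work;
        return;
    }
    const std::size_t left_cores = n_cores / 2;
    const std::size_t left_work = work.size() * left_cores / n_cores;
    const std::size_t mid = work.begin + left_work;
    split(out, Range{work.begin, mid}, first_core, left_cores);
    split(out, Range{mid, work.end}, first_core + left_cores,
          n_cores - left_cores);
}

// Build the schedule for `Items` work items on `Cores` cores.
// Evaluated entirely by the compiler when used in a constexpr context.
template <std::size_t Items, std::size_t Cores>
constexpr Schedule<Cores> make_schedule() {
    static_assert(Cores > 0, "need at least one core");
    Schedule<Cores> out{};
    split(out, Range{0, Items}, 0, Cores);
    return out;
}

// Example: 10 work items on 4 cores, computed at compile time.
constexpr auto kSchedule = make_schedule<10, 4>();
static_assert(kSchedule.per_core[0].begin == 0, "schedule starts at item 0");
static_assert(kSchedule.per_core[3].end == 10, "schedule covers all items");
```

In this sketch each core would read its own entry, for example `kSchedule.per_core[core_id]`, and loop over that range without consulting a runtime scheduler. For 10 items on 4 cores the ranges are [0, 2), [2, 5), [5, 7), and [7, 10), so per-core loads differ by at most one item.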

Original language: English (US)
Title of host publication: ICS 2017
Subtitle of host publication: International Conference on Supercomputing
Publisher: Association for Computing Machinery
ISBN (Electronic): 9781450350204
State: Published - Jun 14 2017
Event: 31st ACM International Conference on Supercomputing, ICS 2017 - Chicago, United States
Duration: Jun 13 2017 to Jun 16 2017

Publication series

Name: Proceedings of the International Conference on Supercomputing
Volume: Part F128411

Other

Other: 31st ACM International Conference on Supercomputing, ICS 2017
Country: United States
City: Chicago
Period: 6/13/17 to 6/16/17

All Science Journal Classification (ASJC) codes

  • Computer Science(all)
