Convolutional networks (ConvNets), largely running on GPUs, have become the most popular approach to computer vision. Now that CPUs are closing the FLOPS gap with GPUs, efficient CPU algorithms are becoming more important. We propose a novel parallel and vectorized algorithm for N-D convolutional layers. Our goal is to achieve high utilization of available FLOPS, independent of ConvNet architecture and CPU properties (e.g. vector units, number of cores, cache sizes). Our approach is to rely on the compiler to optimize code, thereby removing the need for hand-tuning. We assume that the network architecture is known at compile-time. Our serial algorithm divides the computation into small sub-tasks designed to be easily optimized by the compiler for a specific CPU. Sub-tasks are executed in an order that maximizes cache reuse.We parallelize the algorithm by statically scheduling tasks to be executed by each core. Our novel compile-time recursive scheduling algorithm is capable of dividing the computation evenly between an arbitrary number of cores, regardless of ConvNet architecture. It introduces zero runtime overhead and minimal synchronization overhead. We demonstrate that our serial primitives efficiently utilize available FLOPS (75-95%), while our parallel algorithm attains 50-90% utilization on 64+ core machines. Our algorithm is competitive with the fastest CPU implementation to date (MKL2017) for 2D object recognition, and performs much better for image segmentation. For 3D ConvNets we demonstrate comparable performance to the latest GPU hardware and software even though the CPU is only capable of half the FLOPS of the GPU.