Optimizing N-dimensional, Winograd-based convolution for manycore CPUs

Zhen Jia, Aleksandar Zlateski, Fredo Durand, Kai Li

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

1 Citation (Scopus)

Abstract

Recent work on Winograd-based convolution allows for a great reduction of computational complexity, but existing implementations are limited to 2D data and a single kernel size of 3 by 3. They achieve only slightly better, and often worse, performance than better-optimized direct convolution implementations. We propose and implement an algorithm for N-dimensional Winograd-based convolution that allows arbitrary kernel sizes and is optimized for manycore CPUs. Our algorithm achieves high hardware utilization through a series of optimizations. Our experiments show that on modern ConvNets, our optimized implementation is on average more than 3×, and sometimes 8×, faster than other state-of-the-art CPU implementations on an Intel Xeon Phi manycore processor. Moreover, our implementation on the Xeon Phi achieves competitive performance for 2D ConvNets and superior performance for 3D ConvNets, compared with the best GPU implementations.
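
To make the "reduction of computational complexity" concrete, here is a minimal sketch of the textbook 1D Winograd minimal-filtering algorithm F(2, 3), which computes two outputs of a 3-tap filter with 4 multiplications instead of the 6 a direct computation needs. It is not the paper's N-dimensional, manycore implementation. The function names `direct_f23` and `winograd_f23` are hypothetical and chosen only for this illustration.

```python
# Illustrative only: the standard 1D Winograd algorithm F(2, 3).
# Names are hypothetical; this is not the implementation described in the paper.

def direct_f23(d, g):
    """Direct sliding-window computation of 2 outputs: 6 multiplications."""
    y0 = d[0] * g[0] + d[1] * g[1] + d[2] * g[2]
    y1 = d[1] * g[0] + d[2] * g[1] + d[3] * g[2]
    return (y0, y1)


def winograd_f23(d, g):
    """Same 2 outputs using 4 multiplications between data and filter terms."""
    # Filter transform. It depends only on the filter, so it can be
    # computed once and reused for every input tile.
    u0 = g[0]
    u1 = (g[0] + g[1] + g[2]) / 2
    u2 = (g[0] - g[1] + g[2]) / 2
    u3 = g[2]

    # Input transform: additions and subtractions only.
    v0 = d[0] - d[2]
    v1 = d[1] + d[2]
    v2 = d[2] - d[1]
    v3 = d[1] - d[3]

    # The 4 element-wise multiplications.
    m0 = v0 * u0
    m1 = v1 * u1
    m2 = v2 * u2
    m3 = v3 * u3

    # Output transform: additions and subtractions only.
    return (m0 + m1 + m2, m1 - m2 - m3)
```

Calling both functions with `d = [1.0, 2.0, 3.0, 4.0]` and `g = [1.0, 0.0, -1.0]` gives the same pair of outputs. Nesting the same transforms along each axis extends the idea to 2D and higher-dimensional tiles, which is the general setting the abstract refers to.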

Original language: English (US)
Title of host publication: PPoPP 2018 - Proceedings of the 23rd Principles and Practice of Parallel Programming
Publisher: Association for Computing Machinery
Pages: 109-123
Number of pages: 15
ISBN (Electronic): 9781450349826
DOI: https://doi.org/10.1145/3178487.3178496
State: Published - Feb 10 2018
Event: 23rd ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, PPoPP 2018 - Vienna, Austria
Duration: Feb 24 2018 to Feb 28 2018

Publication series

Name: Proceedings of the ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, PPOPP

Other

Other: 23rd ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, PPoPP 2018
Country: Austria
City: Vienna
Period: 2/24/18 to 2/28/18

Fingerprint

Convolution
Program processors
Computer hardware
Computational complexity
Experiments

All Science Journal Classification (ASJC) codes

  • Software

Cite this

Jia, Z., Zlateski, A., Durand, F., & Li, K. (2018). Optimizing N-dimensional, Winograd-based convolution for manycore CPUs. In PPoPP 2018 - Proceedings of the 23rd Principles and Practice of Parallel Programming (pp. 109-123). (Proceedings of the ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, PPOPP). Association for Computing Machinery. https://doi.org/10.1145/3178487.3178496
Jia, Zhen ; Zlateski, Aleksandar ; Durand, Fredo ; Li, Kai. / Optimizing N-dimensional, Winograd-based convolution for manycore CPUs. PPoPP 2018 - Proceedings of the 23rd Principles and Practice of Parallel Programming. Association for Computing Machinery, 2018. pp. 109-123 (Proceedings of the ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, PPOPP).
@inproceedings{6140bef97f114b2f896c429c74557568,
title = "Optimizing N-dimensional, Winograd-based convolution for manycore CPUs",
abstract = "Recent work on Winograd-based convolution allows for a great reduction of computational complexity, but existing implementations are limited to 2D data and a single kernel size of 3 by 3. They achieve only slightly better, and often worse, performance than better-optimized direct convolution implementations. We propose and implement an algorithm for N-dimensional Winograd-based convolution that allows arbitrary kernel sizes and is optimized for manycore CPUs. Our algorithm achieves high hardware utilization through a series of optimizations. Our experiments show that on modern ConvNets, our optimized implementation is on average more than 3×, and sometimes 8×, faster than other state-of-the-art CPU implementations on an Intel Xeon Phi manycore processor. Moreover, our implementation on the Xeon Phi achieves competitive performance for 2D ConvNets and superior performance for 3D ConvNets, compared with the best GPU implementations.",
author = "Zhen Jia and Aleksandar Zlateski and Fredo Durand and Kai Li",
year = "2018",
month = "2",
day = "10",
doi = "10.1145/3178487.3178496",
language = "English (US)",
series = "Proceedings of the ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, PPOPP",
publisher = "Association for Computing Machinery",
pages = "109--123",
booktitle = "PPoPP 2018 - Proceedings of the 23rd Principles and Practice of Parallel Programming",

}

Jia, Z, Zlateski, A, Durand, F & Li, K 2018, Optimizing N-dimensional, Winograd-based convolution for manycore CPUs. in PPoPP 2018 - Proceedings of the 23rd Principles and Practice of Parallel Programming. Proceedings of the ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, PPOPP, Association for Computing Machinery, pp. 109-123, 23rd ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, PPoPP 2018, Vienna, Austria, 2/24/18. https://doi.org/10.1145/3178487.3178496

Optimizing N-dimensional, Winograd-based convolution for manycore CPUs. / Jia, Zhen; Zlateski, Aleksandar; Durand, Fredo; Li, Kai.

PPoPP 2018 - Proceedings of the 23rd Principles and Practice of Parallel Programming. Association for Computing Machinery, 2018. p. 109-123 (Proceedings of the ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, PPOPP).

TY - GEN

T1 - Optimizing N-dimensional, Winograd-based convolution for manycore CPUs

AU - Jia, Zhen

AU - Zlateski, Aleksandar

AU - Durand, Fredo

AU - Li, Kai

PY - 2018/2/10

Y1 - 2018/2/10

AB - Recent work on Winograd-based convolution allows for a great reduction of computational complexity, but existing implementations are limited to 2D data and a single kernel size of 3 by 3. They achieve only slightly better, and often worse, performance than better-optimized direct convolution implementations. We propose and implement an algorithm for N-dimensional Winograd-based convolution that allows arbitrary kernel sizes and is optimized for manycore CPUs. Our algorithm achieves high hardware utilization through a series of optimizations. Our experiments show that on modern ConvNets, our optimized implementation is on average more than 3×, and sometimes 8×, faster than other state-of-the-art CPU implementations on an Intel Xeon Phi manycore processor. Moreover, our implementation on the Xeon Phi achieves competitive performance for 2D ConvNets and superior performance for 3D ConvNets, compared with the best GPU implementations.

UR - http://www.scopus.com/inward/record.url?scp=85044266925&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=85044266925&partnerID=8YFLogxK

U2 - 10.1145/3178487.3178496

DO - 10.1145/3178487.3178496

M3 - Conference contribution

T3 - Proceedings of the ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, PPOPP

SP - 109

EP - 123

BT - PPoPP 2018 - Proceedings of the 23rd Principles and Practice of Parallel Programming

PB - Association for Computing Machinery

ER -

Jia Z, Zlateski A, Durand F, Li K. Optimizing N-dimensional, Winograd-based convolution for manycore CPUs. In PPoPP 2018 - Proceedings of the 23rd Principles and Practice of Parallel Programming. Association for Computing Machinery. 2018. p. 109-123. (Proceedings of the ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, PPOPP). https://doi.org/10.1145/3178487.3178496