TY - GEN
T1 - ZNNi: Maximizing the Inference Throughput of 3D Convolutional Networks on CPUs and GPUs
T2 - 2016 International Conference for High Performance Computing, Networking, Storage and Analysis, SC 2016
AU - Zlateski, Aleksandar
AU - Lee, Kisuk
AU - Seung, Hyunjune Sebastian
PY - 2016/7/2
Y1 - 2016/7/2
AB - Sliding window convolutional networks (ConvNets) have become a popular approach to computer vision problems such as image segmentation and object detection and localization. Here we consider the parallelization of inference, i.e., the application of a previously trained ConvNet, with emphasis on 3D images. Our goal is to maximize throughput, defined as the number of output voxels computed per unit time. We propose CPU and GPU primitives for convolutional and pooling layers, which are combined to create CPU, GPU, and CPU-GPU inference algorithms. The primitives include convolution based on highly efficient padded and pruned FFTs. Our theoretical analyses and empirical tests reveal a number of interesting findings. For example, adding host RAM can be a more efficient way of increasing throughput than adding another GPU or more CPUs. Furthermore, our CPU-GPU algorithm can achieve greater throughput than the sum of CPU-only and GPU-only throughputs.
UR - http://www.scopus.com/inward/record.url?scp=85017258678&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85017258678&partnerID=8YFLogxK
U2 - 10.1109/SC.2016.72
DO - 10.1109/SC.2016.72
M3 - Conference contribution
T3 - International Conference for High Performance Computing, Networking, Storage and Analysis, SC
SP - 854
EP - 865
BT - Proceedings of SC 2016
PB - IEEE Computer Society
Y2 - 13 November 2016 through 18 November 2016
ER -