TY - GEN
T1 - Training distributed deep recurrent neural networks with mixed precision on GPU clusters
AU - Svyatkovskiy, Alexey
AU - Kates-Harbeck, Julian
AU - Tang, William
N1 - Publisher Copyright:
© 2017 Association for Computing Machinery.
PY - 2017/11/12
Y1 - 2017/11/12
N2 - In this paper, we evaluate training of deep recurrent neural networks with half-precision floats. We implement a distributed, data-parallel, synchronous training algorithm by integrating TensorFlow and CUDA-aware MPI, enabling execution across multiple GPU nodes and making use of high-speed interconnects. We introduce a learning rate schedule that facilitates neural network convergence at up to O(100) workers. Strong scaling tests performed on clusters of NVIDIA Pascal P100 GPUs show linear runtime and logarithmic communication time scaling for both single and mixed precision training modes. Performance is evaluated on a scientific dataset taken from the Joint European Torus (JET) tokamak, containing multi-modal time series of sensory measurements leading up to deleterious events called plasma disruptions, and on the benchmark Large Movie Review Dataset [11]. Half precision significantly reduces memory and network bandwidth requirements, allowing training of state-of-the-art models with over 70 million trainable parameters while achieving test set performance comparable to single precision.
AB - In this paper, we evaluate training of deep recurrent neural networks with half-precision floats. We implement a distributed, data-parallel, synchronous training algorithm by integrating TensorFlow and CUDA-aware MPI, enabling execution across multiple GPU nodes and making use of high-speed interconnects. We introduce a learning rate schedule that facilitates neural network convergence at up to O(100) workers. Strong scaling tests performed on clusters of NVIDIA Pascal P100 GPUs show linear runtime and logarithmic communication time scaling for both single and mixed precision training modes. Performance is evaluated on a scientific dataset taken from the Joint European Torus (JET) tokamak, containing multi-modal time series of sensory measurements leading up to deleterious events called plasma disruptions, and on the benchmark Large Movie Review Dataset [11]. Half precision significantly reduces memory and network bandwidth requirements, allowing training of state-of-the-art models with over 70 million trainable parameters while achieving test set performance comparable to single precision.
KW - Distributed computing
KW - Floating point precision
KW - Neural networks
UR - https://www.scopus.com/pages/publications/85058272490
UR - https://www.scopus.com/inward/citedby.url?scp=85058272490&partnerID=8YFLogxK
U2 - 10.1145/3146347.3146358
DO - 10.1145/3146347.3146358
M3 - Conference contribution
AN - SCOPUS:85058272490
T3 - Proceedings of MLHPC 2017: Machine Learning in HPC Environments - Held in conjunction with SC 2017: The International Conference for High Performance Computing, Networking, Storage and Analysis
BT - Proceedings of MLHPC 2017
PB - Association for Computing Machinery, Inc
T2 - 2017 Machine Learning in HPC Environments, MLHPC 2017
Y2 - 12 November 2017 through 17 November 2017
ER -