Training distributed deep recurrent neural networks with mixed precision on GPU clusters.

Alexey Svyatkovskiy, Julian Kates-Harbeck, William Tang

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

16 Scopus citations

Abstract

In this paper, we evaluate the training of deep recurrent neural networks with half-precision floats. We implement a distributed, data-parallel, synchronous training algorithm by integrating TensorFlow and CUDA-aware MPI, enabling execution across multiple GPU nodes and making use of high-speed interconnects. We introduce a learning rate schedule that facilitates neural network convergence at up to O(100) workers. Strong scaling tests performed on clusters of NVIDIA Pascal P100 GPUs show linear runtime scaling and logarithmic communication time scaling for both single- and mixed-precision training modes. Performance is evaluated on a scientific dataset taken from the Joint European Torus (JET) tokamak, containing multi-modal time series of sensory measurements leading up to deleterious events called plasma disruptions, and on the benchmark Large Movie Review Dataset [11]. Half precision significantly reduces memory and network bandwidth requirements, allowing training of state-of-the-art models with over 70 million trainable parameters while achieving test set performance comparable to single precision.
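
The abstract outlines a synchronous, data-parallel scheme in which half-precision gradients are exchanged among workers. The sketch below illustrates that general pattern only and is not the authors' implementation: it assumes mpi4py and NumPy in place of the paper's TensorFlow and CUDA-aware MPI integration, keeps an fp32 master copy of the weights, uses a synthetic fp16 gradient as a stand-in for backpropagation through an RNN, and applies a static loss scale. Gradients are promoted to fp32 for the reduction because standard MPI predefines no half-precision datatype; the bandwidth savings the paper reports come from communicating at reduced precision over its CUDA-aware path.

    # Minimal sketch of synchronous data-parallel SGD with mixed precision.
    # Hypothetical illustration only, not the paper's TensorFlow code;
    # assumes mpi4py and NumPy are available.
    import numpy as np
    from mpi4py import MPI

    comm = MPI.COMM_WORLD
    rank, size = comm.Get_rank(), comm.Get_size()

    dim = 1024
    rng = np.random.default_rng(seed=rank)      # each worker draws its own data shard
    master_w = np.zeros(dim, dtype=np.float32)  # fp32 master copy of the weights
    loss_scale = np.float16(1024.0)             # static loss scaling keeps fp16 gradients in range
    lr = 0.01

    for step in range(100):
        # Forward/backward pass in fp16. A synthetic gradient toward a
        # worker-local target stands in for backprop through an RNN.
        target = rng.standard_normal(dim).astype(np.float16)
        w16 = master_w.astype(np.float16)
        grad16 = (w16 - target) * loss_scale

        # Unscale and promote to fp32 before the reduction, since standard
        # MPI has no predefined half-precision datatype for MPI.SUM.
        grad32 = grad16.astype(np.float32) / float(loss_scale)

        # Synchronous all-reduce averages gradients across all workers,
        # keeping every replica's master weights identical after the update.
        summed = np.empty_like(grad32)
        comm.Allreduce(grad32, summed, op=MPI.SUM)
        master_w -= lr * (summed / size)

Launched as, for example, mpirun -np 16 python sketch.py, every rank applies the same averaged update each step; the constant lr here is where a schedule such as the one the paper introduces for O(100) workers would apply.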

Original language: English (US)
Title of host publication: Proceedings of MLHPC 2017
Subtitle of host publication: Machine Learning in HPC Environments - Held in conjunction with SC 2017: The International Conference for High Performance Computing, Networking, Storage and Analysis
Publisher: Association for Computing Machinery, Inc
ISBN (Electronic): 9781450351379
State: Published - Nov 12 2017
Externally published: Yes
Event: 2017 Machine Learning in HPC Environments, MLHPC 2017 - Denver, United States
Duration: Nov 12 2017 - Nov 17 2017

Publication series

Name: Proceedings of MLHPC 2017: Machine Learning in HPC Environments - Held in conjunction with SC 2017: The International Conference for High Performance Computing, Networking, Storage and Analysis

Conference

Conference: 2017 Machine Learning in HPC Environments, MLHPC 2017
Country/Territory: United States
City: Denver
Period: 11/12/17 - 11/17/17

All Science Journal Classification (ASJC) codes

  • Computer Networks and Communications
  • Artificial Intelligence

Keywords

  • Distributed computing
  • Floating point precision
  • Neural networks
