Towards Execution-Efficient LSTMs via Hardware-Guided Grow-and-Prune Paradigm

Hongxu Yin, Guoyang Chen, Yingmin Li, Shuai Che, Weifeng Zhang, Niraj K. Jha

Research output: Contribution to journal › Article › peer-review

Abstract

Long short-term memory (LSTM) applications need fast yet compact models. Neural network compression approaches have been promising for cutting down network complexity by skipping insignificant weights. However, current strategies remain hardware-agnostic, and network complexity reduction does not always translate into execution efficiency. We propose a hardware-guided symbiotic training methodology for compact, accurate, yet execution-efficient inference models. It is based on our observation that hardware may introduce substantial non-monotonic behavior, which we call the latency hysteresis effect, when network size is evaluated against latency. Leveraging the hardware-induced hysteresis effect and sparsity, we enable a symbiosis of model compactness and accuracy with execution efficiency. We have evaluated our approach on language modeling and speech recognition applications. Relative to the traditional LSTM architecture obtained for the Penn Treebank dataset, we reduce the number of parameters by 18.0x (30.5x) and run-time latency by up to 2.4x (5.2x) on Nvidia GPUs (Intel CPUs) without any accuracy degradation. For the DeepSpeech2 architecture obtained for the AN4 dataset, we reduce the model size by 7.0x (19.4x), the word error rate from 12.9% to 9.9% (10.4%), and run-time latency by up to 1.7x (2.4x) on Nvidia GPUs (Intel CPUs). Our method consistently outperforms prior art, yielding compact, accurate, yet execution-efficient models.
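The compression step the abstract refers to, skipping insignificant weights, is commonly realized as magnitude-based pruning: weights below a threshold are zeroed, yielding a sparse model. The sketch below is a minimal, hypothetical illustration of that idea in NumPy; it is not the authors' grow-and-prune implementation, and the function name and sparsity target are assumptions for illustration only.

```python
import numpy as np

def magnitude_prune(weights, sparsity):
    """Zero out the smallest-magnitude entries of a weight matrix.

    Illustrative sketch of magnitude-based pruning, not the paper's
    hardware-guided grow-and-prune algorithm. `sparsity` is the
    fraction of weights to remove (0.0 to 1.0).
    """
    flat = np.abs(weights).ravel()
    k = int(sparsity * flat.size)  # number of weights to prune
    if k == 0:
        return weights.copy()
    # k-th smallest absolute value becomes the pruning threshold
    threshold = np.partition(flat, k - 1)[k - 1]
    return np.where(np.abs(weights) <= threshold, 0.0, weights)

# Example: prune 75% of a random 8x8 weight matrix
rng = np.random.default_rng(0)
W = rng.standard_normal((8, 8))
W_pruned = magnitude_prune(W, 0.75)
```

In the paper's setting, the retained network size would additionally be steered by measured hardware latency (the hysteresis effect), rather than by a fixed sparsity target alone.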

Original language: English (US)
Journal: IEEE Transactions on Emerging Topics in Computing
DOIs
State: Accepted/In press - 2021

All Science Journal Classification (ASJC) codes

  • Computer Science (miscellaneous)
  • Information Systems
  • Human-Computer Interaction
  • Computer Science Applications

Keywords

  • Artificial neural networks
  • Computational modeling
  • Computer architecture
  • Deep learning
  • Grow-and-prune synthesis
  • Hardware
  • Language modeling
  • Long short-term memory
  • Neural network
  • Sparse matrices
  • Speech recognition
  • Stacked architecture
  • Symbiosis
  • Training

