Abstract
Long short-term memory (LSTM) applications need fast yet compact models. Neural network compression approaches, e.g., the grow-and-prune paradigm, have proved promising for cutting down network complexity by removing insignificant weights. However, current compression strategies remain mostly hardware-agnostic, and a reduction in network complexity does not always translate into execution efficiency. In this work, we propose a hardware-guided symbiotic training methodology for compact, accurate, yet execution-efficient inference models. It is based on our observation that hardware can introduce substantial non-monotonic behavior, which we call the latency hysteresis effect, in the relationship between network size and inference latency. This observation calls into question the mainstream smaller-dimension-is-better compression strategy, which often leads to a sub-optimal model architecture. By leveraging the hardware-induced hysteresis effect and sparsity, we enable a symbiosis of model compactness and accuracy with execution efficiency, reducing LSTM latency while increasing accuracy. We have evaluated our approach on language modeling and speech recognition applications. Relative to the traditional stacked LSTM architecture obtained for the Penn Treebank dataset, we reduce the number of parameters by 18.0× (30.5×) and measured run-time latency by up to 2.4× (5.2×) on Nvidia GPUs (Intel Xeon CPUs) without any accuracy degradation. For the DeepSpeech2 architecture obtained for the AN4 dataset, we reduce the model size by 7.0× (19.4×), the word error rate from 12.9% to 9.9% (10.4%), and measured run-time latency by up to 1.7× (2.4×) on Nvidia GPUs (Intel Xeon CPUs). Our method consistently outperforms prior art for both applications, yielding compact, accurate, and execution-efficient inference models.
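The latency hysteresis effect described above can be probed directly by timing an LSTM layer across a sweep of hidden sizes. The sketch below is not the authors' code; it assumes PyTorch, and the input size, batch size, sequence length, and sweep range are illustrative choices. It reports the median forward-pass latency per width; on real hardware the resulting curve is typically non-monotonic, so a slightly wider layer can be as fast as, or faster than, a smaller one.

```python
# Minimal sketch: measure LSTM forward latency as a function of hidden size
# to look for non-monotonic (hysteresis-like) behavior on the target hardware.
import time
import torch

DEVICE = "cuda" if torch.cuda.is_available() else "cpu"

def lstm_latency_ms(hidden_size, input_size=256, seq_len=64, batch=32,
                    warmup=10, iters=50):
    """Median forward-pass latency (ms) of a single LSTM layer of the given width."""
    lstm = torch.nn.LSTM(input_size, hidden_size).to(DEVICE).eval()
    x = torch.randn(seq_len, batch, input_size, device=DEVICE)
    with torch.no_grad():
        for _ in range(warmup):                 # warm up clocks and caches
            lstm(x)
        if DEVICE == "cuda":
            torch.cuda.synchronize()
        times = []
        for _ in range(iters):
            start = time.perf_counter()
            lstm(x)
            if DEVICE == "cuda":
                torch.cuda.synchronize()        # wait for GPU kernels to finish
            times.append((time.perf_counter() - start) * 1e3)
    return sorted(times)[len(times) // 2]

if __name__ == "__main__":
    # Latency need not grow monotonically with hidden size; a slightly wider
    # layer can occasionally run faster, which is the effect a hardware-guided
    # search can exploit when choosing layer dimensions.
    for h in range(400, 801, 16):
        print(f"hidden={h:4d}  latency={lstm_latency_ms(h):7.3f} ms")
```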
| Original language | English (US) |
| --- | --- |
| Pages (from-to) | 1799-1809 |
| Number of pages | 11 |
| Journal | IEEE Transactions on Emerging Topics in Computing |
| Volume | 10 |
| Issue number | 4 |
| DOIs | |
| State | Published - Oct 1 2022 |
All Science Journal Classification (ASJC) codes
- Computer Science (miscellaneous)
- Information Systems
- Human-Computer Interaction
- Computer Science Applications
Keywords
- Deep learning
- grow-and-prune synthesis
- language modeling
- long short-term memory
- neural network
- speech recognition
- stacked architecture