Improving DNN Inference Throughput Using Practical, Per-Input Compute Adaptation

Anand Padmanabha Iyer, Mingyu Guan, Yinwei Dai, Rui Pan, Swapnil Gandhi, Ravi Netravali

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

1 Scopus citation

Abstract

Machine learning inference platforms continue to face high request rates and strict latency constraints. Existing solutions largely focus on compressing models to substantially lower compute costs (and time) with mild accuracy degradations. This paper explores an alternate (but complementary) technique that trades off accuracy and resource costs at a per-input granularity: early exit models, which selectively allow certain inputs to exit a model from an intermediate layer. Though intuitive, early exits face fundamental deployment challenges, largely owing to the effects that exiting inputs have on batch size (and resource utilization) throughout model execution. We present E3, the first system that makes early exit models practical for realistic inference deployments. Our key insight is to split and replicate blocks of layers in models in a manner that maintains a constant batch size throughout execution, all the while accounting for resource requirements and communication overheads. Evaluations with NLP and vision models show that E3 can deliver up to 1.74× improvement in goodput (for a fixed cost) or 1.78× reduction in cost (for a fixed goodput). Additionally, E3's goodput wins generalize to autoregressive LLMs (2.8-3.8×) and compressed models (1.67×).

Original language: English (US)
Title of host publication: SOSP 2024 - Proceedings of the 2024 ACM SIGOPS 30th Symposium on Operating Systems Principles
Publisher: Association for Computing Machinery, Inc
Pages: 624-639
Number of pages: 16
ISBN (Electronic): 9798400712517
DOIs
State: Published - Nov 15 2024
Event: 30th ACM Symposium on Operating Systems Principles, SOSP 2024 - Austin, United States
Duration: Nov 4 2024 → Nov 6 2024

Publication series

Name: SOSP 2024 - Proceedings of the 2024 ACM SIGOPS 30th Symposium on Operating Systems Principles

Conference

Conference: 30th ACM Symposium on Operating Systems Principles, SOSP 2024
Country/Territory: United States
City: Austin
Period: 11/4/24 → 11/6/24

All Science Journal Classification (ASJC) codes

  • Computational Theory and Mathematics
  • Computer Science Applications
  • Software
