TY - GEN
T1 - Improving DNN Inference Throughput Using Practical, Per-Input Compute Adaptation
AU - Padmanabha Iyer, Anand
AU - Guan, Mingyu
AU - Dai, Yinwei
AU - Pan, Rui
AU - Gandhi, Swapnil
AU - Netravali, Ravi
N1 - Publisher Copyright:
© 2024 Copyright held by the owner/author(s).
PY - 2024/11/15
Y1 - 2024/11/15
AB - Machine learning inference platforms continue to face high request rates and strict latency constraints. Existing solutions largely focus on compressing models to substantially lower compute costs (and time) with mild accuracy degradations. This paper explores an alternate (but complementary) technique that trades off accuracy and resource costs on a per-input granularity: early exit models, which selectively allow certain inputs to exit a model from an intermediate layer. Though intuitive, early exits face fundamental deployment challenges, largely owing to the effects that exiting inputs have on batch size (and resource utilization) throughout model execution. We present E3, the first system that makes early exit models practical for realistic inference deployments. Our key insight is to split and replicate blocks of layers in models in a manner that maintains a constant batch size throughout execution, all the while accounting for resource requirements and communication overheads. Evaluations with NLP and vision models show that E3 can deliver up to 1.74× improvement in goodput (for a fixed cost) or 1.78× reduction in cost (for a fixed goodput). Additionally, E3's goodput wins generalize to autoregressive LLMs (2.8-3.8×) and compressed models (1.67×).
UR - http://www.scopus.com/inward/record.url?scp=85215525365&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85215525365&partnerID=8YFLogxK
U2 - 10.1145/3694715.3695978
DO - 10.1145/3694715.3695978
M3 - Conference contribution
AN - SCOPUS:85215525365
T3 - SOSP 2024 - Proceedings of the 2024 ACM SIGOPS 30th Symposium on Operating Systems Principles
SP - 624
EP - 639
BT - SOSP 2024 - Proceedings of the 2024 ACM SIGOPS 30th Symposium on Operating Systems Principles
PB - Association for Computing Machinery, Inc
T2 - 30th ACM Symposium on Operating Systems Principles, SOSP 2024
Y2 - 4 November 2024 through 6 November 2024
ER -