TY - GEN
T1 - MUX-PLMs
T2 - 2023 Findings of the Association for Computational Linguistics: EMNLP 2023
AU - Murahari, Vishvak
AU - Deshpande, Ameet
AU - Jimenez, Carlos E.
AU - Shafran, Izhak
AU - Wang, Mingqiu
AU - Cao, Yuan
AU - Narasimhan, Karthik
N1 - Publisher Copyright:
© 2023 Association for Computational Linguistics.
PY - 2023
Y1 - 2023
AB - The widespread adoption of large language models such as ChatGPT and Bard has led to unprecedented demand for these technologies. The burgeoning cost of inference for ever-increasing model sizes, coupled with hardware shortages, has limited affordable access and poses a pressing need for efficiency approaches geared towards high throughput and performance. Multi-input multi-output (MIMO) algorithms, such as data multiplexing, offer a promising solution with a many-fold increase in throughput by performing inference for multiple inputs at the cost of a single input. Yet these approaches are not currently performant enough to be deployed in modern systems. We change that by developing MUX-PLMs, a class of deployable high-throughput pre-trained language models (PLMs) trained with data multiplexing that can be fine-tuned on any downstream task. Our novel multiplexing and demultiplexing modules proficiently entangle and disentangle inputs, and enable high-performance, high-throughput MUX-PLMs that are competitive with vanilla PLMs while achieving 2x/5x inference speedup with only a 1-4% performance drop on a broad suite of tasks.
UR - http://www.scopus.com/inward/record.url?scp=85183288927&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85183288927&partnerID=8YFLogxK
DO - 10.18653/v1/2023.findings-emnlp.301
M3 - Conference contribution
AN - SCOPUS:85183288927
T3 - Findings of the Association for Computational Linguistics: EMNLP 2023
SP - 4540
EP - 4554
BT - Findings of the Association for Computational Linguistics: EMNLP 2023
PB - Association for Computational Linguistics (ACL)
Y2 - 6 December 2023 through 10 December 2023
ER -
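
For readers unfamiliar with data multiplexing, the sketch below illustrates the MIMO idea summarized in the abstract: a multiplexing module entangles several inputs into one representation, the backbone runs a single forward pass, and a demultiplexing module recovers per-input outputs. This is a minimal, hypothetical PyTorch sketch under stated assumptions, not the authors' MUX-PLM implementation; the Multiplexer/Demultiplexer classes, the averaging-based mixing, and all sizes are illustrative choices.

# Minimal sketch of data multiplexing / demultiplexing (MIMO inference).
# Not the MUX-PLM implementation; module designs here are illustrative assumptions.
import torch
import torch.nn as nn

class Multiplexer(nn.Module):
    """Entangles N input embeddings into a single multiplexed embedding."""
    def __init__(self, hidden_size: int, num_instances: int):
        super().__init__()
        # One learned transform per instance slot (an illustrative choice).
        self.transforms = nn.ModuleList(
            nn.Linear(hidden_size, hidden_size) for _ in range(num_instances)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (num_instances, seq_len, hidden_size)
        mixed = torch.stack(
            [t(x[i]) for i, t in enumerate(self.transforms)], dim=0
        )
        # Average the transformed instances -> (seq_len, hidden_size),
        # so the backbone performs one forward pass for N inputs.
        return mixed.mean(dim=0)

class Demultiplexer(nn.Module):
    """Disentangles the shared output back into N per-instance outputs."""
    def __init__(self, hidden_size: int, num_instances: int):
        super().__init__()
        self.heads = nn.ModuleList(
            nn.Sequential(nn.Linear(hidden_size, hidden_size), nn.GELU(),
                          nn.Linear(hidden_size, hidden_size))
            for _ in range(num_instances)
        )

    def forward(self, h: torch.Tensor) -> torch.Tensor:
        # h: (seq_len, hidden_size) -> (num_instances, seq_len, hidden_size)
        return torch.stack([head(h) for head in self.heads], dim=0)

if __name__ == "__main__":
    N, seq_len, hidden = 4, 16, 64
    backbone = nn.TransformerEncoderLayer(d_model=hidden, nhead=4, batch_first=False)
    mux, demux = Multiplexer(hidden, N), Demultiplexer(hidden, N)

    inputs = torch.randn(N, seq_len, hidden)       # N separate inputs
    shared = backbone(mux(inputs).unsqueeze(1))    # a single forward pass
    outputs = demux(shared.squeeze(1))             # N recovered outputs
    print(outputs.shape)                           # torch.Size([4, 16, 64])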