DynaMo: Accelerating Language Model Inference with Dynamic Multi-Token Sampling

Shikhar Tuli, Chi Heng Lin, Yen Chang Hsu, Niraj K. Jha, Yilin Shen, Hongxia Jin

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

Abstract

Traditional language models operate autoregressively, i.e., they predict one token at a time. The rapid growth in model sizes has resulted in long inference times. In this work, we propose DynaMo, a suite of multi-token prediction language models that reduce net inference times. Our models dynamically predict multiple tokens based on their confidence in the predicted joint probability distribution. We propose a lightweight technique to train these models, leveraging the weights of traditional autoregressive counterparts. Moreover, we propose novel ways to enhance the estimated joint probability to improve text generation quality, namely co-occurrence weighted masking and adaptive thresholding. We also propose systematic qualitative and quantitative methods to rigorously test the quality of text produced by non-autoregressive generation. One of the models in our suite, DynaMo-7.3BT3, generates text of the same quality as the baseline (Pythia-6.9B) while achieving a 2.57× speed-up, with parameter and training time overheads of only 5.87% and 2.67%, respectively.
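The abstract describes models that emit a variable number of tokens per step depending on their confidence in the predicted joint probability distribution. The sketch below illustrates only that control flow, under assumptions that are not taken from the paper: the model supplies one marginal distribution per future offset (the `marginals` input), tokens are chosen greedily rather than sampled, the joint probability is approximated as a product of the chosen marginals, and the acceptance `threshold` is fixed. The function name and its parameters are hypothetical, and it does not implement the paper's co-occurrence weighted masking or adaptive thresholding.

```python
import math
from typing import List, Sequence


def dynamic_multi_token_select(
    marginals: Sequence[Sequence[float]],
    threshold: float,
) -> List[int]:
    """Hypothetical sketch: decide how many predicted tokens to emit in one step.

    marginals[i] is assumed to be a probability distribution over the
    vocabulary for the token at offset i from the current position.
    The joint probability of the greedy prefix is approximated as the
    product of per-position maxima (an independence assumption, not the
    paper's co-occurrence weighted estimate).
    """
    if not marginals:
        raise ValueError("need at least one predicted distribution")
    if not 0.0 < threshold <= 1.0:
        raise ValueError("threshold must be in (0, 1]")

    tokens: List[int] = []
    log_joint = 0.0
    log_threshold = math.log(threshold)
    for i, dist in enumerate(marginals):
        # Greedy choice at this offset (assumption: no stochastic sampling).
        token = max(range(len(dist)), key=dist.__getitem__)
        candidate = log_joint + math.log(dist[token])
        # The first token is always emitted, as in ordinary autoregressive
        # decoding; later tokens are kept only while the running joint
        # probability stays at or above the threshold.
        if i > 0 and candidate < log_threshold:
            break
        tokens.append(token)
        log_joint = candidate
    return tokens


if __name__ == "__main__":
    heads = [[0.1, 0.9, 0.0], [0.2, 0.7, 0.1], [0.5, 0.3, 0.2]]
    print(dynamic_multi_token_select(heads, threshold=0.5))
```

With these toy distributions the running joint probability is 0.9, then 0.63, then 0.315, so a threshold of 0.5 emits two tokens in a single step while a threshold of 0.3 emits all three. Working in log space avoids underflow when many offsets are multiplied together.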

Original language: English (US)
Title of host publication: Long Papers
Editors: Kevin Duh, Helena Gomez, Steven Bethard
Publisher: Association for Computational Linguistics (ACL)
Pages: 3322–3345
Number of pages: 24
ISBN (Electronic): 9798891761148
State: Published - 2024
Externally published: Yes
Event: 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, NAACL 2024 - Hybrid, Mexico City, Mexico
Duration: Jun 16 2024 to Jun 21 2024

Publication series

Name: Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, NAACL 2024
Volume: 1

Conference

Conference: 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, NAACL 2024
Country/Territory: Mexico
City: Hybrid, Mexico City
Period: 6/16/24 to 6/21/24

All Science Journal Classification (ASJC) codes

  • Computer Networks and Communications
  • Hardware and Architecture
  • Information Systems
  • Software
