Training Trajectories of Language Models Across Scales

Mengzhou Xia, Mikel Artetxe, Chunting Zhou, Xi Victoria Lin, Ramakanth Pasunuru, Danqi Chen, Luke Zettlemoyer, Ves Stoyanov

Research output: Chapter in Book/Report/Conference proceedingConference contribution

2 Scopus citations


Scaling up language models has led to unprecedented performance gains, but little is understood about how the training dynamics change as models get larger. How do language models of different sizes learn during pre-training? Why do larger language models demonstrate more desirable behaviors? In this paper, we analyze the intermediate training checkpoints of differently sized OPT models (Zhang et al., 2022)-from 125M to 175B parameters-on next-token prediction, sequence-level generation and downstream tasks. We find that 1) at a given perplexity and independent of model sizes, a similar subset of training tokens see the most significant reduction in loss, with the rest stagnating or showing double-descent behavior (Nakkiran et al., 2020); 2) early in training, all models learn to reduce the perplexity of grammatical sequences that contain hallucinations, with small models halting at this suboptimal distribution and larger ones eventually learning to assign these sequences lower probabilities; and 3) perplexity is a strong predictor of in-context learning performance on 74 multiple-choice tasks from BIG-Bench, and this holds independently of the model size. Together, these results show that perplexity is more predictive of model behaviors than model size or training computation.

Original languageEnglish (US)
Title of host publicationLong Papers
PublisherAssociation for Computational Linguistics (ACL)
Number of pages28
ISBN (Electronic)9781959429722
StatePublished - 2023
Event61st Annual Meeting of the Association for Computational Linguistics, ACL 2023 - Toronto, Canada
Duration: Jul 9 2023Jul 14 2023

Publication series

NameProceedings of the Annual Meeting of the Association for Computational Linguistics
ISSN (Print)0736-587X


Conference61st Annual Meeting of the Association for Computational Linguistics, ACL 2023

All Science Journal Classification (ASJC) codes

  • Computer Science Applications
  • Linguistics and Language
  • Language and Linguistics


Dive into the research topics of 'Training Trajectories of Language Models Across Scales'. Together they form a unique fingerprint.

Cite this