TY - GEN
T1 - Training Trajectories of Language Models Across Scales
AU - Xia, Mengzhou
AU - Artetxe, Mikel
AU - Zhou, Chunting
AU - Lin, Xi Victoria
AU - Pasunuru, Ramakanth
AU - Chen, Danqi
AU - Zettlemoyer, Luke
AU - Stoyanov, Ves
N1 - Publisher Copyright:
© 2023 Association for Computational Linguistics.
PY - 2023
Y1 - 2023
N2 - Scaling up language models has led to unprecedented performance gains, but little is understood about how the training dynamics change as models get larger. How do language models of different sizes learn during pre-training? Why do larger language models demonstrate more desirable behaviors? In this paper, we analyze the intermediate training checkpoints of differently sized OPT models (Zhang et al., 2022), ranging from 125M to 175B parameters, on next-token prediction, sequence-level generation, and downstream tasks. We find that 1) at a given perplexity and independent of model size, a similar subset of training tokens sees the most significant reduction in loss, with the rest stagnating or showing double-descent behavior (Nakkiran et al., 2020); 2) early in training, all models learn to reduce the perplexity of grammatical sequences that contain hallucinations, with small models halting at this suboptimal distribution and larger ones eventually learning to assign these sequences lower probabilities; and 3) perplexity is a strong predictor of in-context learning performance on 74 multiple-choice tasks from BIG-Bench, and this holds independently of model size. Together, these results show that perplexity is more predictive of model behaviors than model size or training computation.
AB - Scaling up language models has led to unprecedented performance gains, but little is understood about how the training dynamics change as models get larger. How do language models of different sizes learn during pre-training? Why do larger language models demonstrate more desirable behaviors? In this paper, we analyze the intermediate training checkpoints of differently sized OPT models (Zhang et al., 2022), ranging from 125M to 175B parameters, on next-token prediction, sequence-level generation, and downstream tasks. We find that 1) at a given perplexity and independent of model size, a similar subset of training tokens sees the most significant reduction in loss, with the rest stagnating or showing double-descent behavior (Nakkiran et al., 2020); 2) early in training, all models learn to reduce the perplexity of grammatical sequences that contain hallucinations, with small models halting at this suboptimal distribution and larger ones eventually learning to assign these sequences lower probabilities; and 3) perplexity is a strong predictor of in-context learning performance on 74 multiple-choice tasks from BIG-Bench, and this holds independently of model size. Together, these results show that perplexity is more predictive of model behaviors than model size or training computation.
UR - http://www.scopus.com/inward/record.url?scp=85171654260&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85171654260&partnerID=8YFLogxK
M3 - Conference contribution
AN - SCOPUS:85171654260
T3 - Proceedings of the Annual Meeting of the Association for Computational Linguistics
SP - 13711
EP - 13738
BT - Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
PB - Association for Computational Linguistics (ACL)
T2 - 61st Annual Meeting of the Association for Computational Linguistics, ACL 2023
Y2 - 9 July 2023 through 14 July 2023
ER -