GREATS: Online Selection of High-Quality Data for LLM Training in Every Iteration

Jiachen T. Wang, Tong Wu, Dawn Song, Prateek Mittal, Ruoxi Jia

Research output: Contribution to journalConference articlepeer-review

1 Scopus citations

Abstract

Online batch selection methods offer an adaptive alternative to static training data selection by dynamically selecting data batches during training. However, existing methods either rely on impractical reference models or simple heuristics that may not capture true data informativeness. To address these limitations, we propose GREedy Approximation Taylor Selection (GREATS), a principled and efficient online batch selection method that applies greedy algorithm to optimize the data batch quality approximated by Taylor expansion. We develop a series of techniques to scale GREATS to large-scale model training. Extensive experiments with large language models (LLMs) demonstrate that GREATS significantly improves training convergence speed and generalization performance. Our codebase is publically available at https://github.com/Jiachen-T-Wang/GREATS.

Original languageEnglish (US)
JournalAdvances in Neural Information Processing Systems
Volume37
StatePublished - 2024
Event38th Conference on Neural Information Processing Systems, NeurIPS 2024 - Vancouver, Canada
Duration: Dec 9 2024Dec 15 2024

All Science Journal Classification (ASJC) codes

  • Computer Networks and Communications
  • Information Systems
  • Signal Processing

Fingerprint

Dive into the research topics of 'GREATS: Online Selection of High-Quality Data for LLM Training in Every Iteration'. Together they form a unique fingerprint.

Cite this