TY - GEN
T1 - CAPTURING THE TEMPORAL DEPENDENCE OF TRAINING DATA INFLUENCE
AU - Wang, Jiachen T.
AU - Song, Dawn
AU - Zou, James
AU - Mittal, Prateek
AU - Jia, Ruoxi
N1 - Publisher Copyright:
© 2025 13th International Conference on Learning Representations, ICLR 2025. All rights reserved.
PY - 2025
Y1 - 2025
N2 - Traditional data influence estimation methods, like influence function, assume that learning algorithms are permutation-invariant with respect to training data. However, modern training paradigms, especially for foundation models using stochastic algorithms and multi-stage curricula, are sensitive to data ordering, thus violating this assumption. This mismatch renders influence functions inadequate for answering a critical question in machine learning: How can we capture the dependence of data influence on the optimization trajectory during training? To address this gap, we formalize the concept of trajectory-specific leave-one-out (LOO) influence, which quantifies the impact of removing a data point from a specific iteration during training, accounting for the exact sequence of data encountered and the model's optimization trajectory. However, exactly evaluating the trajectory-specific LOO presents a significant computational challenge. To address this, we propose data value embedding, a novel technique enabling efficient approximation of trajectory-specific LOO. Specifically, we compute a training data embedding that encapsulates the cumulative interactions between data and the evolving model parameters. The LOO can then be efficiently approximated through a simple dot-product between the data value embedding and the gradient of the given test data. As data value embedding captures training data ordering, it offers valuable insights into model training dynamics. In particular, we uncover distinct phases of data influence, revealing that data points in the early and late stages of training exert a greater impact on the final model. These insights translate into actionable strategies for managing the computational overhead of data selection by strategically timing the selection process, potentially opening new avenues in data curation research.
AB - Traditional data influence estimation methods, like influence function, assume that learning algorithms are permutation-invariant with respect to training data. However, modern training paradigms, especially for foundation models using stochastic algorithms and multi-stage curricula, are sensitive to data ordering, thus violating this assumption. This mismatch renders influence functions inadequate for answering a critical question in machine learning: How can we capture the dependence of data influence on the optimization trajectory during training? To address this gap, we formalize the concept of trajectory-specific leave-one-out (LOO) influence, which quantifies the impact of removing a data point from a specific iteration during training, accounting for the exact sequence of data encountered and the model's optimization trajectory. However, exactly evaluating the trajectory-specific LOO presents a significant computational challenge. To address this, we propose data value embedding, a novel technique enabling efficient approximation of trajectory-specific LOO. Specifically, we compute a training data embedding that encapsulates the cumulative interactions between data and the evolving model parameters. The LOO can then be efficiently approximated through a simple dot-product between the data value embedding and the gradient of the given test data. As data value embedding captures training data ordering, it offers valuable insights into model training dynamics. In particular, we uncover distinct phases of data influence, revealing that data points in the early and late stages of training exert a greater impact on the final model. These insights translate into actionable strategies for managing the computational overhead of data selection by strategically timing the selection process, potentially opening new avenues in data curation research.
UR - https://www.scopus.com/pages/publications/105010266472
UR - https://www.scopus.com/pages/publications/105010266472#tab=citedBy
M3 - Conference contribution
AN - SCOPUS:105010266472
T3 - 13th International Conference on Learning Representations, ICLR 2025
SP - 85607
EP - 85644
BT - 13th International Conference on Learning Representations, ICLR 2025
PB - International Conference on Learning Representations, ICLR
T2 - 13th International Conference on Learning Representations, ICLR 2025
Y2 - 24 April 2025 through 28 April 2025
ER -