Abstract
The worst-case data-generating (WCDG) probability measure is introduced as a tool for characterizing the generalization capabilities of machine learning algorithms. Such a WCDG probability measure is shown to be the unique solution to two different optimization problems: (a) The maximization of the expected loss over the set of probability measures on the datasets whose relative entropy with respect to a reference measure is not larger than a given threshold; and (b) The maximization of the expected loss with regularization by relative entropy with respect to the reference measure. Such a reference measure can be interpreted as a prior on the datasets. The WCDG cumulants are finite and bounded in terms of the cumulants of the reference measure. To analyze the concentration of the expected empirical risk induced by the WCDG probability measure, the notion of (epsilon, delta ) -robustness of models is introduced. Closed-form expressions are presented for the sensitivity of the expected loss for a fixed model. These results lead to a novel expression for the generalization error of arbitrary machine learning algorithms. This exact expression is provided in terms of the WCDG probability measure and leads to an upper bound that is equal to the sum of the mutual information and the lautum information between the models and the datasets, up to a constant factor. This upper bound is achieved by a Gibbs algorithm. This finding reveals that an exploration into the generalization error of the Gibbs algorithm facilitates the derivation of overarching insights applicable to any machine learning algorithm.
Original language | English (US) |
---|---|
Pages (from-to) | 175-189 |
Number of pages | 15 |
Journal | IEEE Journal on Selected Areas in Information Theory |
Volume | 5 |
DOIs | |
State | Published - 2024 |
Externally published | Yes |
All Science Journal Classification (ASJC) codes
- Computer Networks and Communications
- Media Technology
- Artificial Intelligence
- Applied Mathematics
Keywords
- Gibbs algorithm
- Supervised machine learning
- generalization gap
- relative entropy
- sensitivity
- worst-case