## Abstract

We give the first outlier-robust efficient algorithm for clustering a mixture of k statistically separated d-dimensional Gaussians (k-GMMs). Concretely, our algorithm takes input an epsilon-corrupted sample from a k-GMM and outputs an approximate clustering that misclassifies at most k{O(k)}(epsilon+ eta) fraction of the points whenever every pair of mixture components are separated by 1-exp(-poly(k eta)) in total variation distance. This is the statistically weakest possible notion of separation and allows, for e.g., clustering of mixtures with components with the same mean with covariances differing in a single unknown direction or separated in Frobenius distance. The running time of our algorithm is d{poly(k eta)}. Such results were not known prior to our work, even for k=2. More generally, our algorithms succeed for mixtures of any distribution that satisfies two well-studied analytic assumptions-sum-of-squares certifiable hypercontractivity and anti-concentration. As an immediate corollary, they extend to clustering mixtures of arbitrary affine transforms of the uniform distribution on the d-dimensional unit sphere. Even the information theoretic clusterability of separated distributions satisfying our analytic assumptions was not known and is likely to be of independent interest. Our algorithms build on the recent flurry of work relying on certifiable anti-concentration first introduced in [1], [2]. Our techniques expand the sum-of-squares toolkit to show robust certifiability of TV-separated Gaussian clusters in data. This involves giving a low-degree sum-of-squares proof of statements that relate parameter (i.e. mean and covariances) distance to total variation distance by relying only on hypercontractivity and anti-concentration.

Original language | English (US) |
---|---|

Title of host publication | Proceedings - 2020 IEEE 61st Annual Symposium on Foundations of Computer Science, FOCS 2020 |

Publisher | IEEE Computer Society |

Pages | 149-159 |

Number of pages | 11 |

ISBN (Electronic) | 9781728196213 |

DOIs | |

State | Published - Nov 2020 |

Externally published | Yes |

Event | 61st IEEE Annual Symposium on Foundations of Computer Science, FOCS 2020 - Virtual, Durham, United States Duration: Nov 16 2020 → Nov 19 2020 |

### Publication series

Name | Proceedings - Annual IEEE Symposium on Foundations of Computer Science, FOCS |
---|---|

Volume | 2020-November |

ISSN (Print) | 0272-5428 |

### Conference

Conference | 61st IEEE Annual Symposium on Foundations of Computer Science, FOCS 2020 |
---|---|

Country/Territory | United States |

City | Virtual, Durham |

Period | 11/16/20 → 11/19/20 |

## All Science Journal Classification (ASJC) codes

- General Computer Science

## Keywords

- Gaussian Mixture Models
- Robust statistics
- Sum of Squares Method