TY - CPAPER
T1 - Dimensionality reduction for k-means clustering and low rank approximation
AU - Cohen, Michael B.
AU - Elder, Sam
AU - Musco, Cameron
AU - Musco, Christopher
AU - Persu, Madalina
N1 - Copyright 2016 Elsevier B.V., All rights reserved.
PY - 2015/6/14
Y1 - 2015/6/14
AB - We show how to approximate a data matrix A with a much smaller sketch Ã that can be used to solve a general class of constrained k-rank approximation problems to within (1 + ε) error. Importantly, this class includes k-means clustering and unconstrained low rank approximation (i.e., principal component analysis). By reducing data points to just O(k) dimensions, we generically accelerate any exact, approximate, or heuristic algorithm for these ubiquitous problems. For k-means dimensionality reduction, we provide (1 + ε) relative error results for many common sketching techniques, including random row projection, column selection, and approximate SVD. For approximate principal component analysis, we give a simple alternative to known algorithms that has applications in the streaming setting. Additionally, we extend recent work on column-based matrix reconstruction, giving column subsets that not only 'cover' a good subspace for A, but can be used directly to compute this subspace. Finally, for k-means clustering, we show how to achieve a (9 + ε) approximation by Johnson-Lindenstrauss projecting data to just O(log k / ε²) dimensions. This is the first result that leverages the specific structure of k-means to achieve dimension independent of input size and sublinear in k.
UR - http://www.scopus.com/inward/record.url?scp=84958764795&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=84958764795&partnerID=8YFLogxK
U2 - 10.1145/2746539.2746569
DO - 10.1145/2746539.2746569
M3 - Conference contribution
AN - SCOPUS:84958764795
T3 - Proceedings of the Annual ACM Symposium on Theory of Computing
SP - 163
EP - 172
BT - STOC 2015 - Proceedings of the 2015 ACM Symposium on Theory of Computing
PB - Association for Computing Machinery
T2 - 47th Annual ACM Symposium on Theory of Computing, STOC 2015
Y2 - 14 June 2015 through 17 June 2015
ER -