TY - GEN
T1 - Pruning nearest neighbor cluster trees
AU - Kpotufe, Samory
AU - Von Luxburg, Ulrike
PY - 2011
Y1 - 2011
AB - Nearest neighbor (k-NN) graphs are widely used in machine learning and data mining applications, and our aim is to better understand what they reveal about the cluster structure of the unknown underlying distribution of points. Moreover, is it possible to identify spurious structures that might arise due to sampling variability? Our first contribution is a statistical analysis that reveals how certain subgraphs of a k-NN graph form a consistent estimator of the cluster tree of the underlying distribution of points. Our second and perhaps most important contribution is the following finite sample guarantee. We carefully work out the tradeoff between aggressive and conservative pruning and are able to guarantee the removal of all spurious cluster structures at all levels of the tree while at the same time guaranteeing the recovery of salient clusters. This is the first such finite sample result in the context of clustering.
UR - http://www.scopus.com/inward/record.url?scp=80053435626&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=80053435626&partnerID=8YFLogxK
M3 - Conference contribution
AN - SCOPUS:80053435626
SN - 9781450306195
T3 - Proceedings of the 28th International Conference on Machine Learning, ICML 2011
SP - 225
EP - 232
BT - Proceedings of the 28th International Conference on Machine Learning, ICML 2011
T2 - 28th International Conference on Machine Learning, ICML 2011
Y2 - 28 June 2011 through 2 July 2011
ER -