TY - GEN

T1 - Diffusion maps - A probabilistic interpretation for spectral embedding and clustering algorithms

AU - Nadler, Boaz

AU - Lafon, Stephane

AU - Coifman, Ronald

AU - Kevrekidis, Ioannis G.

PY - 2008

Y1 - 2008

N2 - Spectral embedding and spectral clustering are common methods for non-linear dimensionality reduction and clustering of complex high dimensional datasets. In this paper we provide a diffusion based probabilistic analysis of algorithms that use the normalized graph Laplacian. Given the pairwise adjacency matrix of all points in a dataset, we define a random walk on the graph of points and a diffusion distance between any two points. We show that the diffusion distance is equal to the Euclidean distance in the embedded space with all eigenvectors of the normalized graph Laplacian. This identity shows that characteristic relaxation times and processes of the random walk on the graph are the key concept that governs the properties of these spectral clustering and spectral embedding algorithms. Specifically, for spectral clustering to succeed, a necessary condition is that the mean exit times from each cluster need to be significantly larger than the largest (slowest) of all relaxation times inside all of the individual clusters. For complex, multiscale data, this condition may not hold and multiscale methods need to be developed to handle such situations.

AB - Spectral embedding and spectral clustering are common methods for non-linear dimensionality reduction and clustering of complex high dimensional datasets. In this paper we provide a diffusion based probabilistic analysis of algorithms that use the normalized graph Laplacian. Given the pairwise adjacency matrix of all points in a dataset, we define a random walk on the graph of points and a diffusion distance between any two points. We show that the diffusion distance is equal to the Euclidean distance in the embedded space with all eigenvectors of the normalized graph Laplacian. This identity shows that characteristic relaxation times and processes of the random walk on the graph are the key concept that governs the properties of these spectral clustering and spectral embedding algorithms. Specifically, for spectral clustering to succeed, a necessary condition is that the mean exit times from each cluster need to be significantly larger than the largest (slowest) of all relaxation times inside all of the individual clusters. For complex, multiscale data, this condition may not hold and multiscale methods need to be developed to handle such situations.

UR - http://www.scopus.com/inward/record.url?scp=77952984416&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=77952984416&partnerID=8YFLogxK

U2 - 10.1007/978-3-540-73750-6_10

DO - 10.1007/978-3-540-73750-6_10

M3 - Conference contribution

AN - SCOPUS:77952984416

SN - 9783540737490

T3 - Lecture Notes in Computational Science and Engineering

SP - 238

EP - 260

BT - Principal Manifolds for Data Visualization and Dimension Reduction

ER -