Clustering of data sets is a standard problem in many areas of science and
engineering. The method of spectral clustering is based on embedding the data
set using a kernel function, and using the top eigenvectors of the normalized
Laplacian to recover the connected components. We study the performance of
spectral clustering in recovering the latent labels of i.i.d. samples from a
finite mixture of nonparametric distributions. The difficulty of this label
recovery problem depends on the overlap between mixture components and how
easily a mixture component is divided into two nonoverlapping components. When
the overlap is small compared to the indivisibility of the mixture components,
the principal eigenspace of the population-level normalized Laplacian operator
is approximately spanned by the square-root kernelized component densities. In
the finite sample setting, and under the same assumption, embedded samples from
different components are approximately orthogonal with high probability when
the sample size is large. As a corollary we control the fraction of samples
mislabeled by spectral clustering under finite mixtures with nonparametric
components.Comment: Published at http://dx.doi.org/10.1214/14-AOS1283 in the Annals of
Statistics (http://www.imstat.org/aos/) by the Institute of Mathematical
Statistics (http://www.imstat.org