Approximating Spectral Clustering via Sampling: a Review
Spectral clustering refers to a family of well-known unsupervised learning algorithms. Rather than attempting to cluster points in their native domain, one constructs a (usually sparse) similarity graph and computes the principal eigenvectors of its Laplacian. The eigenvectors are then interpreted as transformed points and fed into a k-means clustering algorithm. As a result of this non-linear transformation, it becomes possible to use a simple centroid-based algorithm to identify non-convex clusters, something that was otherwise impossible. Unfortunately, what makes spectral clustering so successful is also its Achilles heel: forming a graph and computing its dominant eigenvectors can be computationally prohibitive when dealing with more than a few tens of thousands of points. In this chapter, we review the principal research efforts aiming to reduce this computational cost. We focus on methods that come with a theoretical control on the clustering performance and incorporate some form of sampling in their operation. Such methods abound in the machine learning, numerical linear algebra, and graph signal processing literature and include, amongst others, Nyström approximation, landmarks, coarsening, coresets, and compressive spectral clustering. We present the approximation guarantees available for each and discuss practical merits and limitations. Surprisingly, despite the breadth of the literature explored, we conclude that there is still a gap between theory and practice: the most scalable methods are only intuitively motivated or loosely controlled, whereas those that come with end-to-end guarantees rely on strong assumptions or enable only a limited gain in computation time.
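The pipeline described in this abstract (similarity graph, Laplacian eigenvectors, k-means) can be sketched in a few lines. The following is a minimal illustration and not the chapter's own code: the k-nearest-neighbour graph, the symmetric normalized Laplacian, the row normalization, the farthest-point k-means initialization, and all function and parameter names are assumptions chosen for the example. The graph is stored densely, so the sketch only suits small inputs.

```python
import numpy as np

def kmeans(X, k, n_iter=50):
    # Deterministic farthest-point initialization (an assumption of this sketch).
    centers = [X[0]]
    for _ in range(1, k):
        d = np.min([((X - c) ** 2).sum(1) for c in centers], axis=0)
        centers.append(X[np.argmax(d)])
    centers = np.array(centers)
    for _ in range(n_iter):
        # Assign each point to its nearest center, then recompute the centers.
        labels = np.argmin(((X[:, None, :] - centers[None]) ** 2).sum(-1), axis=1)
        for j in range(k):
            if np.any(labels == j):
                centers[j] = X[labels == j].mean(0)
    return labels

def spectral_clustering(points, k, n_neighbors=5):
    n = len(points)
    d2 = ((points[:, None, :] - points[None, :, :]) ** 2).sum(-1)
    # Symmetric k-nearest-neighbour similarity graph (column 0 of argsort is the point itself).
    nearest = np.argsort(d2, axis=1)[:, 1:n_neighbors + 1]
    W = np.zeros((n, n))
    W[np.repeat(np.arange(n), n_neighbors), nearest.ravel()] = 1.0
    W = np.maximum(W, W.T)
    deg = W.sum(1)
    # Symmetric normalized Laplacian; its k smallest eigenvectors give the embedding.
    L = np.eye(n) - W / np.sqrt(np.outer(deg, deg))
    _, vecs = np.linalg.eigh(L)
    emb = vecs[:, :k]
    emb = emb / np.linalg.norm(emb, axis=1, keepdims=True)
    return kmeans(emb, k)

# Two concentric rings: non-convex clusters that plain k-means cannot separate.
rng = np.random.default_rng(0)
angles = np.linspace(0.0, 2.0 * np.pi, 60, endpoint=False)
ring = np.column_stack([np.cos(angles), np.sin(angles)])
points = np.vstack([1.0 * ring, 4.0 * ring]) + rng.normal(0.0, 0.01, size=(120, 2))
labels = spectral_clustering(points, k=2)
```

In this toy setting the two rings form separate components of the neighbour graph, so the embedding collapses each ring to a single point and the centroid-based step becomes trivial. The dense eigendecomposition is exactly the cost that the sampling methods reviewed in the chapter aim to avoid.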
On learning the structure of clusters in graphs
Graph clustering is a fundamental problem in unsupervised learning, with numerous applications in computer science and in analysing real-world data. In many real-world applications, we find that the clusters have a significant high-level structure. This is often overlooked in the design and analysis of graph clustering algorithms which make strong simplifying assumptions about the structure of the graph. This thesis addresses the natural question of whether the structure of clusters can be learned efficiently and describes four new algorithmic results for learning such structure in graphs and hypergraphs.
The first part of the thesis studies the classical spectral clustering algorithm, and presents a tighter analysis on its performance. This result explains why it works under a much weaker and more natural condition than the ones studied in the literature, and helps to close the gap between the theoretical guarantees of the spectral clustering algorithm and its excellent empirical performance.
The second part of the thesis builds on the theoretical guarantees of the previous part and shows that, when the clusters of the underlying graph have certain structures, spectral clustering with fewer than k eigenvectors is able to produce better output than classical spectral clustering in which k eigenvectors are employed, where k is the number of clusters. This presents the first work that discusses and analyses the performance of spectral clustering with fewer than k eigenvectors, and shows that general structures of clusters can be learned with spectral methods.
The third part of the thesis considers efficient learning of the structure of clusters with local algorithms, whose runtime depends only on the size of the target clusters and is independent of the underlying input graph. While the objective of classical local clustering algorithms is to find a cluster which is sparsely connected to the rest of the graph, this part of the thesis presents a local algorithm that finds a pair of clusters which are densely connected to each other. This result demonstrates that certain structures of clusters can be learned efficiently in the local setting, even in the massive graphs which are ubiquitous in real-world applications.
The final part of the thesis studies the problem of learning densely connected clusters in hypergraphs. The developed algorithm is based on a new heat diffusion process, whose analysis extends a sequence of recent work on the spectral theory of hypergraphs. It allows the structure of clusters to be learned in datasets modelling higher-order relations of objects and can be applied to efficiently analyse many complex datasets occurring in practice.
All of the presented theoretical results are further extensively evaluated on both synthetic and real-world datasets from different domains, including image classification and segmentation, migration networks, co-authorship networks, and natural language processing. These experimental results demonstrate that the newly developed algorithms are practical, effective, and immediately applicable for learning the structure of clusters in real-world data.
Spectral clustering and the high-dimensional stochastic blockmodel
Networks or graphs can easily represent a diverse set of data sources that
are characterized by interacting units or actors. Social networks, representing
people who communicate with each other, are one example. Communities or
clusters of highly connected actors form an essential feature in the structure
of several empirical networks. Spectral clustering is a popular and
computationally feasible method to discover these communities. The stochastic
blockmodel [Social Networks 5 (1983) 109--137] is a social network model with
well-defined communities; each node is a member of one community. For a network
generated from the Stochastic Blockmodel, we bound the number of nodes
"misclustered" by spectral clustering. The asymptotic results in this paper are
the first clustering results that allow the number of clusters in the model to
grow with the number of nodes, hence the name high-dimensional. In order to
study spectral clustering under the stochastic blockmodel, we first show that
under the more general latent space model, the eigenvectors of the normalized
graph Laplacian asymptotically converge to the eigenvectors of a "population"
normalized graph Laplacian. Aside from the implication for spectral clustering,
this provides insight into a graph visualization technique. Our method of
studying the eigenvectors of random matrices is original. Comment: Published at http://dx.doi.org/10.1214/11-AOS887 in the Annals of
Statistics (http://www.imstat.org/aos/) by the Institute of Mathematical
Statistics (http://www.imstat.org
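As a rough illustration of the setting in this abstract, the sketch below draws a graph from a two-block stochastic blockmodel, embeds it with the normalized graph Laplacian, and counts "misclustered" nodes. It is an assumption-laden toy, not the paper's procedure: the block sizes, the edge probabilities `p_in` and `p_out`, and the sign split of a single eigenvector (used here in place of k-means, which is only reasonable for two blocks) are all choices made for the example.

```python
import numpy as np

rng = np.random.default_rng(0)
n_per_block, p_in, p_out = 100, 0.5, 0.05   # hypothetical model parameters

# Ground-truth community of each node.
truth = np.repeat([0, 1], n_per_block)
n = truth.size

# Sample a symmetric adjacency matrix with no self-loops.
probs = np.where(truth[:, None] == truth[None, :], p_in, p_out)
upper = np.triu(rng.random((n, n)) < probs, k=1)
A = (upper | upper.T).astype(float)

# Symmetric normalized graph Laplacian.
deg = A.sum(1)
L = np.eye(n) - A / np.sqrt(np.outer(deg, deg))

# The eigenvector of the second-smallest eigenvalue carries the block split.
_, vecs = np.linalg.eigh(L)
labels = (vecs[:, 1] > 0).astype(int)

# Count misclustered nodes up to a relabelling of the two communities.
errors = int((labels != truth).sum())
misclustered = min(errors, n - errors)
print("misclustered nodes:", misclustered)
```

With blocks this well separated, very few nodes should be misclustered; making `p_in` and `p_out` closer, or increasing the number of blocks, is where bounds of the kind described in the abstract become informative.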
Performance Analysis of Spectral Clustering on Compressed, Incomplete and Inaccurate Measurements
Spectral clustering is one of the most widely used techniques for extracting
the underlying global structure of a data set. Compressed sensing and matrix
completion have emerged as prevailing methods for efficiently recovering sparse
and partially observed signals respectively. We combine the distance preserving
measurements of compressed sensing and matrix completion with the power of
robust spectral clustering. Our analysis provides rigorous bounds on how small
errors in the affinity matrix can affect the spectral coordinates and
clusterability. This work generalizes the current perturbation results of
two-class spectral clustering to incorporate multi-class clustering with k
eigenvectors. We thoroughly track how small perturbations from using compressed
sensing and matrix completion affect the affinity matrix and, in turn, the
spectral coordinates. These perturbation results for multi-class clustering
require an eigengap between the kth and (k+1)th eigenvalues of the affinity
matrix, which naturally occurs in data with k well-defined clusters. Our
theoretical guarantees are complemented with numerical results along with a
number of examples of the unsupervised organization and clustering of image
data
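The role of the eigengap can be illustrated numerically. The sketch below is a toy under stated assumptions, not the paper's analysis: it builds an idealized block affinity matrix for k = 3 clusters, adds a small symmetric Gaussian perturbation as a stand-in for measurement error, and compares the top-k eigenspaces before and after through their largest principal angle. The matrix values, the noise level, and the helper `top_k` are all hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
k, size = 3, 30
n = k * size
membership = np.repeat(np.arange(k), size)

# Idealized affinity: strong within clusters (1.0), weak between clusters (0.1).
A = np.where(membership[:, None] == membership[None, :], 1.0, 0.1)

# Small symmetric perturbation standing in for inaccurate measurements.
noise = rng.normal(0.0, 0.02, size=(n, n))
E = (noise + noise.T) / 2.0
A_noisy = A + E

def top_k(M, k):
    # eigh returns eigenvalues in ascending order, so reverse to get the largest first.
    vals, vecs = np.linalg.eigh(M)
    return vals[::-1], vecs[:, ::-1][:, :k]

vals, U = top_k(A, k)
_, U_noisy = top_k(A_noisy, k)

# Gap between the kth and (k+1)th eigenvalues of the clean affinity matrix.
eigengap = vals[k - 1] - vals[k]

# Largest principal angle between the clean and perturbed spectral subspaces.
cosines = np.linalg.svd(U.T @ U_noisy, compute_uv=False)
sin_theta = float(np.sqrt(np.clip(1.0 - cosines ** 2, 0.0, None)).max())
print("eigengap:", eigengap, "largest sin(angle):", sin_theta)
```

Here the eigengap is large relative to the perturbation, so the spectral subspace barely moves. Shrinking the gap (for instance by raising the between-cluster affinity) or increasing the noise makes the angle grow, which is the qualitative behaviour that perturbation bounds of this kind quantify.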
Covariate-assisted spectral clustering
Biological and social systems consist of myriad interacting units. The
interactions can be represented in the form of a graph or network. Measurements
of these graphs can reveal the underlying structure of these interactions,
which provides insight into the systems that generated the graphs. Moreover, in
applications such as connectomics, social networks, and genomics, graph data
are accompanied by contextualizing measures on each node. We utilize these node
covariates to help uncover latent communities in a graph, using a modification
of spectral clustering. Statistical guarantees are provided under a joint
mixture model that we call the node-contextualized stochastic blockmodel,
including a bound on the mis-clustering rate. The bound is used to derive
conditions for achieving perfect clustering. For most simulated cases,
covariate-assisted spectral clustering yields results superior to regularized
spectral clustering without node covariates and to an adaptation of canonical
correlation analysis. We apply our clustering method to large brain graphs
derived from diffusion MRI data, using the node locations or neurological
region membership as covariates. In both cases, covariate-assisted spectral
clustering yields clusters that are easier to interpret neurologically. Comment: 28 pages, 4 figures, includes substantial changes to theoretical
result
Spectral Embedding Norm: Looking Deep into the Spectrum of the Graph Laplacian
The extraction of clusters from a dataset which includes multiple clusters
and a significant background component is a non-trivial task of practical
importance. In image analysis this manifests for example in anomaly detection
and target detection. The traditional spectral clustering algorithm, which
relies on the leading $K$ eigenvectors to detect $K$ clusters, fails in such
cases. In this paper we propose the spectral embedding norm, which sums
the squared values of the first $I$ normalized eigenvectors, where $I$ can be
significantly larger than $K$. We prove that this quantity can be used to
separate clusters from the background in unbalanced settings, including extreme
cases such as outlier detection. The performance of the algorithm is not
sensitive to the choice of $I$, and we demonstrate its application on synthetic
and real-world remote sensing and neuroimaging datasets
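The quantity described in this abstract, a per-node sum of squared eigenvector entries, is simple to compute. The sketch below is one possible reading and not the paper's implementation: it assumes unit-norm eigenvectors of the symmetric normalized Laplacian (the paper's exact normalization may differ), a Gaussian-kernel affinity with a hand-picked bandwidth, and a hypothetical function name and parameter `n_vectors` for the number of eigenvectors summed.

```python
import numpy as np

def spectral_embedding_norm(W, n_vectors):
    # Symmetric normalized Laplacian of the affinity matrix W.
    deg = W.sum(axis=1)
    L = np.eye(len(W)) - W / np.sqrt(np.outer(deg, deg))
    # eigh returns orthonormal eigenvectors sorted by ascending eigenvalue.
    _, vecs = np.linalg.eigh(L)
    # Per-node sum of squared entries over the first n_vectors eigenvectors.
    return (vecs[:, :n_vectors] ** 2).sum(axis=1)

# Toy data: one small dense cluster embedded in a uniform background.
rng = np.random.default_rng(0)
cluster = rng.normal(0.0, 0.05, size=(20, 2))
background = rng.uniform(-1.0, 1.0, size=(80, 2))
points = np.vstack([cluster, background])

# Gaussian-kernel affinity with an illustrative bandwidth of 0.1, no self-loops.
d2 = ((points[:, None, :] - points[None, :, :]) ** 2).sum(-1)
W = np.exp(-d2 / (2.0 * 0.1 ** 2))
np.fill_diagonal(W, 0.0)

norms = spectral_embedding_norm(W, n_vectors=10)
print("mean norm on cluster:", norms[:20].mean())
print("mean norm on background:", norms[20:].mean())
```

Because the eigenvectors are orthonormal, each node's value lies between 0 and 1 and the values sum to `n_vectors`, so thresholding the norm amounts to asking which nodes carry a disproportionate share of the low-frequency spectrum.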