Spectral Clustering with Imbalanced Data
Spectral clustering is sensitive to how graphs are constructed from data,
particularly when proximal and imbalanced clusters are present. We show that
the ratio cut (RCut) and normalized cut (NCut) objectives are not tailored to
imbalanced data, since they tend to emphasize cut sizes over cut values. We
propose a graph partitioning problem that seeks minimum cut partitions under
minimum size constraints on partitions to deal with imbalanced data. Our
approach parameterizes a family of graphs, by adaptively modulating node
degrees on a fixed node set, to yield a set of parameter dependent cuts
reflecting varying levels of imbalance. The solution to our problem is then
obtained by optimizing over these parameters. We present rigorous limit cut
analysis results to justify our approach. We demonstrate the superiority of our
method through unsupervised and semi-supervised experiments on synthetic and
real data sets.
Comment: 24 pages, 7 figures. arXiv admin note: substantial text overlap with
arXiv:1302.513
Clustering and Community Detection with Imbalanced Clusters
Spectral clustering methods, which are frequently used in clustering and
community detection applications, are sensitive to the specific graph
construction, particularly when imbalanced clusters are present. We show that
the ratio cut (RCut) and normalized cut (NCut) objectives are not tailored to
imbalanced cluster sizes, since they tend to emphasize cut sizes over cut
values. We propose a graph partitioning problem that seeks minimum cut
partitions under minimum size constraints on partitions to deal with imbalanced
cluster sizes. Our approach parameterizes a family of graphs by adaptively
modulating node degrees on a fixed node set, yielding a set of parameter
dependent cuts reflecting varying levels of imbalance. The solution to our
problem is then obtained by optimizing over these parameters. We present
rigorous limit cut analysis results to justify our approach and demonstrate the
superiority of our method through experiments on synthetic and real datasets
for data clustering, semi-supervised learning, and community detection.
Comment: Extended version of arXiv:1309.2303 with new applications. Accepted
to IEEE TSIP
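The sensitivity to imbalance that both abstracts describe can be reproduced with off-the-shelf NCut-based spectral clustering; the sketch below is only an illustration of the setup, with hypothetical data and parameters (the blob sizes, n_neighbors, and seeds are not from the papers):

```python
import numpy as np
from sklearn.cluster import SpectralClustering

rng = np.random.default_rng(0)
# Two Gaussian clusters with imbalanced sizes (900 vs. 100 points).
big = rng.normal(loc=[0.0, 0.0], scale=0.5, size=(900, 2))
small = rng.normal(loc=[3.0, 0.0], scale=0.5, size=(100, 2))
X = np.vstack([big, small])

# NCut-based spectral clustering on a k-nearest-neighbor graph.
sc = SpectralClustering(n_clusters=2, affinity="nearest_neighbors",
                        n_neighbors=10, assign_labels="kmeans",
                        random_state=0)
labels = sc.fit_predict(X)
sizes = np.bincount(labels)
print(sorted(sizes))  # partition sizes found by the NCut objective
```

Comparing the reported partition sizes against the true 900/100 split shows whether the small cluster was recovered or the large one was cut instead, which is the failure mode the papers target.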
How to Round Subspaces: A New Spectral Clustering Algorithm
A basic problem in spectral clustering is the following: if a solution
obtained from the spectral relaxation is close to an integral solution, is it
possible to find this integral solution even though the two might be expressed
in completely different bases? In this paper, we propose a new spectral
clustering algorithm. It can recover a k-partition such that the subspace
corresponding to the span of its indicator vectors is close to the original
subspace in spectral norm, with the distance bounded in terms of the minimum
possible. Moreover, our algorithm does not impose any restriction on the
cluster sizes. Previously, no algorithm was known which could find a
k-partition with comparable closeness.
We present two applications of our algorithm. The first finds a disjoint
union of bounded-degree expanders that approximates a given graph in spectral
norm. The second approximates the sparsest k-partition in a graph in which
each cluster has bounded expansion, provided a corresponding gap condition on
the eigenvalues of the Laplacian matrix holds. This significantly improves
upon previous algorithms, which required a stronger eigenvalue condition.
Comment: Appeared in SODA 201
Recent Advances in Graph Partitioning
We survey recent trends in practical algorithms for balanced graph
partitioning, together with applications and future research directions.
Preconditioned Spectral Clustering for Stochastic Block Partition Streaming Graph Challenge
Locally Optimal Block Preconditioned Conjugate Gradient (LOBPCG) is
demonstrated to efficiently solve eigenvalue problems for graph Laplacians that
appear in spectral clustering. For static graph partitioning, 10-20 iterations
of LOBPCG without preconditioning result in ~10x error reduction, enough to
achieve 100% correctness for all Challenge datasets with known truth
partitions, e.g., for graphs with 5K/0.1M (50K/1M) vertices/edges in 2 (7)
seconds, compared to over 5,000 (30,000) seconds needed by the baseline Python
code. Our Python code determines 98 (160) clusters with 100% correctness from the
Challenge static graphs with 0.5M (2M) vertices in 270 (1,700) seconds using
10GB (50GB) of memory. Our single-precision MATLAB code calculates the same
clusters in half the time and memory. For streaming graph partitioning, LOBPCG is
initiated with approximate eigenvectors of the graph Laplacian already computed
for the previous graph, in many cases reducing the number of required LOBPCG
iterations by a factor of 2-3 compared to the static case. Our spectral
clustering is generic, i.e., it assumes nothing specific about the block model
or the streaming process used to generate the Challenge graphs, in contrast to
the baseline code.
Nevertheless, in a 10-stage streaming comparison with the baseline code for the
5K graph, the quality of our clusters is similar or better starting at stage 4
(7) for emerging-edge (snowball) streaming, while the computations are over
100-1000 times faster.
Comment: 6 pages. To appear in Proceedings of the 2017 IEEE High Performance
Extreme Computing Conference. Student Innovation Award Streaming Graph
Challenge: Stochastic Block Partition, see
http://graphchallenge.mit.edu/champion
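The warm-start strategy described above can be sketched with SciPy's lobpcg; the random graph, block size, and tolerances below are illustrative assumptions, not the Challenge setup:

```python
import numpy as np
import scipy.sparse as sp
from scipy.sparse.csgraph import laplacian
from scipy.sparse.linalg import lobpcg

rng = np.random.default_rng(0)

# Random sparse symmetric adjacency as a stand-in for one streaming stage.
A = sp.random(200, 200, density=0.05, random_state=0, format="csr")
A = sp.triu(A, k=1)
A = A + A.T
A.data[:] = 1.0  # unweighted edges
L = sp.csr_matrix(laplacian(A))

# First stage: random initial block of k approximate eigenvectors.
k = 4
X0 = rng.standard_normal((200, k))
vals, vecs = lobpcg(L, X0, largest=False, tol=1e-5, maxiter=200)

# Next stage: perturb the graph, then warm-start LOBPCG with the
# eigenvectors already computed for the previous graph.
A2 = A.tolil()
A2[0, 1] = A2[1, 0] = 1.0
L2 = sp.csr_matrix(laplacian(A2.tocsr()))
vals2, vecs2 = lobpcg(L2, vecs, largest=False, tol=1e-5, maxiter=200)
```

Because the perturbed Laplacian is close to the previous one, the warm-started second call typically needs fewer iterations than a cold start, which is the effect the abstract reports.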
Covariate-assisted spectral clustering
Biological and social systems consist of myriad interacting units. The
interactions can be represented in the form of a graph or network. Measurements
of these graphs can reveal the underlying structure of these interactions,
which provides insight into the systems that generated the graphs. Moreover, in
applications such as connectomics, social networks, and genomics, graph data
are accompanied by contextualizing measures on each node. We utilize these node
covariates to help uncover latent communities in a graph, using a modification
of spectral clustering. Statistical guarantees are provided under a joint
mixture model that we call the node-contextualized stochastic blockmodel,
including a bound on the mis-clustering rate. The bound is used to derive
conditions for achieving perfect clustering. For most simulated cases,
covariate-assisted spectral clustering yields results superior to regularized
spectral clustering without node covariates and to an adaptation of canonical
correlation analysis. We apply our clustering method to large brain graphs
derived from diffusion MRI data, using the node locations or neurological
region membership as covariates. In both cases, covariate-assisted spectral
clustering yields clusters that are easier to interpret neurologically.
Comment: 28 pages, 4 figures, includes substantial changes to theoretical
results
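A common recipe matching this description clusters the leading eigenvectors of a regularized graph Laplacian plus a covariate term weighted by a tuning parameter; the simulated blockmodel, covariate, and alpha below are illustrative assumptions, not the paper's exact estimator:

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(1)
n = 60
z = np.repeat([0, 1], n // 2)  # ground-truth communities

# Adjacency from a simple two-block stochastic blockmodel.
P = np.where(z[:, None] == z[None, :], 0.3, 0.05)
A = (rng.random((n, n)) < P).astype(float)
A = np.triu(A, 1)
A = A + A.T

# One noisy node covariate that also carries community information.
Xc = z[:, None] + 0.3 * rng.standard_normal((n, 1))

# Regularized Laplacian plus a covariate term weighted by alpha.
d = A.sum(axis=1)
tau = d.mean()  # regularization, a common default choice
Dinv = np.diag(1.0 / np.sqrt(d + tau))
Lreg = Dinv @ A @ Dinv
alpha = 0.5  # tuning parameter balancing graph vs. covariate signal
M = Lreg + alpha * (Xc @ Xc.T)

# Cluster rows of the top-2 eigenvector matrix of the combined operator.
eigvals, eigvecs = np.linalg.eigh(M)
labels = KMeans(n_clusters=2, n_init=10,
                random_state=0).fit_predict(eigvecs[:, -2:])
```

The weight alpha controls how much the covariates influence the embedding relative to the graph, which is the trade-off the statistical guarantees in the abstract are about.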