Clustering Partially Observed Graphs via Convex Optimization
This paper considers the problem of clustering a partially observed
unweighted graph---i.e., one where for some node pairs we know there is an edge
between them, for some others we know there is no edge, and for the remaining
we do not know whether or not there is an edge. We want to organize the nodes
into disjoint clusters so that there is relatively dense (observed)
connectivity within clusters, and sparse across clusters.
We take a novel yet natural approach to this problem, by focusing on finding
the clustering that minimizes the number of "disagreements"---i.e., the sum of
the number of (observed) missing edges within clusters, and (observed) present
edges across clusters. Our algorithm uses convex optimization; its basis is a
reduction of disagreement minimization to the problem of recovering an
(unknown) low-rank matrix and an (unknown) sparse matrix from their partially
observed sum. We evaluate the performance of our algorithm on the classical
Planted Partition/Stochastic Block Model. Our main theorem provides sufficient
conditions for the success of our algorithm as a function of the minimum
cluster size, edge density and observation probability; in particular, the
results characterize the tradeoff between the observation probability and the
edge density gap. When there are a constant number of clusters of equal size,
our results are optimal up to logarithmic factors.
Comment: This is the final version published in Journal of Machine Learning
Research (JMLR). Partial results appeared in International Conference on
Machine Learning (ICML) 201
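The "disagreements" objective described in the abstract above can be illustrated with a minimal sketch. It only evaluates the cost of a given clustering and is not the paper's convex-optimization algorithm. The function name and the dictionary-based input format are assumptions made for this illustration.

```python
def count_disagreements(labels, observed):
    """Count disagreements of a clustering on a partially observed graph.

    labels:   dict mapping node -> cluster id (hypothetical representation)
    observed: dict mapping (u, v) -> True if an edge is observed,
              False if a non-edge is observed; unobserved pairs are absent
    """
    disagreements = 0
    for (u, v), has_edge in observed.items():
        same_cluster = labels[u] == labels[v]
        # A disagreement is an observed missing edge inside a cluster
        # or an observed present edge across clusters.
        if same_cluster != has_edge:
            disagreements += 1
    return disagreements


# Toy example: two clusters {0, 1} and {2, 3}, four observed pairs.
labels = {0: "a", 1: "a", 2: "b", 3: "b"}
observed = {(0, 1): True, (2, 3): False, (0, 2): True, (1, 3): False}
example_cost = count_disagreements(labels, observed)
```

In this toy input, the pair (2, 3) is an observed missing edge within a cluster and the pair (0, 2) is an observed edge across clusters, so two of the four observed pairs count as disagreements. Pairs that were never observed do not contribute.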
Guaranteed clustering and biclustering via semidefinite programming
Identifying clusters of similar objects in data plays a significant role in a
wide range of applications. As a model problem for clustering, we consider the
densest k-disjoint-clique problem, whose goal is to identify the collection of
k disjoint cliques of a given weighted complete graph maximizing the sum of the
densities of the complete subgraphs induced by these cliques. In this paper, we
establish conditions ensuring exact recovery of the densest k cliques of a
given graph from the optimal solution of a particular semidefinite program. In
particular, the semidefinite relaxation is exact for input graphs corresponding
to data consisting of k large, distinct clusters and a smaller number of
outliers. This approach also yields a semidefinite relaxation for the
biclustering problem with similar recovery guarantees. Given a set of objects
and a set of features exhibited by these objects, biclustering seeks to
simultaneously group the objects and features according to their expression
levels. This problem may be posed as partitioning the nodes of a weighted
bipartite complete graph such that the sum of the densities of the resulting
bipartite complete subgraphs is maximized. As in our analysis of the densest
k-disjoint-clique problem, we show that the correct partition of the objects
and features can be recovered from the optimal solution of a semidefinite
program in the case that the given data consists of several disjoint sets of
objects exhibiting similar features. Empirical evidence from numerical
experiments supporting these theoretical guarantees is also provided.
Community Detection via Measure Space Embedding
We present a new algorithm for community detection. The algorithm uses random walks to embed the graph in a space of measures, after which a modification of k-means in that space is applied. The algorithm is therefore fast and easily parallelizable. We evaluate the algorithm on standard random graph benchmarks, including some overlapping community benchmarks, and find its performance to be better than or at least as good as previously known algorithms. We also prove a linear time (in number of edges) guarantee for the algorithm on a p, q-stochastic block model with where p ≥ c ·
Improved Theoretical and Practical Guarantees for Chromatic Correlation Clustering
We study a natural generalization of the correlation clustering problem to graphs in which the pairwise relations between objects are categorical instead of binary. This problem was recently introduced by Bonchi et al. under the name of chromatic correlation clustering, and is motivated by many real-world applications in data-mining and social networks, including community detection, link classification, and entity de-duplication. Our main contribution is a fast and easy-to-implement constant approximation framework for the problem, which builds on a novel reduction of the problem to that of correlation clustering. This result significantly progresses the current state of knowledge for the problem, improving on a previous result that only guaranteed linear approximation in the input size. We complement the above result by developing a linear programming-based algorithm that achieves an improved approximation ratio of 4. Although this algorithm cannot be considered to be practical, it further extends our theoretical understanding of chromatic correlation clustering. We also present a fast heuristic algorithm that is motivated by real-life scenarios in which there is a ground-truth clustering that is obscured by noisy observations. We test our algorithms on both synthetic and real datasets, like social networks data. Our experiments reinforce the theoretical findings by demonstrating that our algorithms generally outperform previous approaches, both in terms of solution cost and reconstruction of an underlying ground-truth clustering.
Improved Algorithms for the Random Cluster Graph Model
The following probabilistic process models the generation of noisy clustering data: Clusters correspond to disjoint sets of vertices in a graph. Each two vertices from the same set are connected by an edge with probability p, and each two vertices from different sets are connected by an edge with probability r < p. The goal of the clustering problem is to reconstruct the clusters from the graph. We give algorithms that solve this problem with high probability. Compared to previous studies, our algorithms have lower time complexity and wider parameter range of applicability. In particular, our algorithms can handle O( n/ log n) clusters in an n-vertex graph, while all previous algorithms require that the number of clusters is constant
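The generative process described in the abstract above can be sketched as follows. This is only a sampler for the random cluster graph model, not one of the paper's reconstruction algorithms. The function name, the list-of-lists cluster format, and the `seed` parameter are assumptions made for this illustration.

```python
import random


def sample_cluster_graph(clusters, p, r, seed=None):
    """Sample an edge set from the random cluster graph model.

    clusters: list of disjoint lists of vertices (hypothetical input format)
    p:        edge probability for two vertices in the same cluster
    r:        edge probability for two vertices in different clusters (r < p)
    """
    if not 0.0 <= r < p <= 1.0:
        raise ValueError("expected 0 <= r < p <= 1")
    rng = random.Random(seed)
    # Map each vertex to the index of its cluster.
    label = {v: i for i, cluster in enumerate(clusters) for v in cluster}
    nodes = sorted(label)
    edges = set()
    for i, u in enumerate(nodes):
        for v in nodes[i + 1:]:
            prob = p if label[u] == label[v] else r
            # Each pair is connected independently with its own probability.
            if rng.random() < prob:
                edges.add((u, v))
    return edges


# Noise-free extreme: p = 1 and r = 0 yield exactly the within-cluster pairs.
clean_edges = sample_cluster_graph([[0, 1, 2], [3, 4]], p=1.0, r=0.0)
```

With `p = 1` and `r = 0` the sampled graph is a disjoint union of cliques, so the clusters can be read off directly. The reconstruction problem becomes harder as the gap between `p` and `r` shrinks.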