Provable Estimation of the Number of Blocks in Block Models
Community detection is a fundamental unsupervised learning problem for
unlabeled networks and has a broad range of applications. Many community
detection algorithms assume that the number of clusters is known a priori.
In this paper, we propose an approach based on semidefinite relaxations
which, unlike many existing convex relaxation methods, does not require prior
knowledge of model parameters, and which recovers both the number of clusters
and the clustering matrix exactly under a broad parameter regime, with
probability tending to one. On a variety of simulated and real data
experiments, we show that the proposed method often outperforms
state-of-the-art techniques for estimating the number of clusters.
Comment: 12 pages, 4 figures; AISTATS 201
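When a method of this kind returns an exact clustering matrix X (with X[i][j] = 1 iff nodes i and j share a cluster), the number of clusters can be read off directly. A minimal sketch in Python (the helper name `num_clusters` is illustrative, not from the paper):

```python
def num_clusters(X):
    # X is a clustering matrix: X[i][j] == 1 iff nodes i and j are in the
    # same cluster (so X[i][i] == 1). Node i contributes 1/|cluster(i)|,
    # and the contributions within any one cluster sum to exactly 1, so
    # the grand total equals the number of clusters.
    return round(sum(1.0 / sum(row) for row in X))

# Two clusters: {0, 1} and {2, 3, 4}.
members = [0, 0, 1, 1, 1]
X = [[1 if a == b else 0 for b in members] for a in members]
print(num_clusters(X))  # -> 2
```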
On semidefinite relaxations for the block model
The stochastic block model (SBM) is a popular tool for community detection in
networks, but fitting it by maximum likelihood (MLE) involves a computationally
infeasible optimization problem. We propose a new semidefinite programming
(SDP) solution to the problem of fitting the SBM, derived as a relaxation of
the MLE. We put ours and previously proposed SDPs in a unified framework, as
relaxations of the MLE over various sub-classes of the SBM, revealing a
connection to sparse PCA. Our main relaxation, which we call SDP-1, is tighter
than other recently proposed SDP relaxations, and thus previously established
theoretical guarantees carry over. However, we show that SDP-1 exactly recovers
true communities over a wider class of SBMs than those covered by current
results. In particular, the assumption of strong assortativity of the SBM,
implicit in consistency conditions for previously proposed SDPs, can be relaxed
to weak assortativity for our approach, thus significantly broadening the class
of SBMs covered by the consistency results. We also show that strong
assortativity is indeed a necessary condition for exact recovery for previously
proposed SDP approaches and not an artifact of the proofs. Our analysis of SDPs
is based on primal-dual witness constructions, which provide some insight into
the nature of the solutions of various SDPs. We show how to combine features
from SDP-1 and already available SDPs to achieve the most flexibility in terms
of both assortativity and block-size constraints, as our relaxation has the
tendency to produce communities of similar sizes. This tendency makes it the
ideal tool for fitting network histograms, a method gaining popularity in the
graphon estimation literature, as we illustrate on an example of a social
network of dolphins. We also provide empirical evidence that SDPs outperform
spectral methods for fitting SBMs with a large number of blocks.
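As background for the model being fit, a minimal sketch of sampling an adjacency matrix from an SBM (pure Python; `sample_sbm` is an illustrative helper, not code from the paper):

```python
import random

def sample_sbm(sizes, P, seed=0):
    # Sample a symmetric adjacency matrix from a stochastic block model.
    # sizes: community sizes; P[a][b]: edge probability between
    # communities a and b (P is assumed symmetric).
    rng = random.Random(seed)
    labels = [a for a, s in enumerate(sizes) for _ in range(s)]
    n = len(labels)
    A = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            if rng.random() < P[labels[i]][labels[j]]:
                A[i][j] = A[j][i] = 1
    return A, labels

# Assortative example: dense within blocks, sparse across.
A, labels = sample_sbm([10, 10], [[0.8, 0.1], [0.1, 0.8]])
```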
A Survey on Theoretical Advances of Community Detection in Networks
Real-world networks usually have community structure, that is, nodes are
grouped into densely connected communities. Community detection is one of the
most popular and best-studied research topics in network science and has
attracted attention in many different fields, including computer science,
statistics, and the social sciences, among others. Numerous approaches for
community detection have been proposed in the literature, from ad hoc algorithms to
systematic model-based approaches. The large number of available methods leads
to a fundamental question: whether a certain method can provide consistent
estimates of community labels. The stochastic blockmodel (SBM) and its variants
provide a convenient framework for the study of such problems. This article is
a survey on the recent theoretical advances of community detection. The authors
review a number of community detection methods and their theoretical
properties, including graph cut methods, profile likelihoods, the
pseudo-likelihood method, the variational method, belief propagation, spectral
clustering, and semidefinite relaxations of the SBM. The authors also briefly
discuss other research topics in community detection such as robust community
detection, community detection with nodal covariates and model selection, as
well as suggest a few possible directions for future research.
Comment: WIREs Computational Statistics, 201
Convex Relaxation Methods for Community Detection
This paper surveys recent theoretical advances in convex optimization
approaches for community detection. We introduce some important theoretical
techniques and results for establishing the consistency of convex community
detection under various statistical models. In particular, we discuss the basic
techniques based on the primal and dual analysis. We also present results that
demonstrate several distinctive advantages of convex community detection,
including robustness against outlier nodes, consistency under weak
assortativity, and adaptivity to heterogeneous degrees.
This survey is not intended to be a complete overview of the vast literature
on this fast-growing topic. Instead, we aim to provide a big picture of the
remarkable recent development in this area and to make the survey accessible to
a broad audience. We hope that this expository article can serve as an
introductory guide for readers who are interested in using, designing, and
analyzing convex relaxation methods in network analysis.
Comment: 22 pages
On Robustness of Kernel Clustering
Clustering is one of the most important unsupervised problems in machine
learning and statistics. Among many existing algorithms, kernel k-means has
drawn much research attention due to its ability to find non-linear cluster
boundaries and its inherent simplicity. There are two main approaches for
kernel k-means: SVD of the kernel matrix and convex relaxations. Despite the
attention kernel clustering has received both from theoretical and applied
quarters, not much is known about robustness of the methods. In this paper we
first introduce a semidefinite programming relaxation for the kernel clustering
problem, then prove that under a suitable model specification both the K-SVD
and SDP approaches are consistent in the limit, although SDP is strongly
consistent, i.e., achieves exact recovery, whereas K-SVD is weakly consistent,
i.e., the fraction of misclassified nodes vanishes.
Comment: 20 pages, 3 figures
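A minimal Lloyd-style sketch of the kernel k-means objective discussed here (pure Python and illustrative only; a practical implementation would use optimized linear algebra and more careful initialization):

```python
import math
import random

def rbf_kernel(X, gamma=1.0):
    # Gram matrix of the Gaussian (RBF) kernel for a list of points.
    n = len(X)
    return [[math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(X[i], X[j])))
             for j in range(n)] for i in range(n)]

def kernel_kmeans(K, k, iters=50, seed=0):
    # Lloyd-style kernel k-means: every distance is expressed through the
    # Gram matrix K, so cluster boundaries may be non-linear in input space.
    n = len(K)
    seeds = random.Random(seed).sample(range(n), k)
    # Initialize by assigning each point to its most similar seed point.
    labels = [max(range(k), key=lambda c: K[i][seeds[c]]) for i in range(n)]
    for _ in range(iters):
        # Guard against empty clusters by reseeding them with one point.
        clusters = [[i for i in range(n) if labels[i] == c] or [c]
                    for c in range(k)]
        # Constant part of the squared feature-space distance to each mean.
        within = [sum(K[a][b] for a in cl for b in cl) / len(cl) ** 2
                  for cl in clusters]
        new = [min(range(k),
                   key=lambda c: within[c]
                   - 2 * sum(K[i][j] for j in clusters[c]) / len(clusters[c]))
               for i in range(n)]
        if new == labels:
            break
        labels = new
    return labels
```

The per-point assignment uses the standard kernel-trick expansion of the squared distance to a cluster mean, dropping the constant K[i][i] term that does not affect the argmin.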
Point Localization and Density Estimation from Ordinal kNN graphs using Synchronization
We consider the problem of embedding unweighted, directed k-nearest neighbor
graphs in low-dimensional Euclidean space. The k-nearest neighbors of each
vertex provide ordinal information on the distances between points, but not
the distances themselves. We use this ordinal information along with the
low-dimensionality to recover the coordinates of the points up to arbitrary
similarity transformations (rigid transformations and scaling). Furthermore, we
also illustrate the possibility of robustly recovering the underlying density
via the Total Variation Maximum Penalized Likelihood Estimation (TV-MPLE)
method. We make existing approaches scalable by using an instance of a
local-to-global algorithm based on group synchronization, recently proposed in
the literature in the context of sensor network localization and structural
biology, which we augment with a scaling synchronization step. We demonstrate
the scalability of our approach on large graphs, and show how it compares to
the Local Ordinal Embedding (LOE) algorithm, which was recently proposed for
recovering the configuration of a cloud of points from pairwise ordinal
comparisons between a sparse set of distances.
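The input object here, a directed kNN graph carrying only ordinal information, can be sketched as follows (illustrative helper, not the authors' code):

```python
def knn_graph(points, k):
    # Directed k-nearest-neighbor graph: an edge i -> j records only the
    # ordinal fact that j is among i's k closest points; the distances
    # themselves are discarded.
    n = len(points)
    graph = {}
    for i in range(n):
        dists = sorted(
            (sum((a - b) ** 2 for a, b in zip(points[i], points[j])), j)
            for j in range(n) if j != i)
        graph[i] = [j for _, j in dists[:k]]
    return graph

# Note the asymmetry: 2's nearest neighbor is 1, but not vice versa.
g = knn_graph([[0.0], [1.0], [3.0], [7.0]], k=1)
```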
Exponential error rates of SDP for block models: Beyond Grothendieck's inequality
In this paper we consider the cluster estimation problem under the Stochastic
Block Model. We show that the semidefinite programming (SDP) formulation for
this problem achieves an error rate that decays exponentially in the
signal-to-noise ratio. The error bound implies weak recovery in the sparse
graph regime with bounded expected degrees, as well as exact recovery in the
dense regime. An immediate corollary of our results yields error bounds under
the Censored Block Model. Moreover, these error bounds are robust, continuing
to hold under heterogeneous edge probabilities and a form of the so-called
monotone attack.
Significantly, this error rate is achieved by the SDP solution itself without
any further pre- or post-processing, and improves upon existing
polynomially-decaying error bounds proved using Grothendieck's
inequality. Our analysis has two key ingredients: (i) showing that the graph
has a well-behaved spectrum, even in the sparse regime, after discounting an
exponentially small number of edges, and (ii) an order-statistics argument that
governs the final error rate. Both arguments highlight the implicit
regularization effect of the SDP formulation.
Maximum Likelihood Latent Space Embedding of Logistic Random Dot Product Graphs
A latent space model for a family of random graphs assigns real-valued
vectors to nodes of the graph such that edge probabilities are determined by
latent positions. Latent space models provide a natural statistical framework
for graph visualization and clustering. A latent space model of particular
interest is the Random Dot Product Graph (RDPG), which can be fit using an
efficient spectral method; however, this method is based on a heuristic that
can fail, even in simple cases. Here, we consider a closely related latent
space model, the Logistic RDPG, which uses a logistic link function to map from
latent positions to edge likelihoods. Over this model, we show that
asymptotically exact maximum likelihood inference of latent position vectors
can be achieved using an efficient spectral method. Our method involves
computing top eigenvectors of a normalized adjacency matrix and scaling
eigenvectors using a regression step. The novel regression scaling step is an
essential part of the proposed method. In simulations, we show that our
proposed method is more accurate and more robust than common practices. We also
show the effectiveness of our approach on the standard real networks of the
karate club and political blogs.
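The generative side of the Logistic RDPG, with edge probabilities given by a logistic link applied to inner products of latent positions, can be sketched as (illustrative helper, not the authors' code):

```python
import math
import random

def sample_logistic_rdpg(latent, seed=0):
    # Logistic RDPG: edge (i, j) appears independently with probability
    # sigmoid(<x_i, x_j>), where x_i is node i's latent position vector.
    rng = random.Random(seed)
    n = len(latent)
    A = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            dot = sum(a * b for a, b in zip(latent[i], latent[j]))
            p = 1.0 / (1.0 + math.exp(-dot))  # logistic link
            if rng.random() < p:
                A[i][j] = A[j][i] = 1
    return A

# Large positive inner product -> edge almost surely; large negative -> almost never.
A = sample_logistic_rdpg([[10.0], [10.0], [-10.0]])
```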
Decoding binary node labels from censored edge measurements: Phase transition and efficient recovery
We consider the problem of clustering a graph into two communities by
observing a subset of the vertex correlations. Specifically, we consider the
inverse problem with observed variables y = B_G x ⊕ z, where B_G is the
incidence matrix of a graph G, x is the vector of unknown vertex variables
(with a uniform prior) and z is a noise vector with Bernoulli(ε)
i.i.d. entries. All variables and operations are Boolean. This model is
motivated by coding, synchronization, and community detection problems. In
particular, it corresponds to a stochastic block model or a correlation
clustering problem with two communities and censored edges. Without noise,
exact recovery (up to a global flip) of x is possible if and only if the graph
G is connected, with a sharp threshold at the edge probability log(n)/n for
Erdős–Rényi random graphs. The first goal of this paper is to determine
how the edge probability p needs to scale to allow exact recovery in the
presence of noise. Defining the degree (oversampling) rate of the graph by
α = np/log(n), it is shown that exact recovery is possible if and only
if α > 2/(1-2ε)² + o(1/(1-2ε)²). In other words, 2/(1-2ε)²
is the information-theoretic threshold for exact
recovery at low SNR. In addition, an efficient recovery algorithm based on
semidefinite programming is proposed and shown to succeed in the threshold
regime up to twice the optimal rate. For a deterministic graph G, defining
the degree rate as α = d/log(n), where d is the minimum degree of the
graph, it is shown that the proposed method achieves the rate
α > 4((1+λ)/(1-λ)²)·2/(1-2ε)² + o(1/(1-2ε)²),
where λ is the spectral gap of the graph G.
Comment: will appear in the IEEE Transactions on Network Science and
Engineering
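The measurement model, and the noiseless recovery-by-propagation baseline that the connectivity condition refers to, can be sketched as follows (illustrative helpers; the paper's noisy-case algorithm is an SDP, not shown):

```python
import random

def censored_measurements(edges, x, eps, seed=0):
    # Each observed edge (i, j) yields x_i XOR x_j, flipped independently
    # with probability eps (the Bernoulli noise vector z).
    rng = random.Random(seed)
    return [x[i] ^ x[j] ^ (1 if rng.random() < eps else 0) for i, j in edges]

def propagate_labels(edges, y, n):
    # Noiseless-case baseline: fix x_0 = 0 and propagate labels along
    # observed edges. On a connected graph this recovers x up to the
    # global flip that the uniform prior leaves unidentifiable.
    x = [None] * n
    x[0] = 0
    changed = True
    while changed:
        changed = False
        for (i, j), obs in zip(edges, y):
            if x[i] is not None and x[j] is None:
                x[j] = x[i] ^ obs
                changed = True
            elif x[j] is not None and x[i] is None:
                x[i] = x[j] ^ obs
                changed = True
    return x
```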
Two provably consistent divide and conquer clustering algorithms for large networks
In this article, we advance divide-and-conquer strategies for solving the
community detection problem in networks. We propose two algorithms that
perform clustering on a number of small subgraphs and finally patch the
results into a single clustering. The main advantage of these algorithms is
that they significantly reduce the computational cost of traditional
algorithms, including spectral clustering, semidefinite programs,
modularity-based methods, and likelihood-based methods, without losing
accuracy, and at times even improving it. These algorithms are also, by
nature, parallelizable. Thus, exploiting the facts that most traditional
algorithms are accurate and that the corresponding optimization problems are
much simpler on small instances, our divide-and-conquer methods provide an
omnibus recipe for scaling traditional algorithms up to large networks. We
prove consistency of these algorithms under various subgraph selection
procedures and perform extensive simulations and real-data analysis to
understand the advantages of the divide-and-conquer approach in various
settings.
Comment: 41 pages, comments are most welcome
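The "patching" step, aligning cluster labels from overlapping subgraph clusterings before merging them, can be sketched as follows (illustrative helper; the paper's actual stitching procedures differ in detail):

```python
def patch_clusterings(labels_a, labels_b):
    # labels_a, labels_b: dicts node -> cluster label from clusterings of
    # two overlapping subgraphs. Relabel each b-cluster to the a-label
    # most common among the shared nodes, then merge the two clusterings.
    shared = set(labels_a) & set(labels_b)
    mapping = {}
    for c in set(labels_b.values()):
        votes = [labels_a[v] for v in shared if labels_b[v] == c]
        if votes:  # b-clusters with no overlap cannot be aligned
            mapping[c] = max(set(votes), key=votes.count)
    merged = dict(labels_a)
    for v, c in labels_b.items():
        if v not in merged and c in mapping:
            merged[v] = mapping[c]
    return merged

# Node 4 appears only in the second subgraph; the overlap on nodes 2 and 3
# identifies b-cluster 9 with a-cluster 1.
merged = patch_clusterings({0: 0, 1: 0, 2: 1, 3: 1}, {2: 9, 3: 9, 4: 9})
```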