Consistent Estimation of Mixed Memberships with Successive Projections
This paper considers the parameter estimation problem in the Mixed Membership
Stochastic Block Model (MMSB), a quite general random graph model that allows
for overlapping community structure. We present a new algorithm, successive
projection overlapping clustering (SPOC), which combines the ideas of spectral
clustering with a geometric approach to separable non-negative matrix
factorization. The proposed algorithm is provably consistent under the MMSB
with general conditions on the parameters of the model. SPOC is also shown to
perform well experimentally in comparison to other algorithms.
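The geometric step that SPOC builds on can be sketched with the classical successive projection idea: repeatedly pick the row of largest norm and project it out. The snippet below is a minimal toy illustration on synthetic separable data, not the authors' implementation, and all variable names are hypothetical.

```python
# Minimal sketch of successive projection on rows of a separable matrix
# (toy data; not the authors' SPOC implementation).
import numpy as np

def successive_projection(V, k):
    """Pick k rows of V that approximate the vertices of its convex hull."""
    R = V.astype(float).copy()
    anchors = []
    for _ in range(k):
        j = int(np.argmax(np.linalg.norm(R, axis=1)))  # farthest remaining row
        anchors.append(j)
        u = R[j] / np.linalg.norm(R[j])
        R = R - np.outer(R @ u, u)  # project out the chosen direction
    return anchors

# Toy check: rows are convex combinations of 3 "pure" rows (indices 0, 1, 2).
rng = np.random.default_rng(0)
W = np.vstack([np.eye(3), rng.dirichlet(np.ones(3), size=20)])
H = rng.random((3, 5))
idx = successive_projection(W @ H, 3)
print(sorted(idx))
```

Under exact separability, the row of maximum norm is always a vertex of the convex hull, so the three pure rows are recovered.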
Alternative sampling for variational quantum Monte Carlo
Expectation values of physical quantities may be accurately obtained by the
evaluation of integrals within many-body quantum mechanics, and these
multi-dimensional integrals may be estimated using Monte Carlo methods. A
previous publication has shown that for the simplest, most commonly applied
strategy in continuum Quantum Monte Carlo, the random error in the resulting
estimates is not well controlled. At best the Central Limit Theorem is valid in
its weakest form, and at worst it is invalid and replaced by an alternative
Generalised Central Limit Theorem and non-Normal random error. In both cases
the random error is not controlled. Here we consider a new 'residual sampling
strategy' that reintroduces the Central Limit Theorem in its strongest form and
provides full control of the random error in estimates. Estimates of the total
energy and the variance of the local energy within Variational Monte Carlo are
considered in detail, and the approach presented may be generalised to
expectation values of other operators and to other variants of the Quantum
Monte Carlo method.
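The uncontrolled-error phenomenon can be seen in a deliberately simple toy, a heavy-tailed sample mean rather than the paper's residual sampling strategy: when the sampled quantity has infinite variance, the usual n^(-1/2) error decay promised by the Central Limit Theorem no longer applies.

```python
# Toy illustration (not the paper's method): a Monte Carlo mean over
# heavy-tailed samples has poorly controlled error, unlike the Gaussian case.
import numpy as np

rng = np.random.default_rng(1)
n = 200_000

light = rng.normal(0.0, 1.0, n)   # finite variance: CLT error ~ n**-0.5
heavy = rng.pareto(1.5, n)        # infinite variance: standard CLT fails

err_light = abs(light.mean() - 0.0)
err_heavy = abs(heavy.mean() - 2.0)  # Lomax(1.5) has mean 1/(1.5 - 1) = 2
print(err_light, err_heavy)
```

With shape parameter 1.5 the variance is infinite, so the heavy-tailed error fluctuates far more from run to run than the n^(-1/2) scaling of the Gaussian case.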
Comparing spectra of graph shift operator matrices
Typically, network structures are represented by one of three different graph shift operator matrices: the adjacency matrix, the unnormalised Laplacian matrix, and the normalised Laplacian matrix. To enable a sensible comparison of their spectral (eigenvalue) properties, an affine transform is first applied to one of them, which preserves eigengaps. Bounds, which depend on the minimum and maximum degree of the network, are given on the resulting eigenvalue differences. The monotonicity of the bounds is related to the structure of the network. Bounds, again depending on the minimum and maximum degree of the network, are also given for normalised eigengap differences, which are used in spectral clustering. Results are illustrated on the karate dataset and a stochastic block model. If the degree extreme difference is large, different choices of graph shift operator matrix may give rise to disparate inference drawn from network analysis; conversely, a smaller degree extreme difference results in consistent inference.
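The three operators compared above can be computed side by side on a small example. This is only a toy path graph to show the differing spectra; the paper's affine transform and degree-dependent bounds are not reproduced here.

```python
# Spectra of the three graph shift operators on a toy 5-node path graph.
import numpy as np

A = np.zeros((5, 5))
for i in range(4):
    A[i, i + 1] = A[i + 1, i] = 1.0   # path edges

d = A.sum(axis=1)
L = np.diag(d) - A                                 # unnormalised Laplacian
Lsym = np.diag(d**-0.5) @ L @ np.diag(d**-0.5)     # normalised Laplacian

for name, M in [("adjacency", A), ("Laplacian", L), ("norm. Laplacian", Lsym)]:
    print(name, np.round(np.linalg.eigvalsh(M), 3))
```

Note the different ranges: adjacency eigenvalues are symmetric about zero, Laplacian eigenvalues start at 0, and normalised Laplacian eigenvalues lie in [0, 2], which is why a degree-dependent affine transform is needed before eigengaps can be compared directly.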
Large Scale Spectral Clustering Using Approximate Commute Time Embedding
Spectral clustering is a novel clustering method which can detect complex
shapes of data clusters. However, it requires the eigendecomposition of the
graph Laplacian matrix, which takes O(n^3) time and is thus not suitable for
large scale systems. Recently, many methods have been proposed to accelerate
the computation of spectral clustering. These approximate methods usually
involve sampling techniques, through which a lot of information in the original
data may be lost. In this work, we propose a fast and accurate spectral
clustering approach using an approximate commute time embedding, which is
similar to the spectral embedding. The method requires neither sampling nor the
computation of any eigenvector; instead it uses random projection and a linear
time solver to find the approximate embedding. Experiments on several synthetic
and real datasets show that the proposed approach has better clustering quality
and is faster than state-of-the-art approximate spectral clustering methods.
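The embedding being approximated can be written down exactly via the Laplacian pseudoinverse. The sketch below computes that exact (cubic-cost) version on a toy graph; the paper's random-projection and linear-solver acceleration is not reproduced here.

```python
# Exact commute time embedding via the Laplacian pseudoinverse (the paper
# approximates this step with random projection plus a linear time solver).
import numpy as np

def commute_time_embedding(A):
    d = A.sum(axis=1)
    L = np.diag(d) - A
    Lp = np.linalg.pinv(L)           # Moore-Penrose pseudoinverse of L
    vol = d.sum()
    # Commute time c(i,j) = vol * (Lp[i,i] + Lp[j,j] - 2*Lp[i,j]) equals the
    # squared Euclidean distance between rows of sqrt(vol) * Lp^(1/2):
    vals, vecs = np.linalg.eigh(Lp)
    vals = np.clip(vals, 0.0, None)
    return np.sqrt(vol) * vecs * np.sqrt(vals)

# Two triangles joined by a single bridge edge (2-3).
A = np.zeros((6, 6))
for i, j in [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]:
    A[i, j] = A[j, i] = 1.0
X = commute_time_embedding(A)
within = np.linalg.norm(X[0] - X[1])   # same triangle
across = np.linalg.norm(X[0] - X[4])   # opposite triangles
print(within < across)
```

Distances in this embedding equal square-root commute times, so points in the same dense cluster sit close together, which is exactly why k-means on the embedding recovers complex cluster shapes.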
Graph similarity through entropic manifold alignment
In this paper we decouple the problem of measuring graph similarity into two sequential steps. The first step is the linearization of the quadratic assignment problem (QAP) in a low-dimensional space, given by the embedding trick. The second step is the evaluation of an information-theoretic distributional measure, which relies on deformable manifold alignment. The proposed measure is a normalized conditional entropy, which induces a positive definite kernel when symmetrized. We use bypass entropy estimation methods to compute an approximation of the normalized conditional entropy. Our approach, which is purely topological (i.e., it does not rely on node or edge attributes, although it can potentially accommodate them as additional sources of information), is competitive with state-of-the-art graph matching algorithms as a source of correspondence-based graph similarity, but its complexity is linear instead of cubic (although the complexity of the similarity measure is quadratic). We also determine that the best embedding strategy for graph similarity is provided by commute time embedding, and we conjecture that this is related to its invertibility property, since the inverse of the embeddings obtained using our method can be used as a generative sampler of graph structure. The work of the first and third authors was supported by projects TIN2012-32839 and TIN2015-69077-P of the Spanish Government. The work of the second author was supported by a Royal Society Wolfson Research Merit Award.
A Spectral Algorithm with Additive Clustering for the Recovery of Overlapping Communities in Networks
This paper presents a novel spectral algorithm with additive clustering
designed to identify overlapping communities in networks. The algorithm is
based on geometric properties of the spectrum of the expected adjacency matrix
in a random graph model that we call stochastic blockmodel with overlap (SBMO).
An adaptive version of the algorithm, which does not require knowledge of the
number of hidden communities, is proved to be consistent under the SBMO when
the degrees in the graph are (slightly more than) logarithmic. The algorithm is
shown to perform well on simulated data and on real-world graphs with known
overlapping communities. (To appear in Theoretical Computer Science, Elsevier.)
On the Interplay between Strong Regularity and Graph Densification
In this paper we analyze the practical implications of Szemerédi's regularity lemma for the preservation of metric information contained in large graphs. To this end, we present a heuristic algorithm to find regular partitions. Our experiments show that this method is quite robust to the natural sparsification of proximity graphs. In addition, this robustness can be enforced by graph densification.
Mathematical Analysis of Copy Number Variation in a DNA Sample Using Digital PCR on a Nanofluidic Device
Copy Number Variations (CNVs) of regions of the human genome have been associated with multiple diseases. We present a mathematically sound and computationally efficient algorithm to accurately analyze CNV in a DNA sample using a nanofluidic device known as the digital array. This numerical algorithm computes the copy number variation and the associated statistical confidence interval, and is based on results from probability theory and statistics. We also provide formulas which can be used as close approximations.
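The arithmetic behind digital-array CNV estimation can be sketched with the standard Poisson partitioning model: the fraction of positive chambers gives the mean copies per chamber, and the ratio of target to reference gives the copy number. This is a hedged approximation; the paper's exact formulas and confidence intervals may differ, and the chamber count and delta-method interval below are illustrative assumptions.

```python
# Hedged sketch of the standard Poisson model behind digital PCR copy number
# estimation (illustrative; not necessarily the paper's exact formulas).
import math

def lambda_hat(positive, total):
    """Mean copies per chamber from the fraction of positive chambers."""
    p = positive / total
    lam = -math.log(1.0 - p)                    # Poisson: P(empty) = exp(-lam)
    se = math.sqrt(p / (total * (1.0 - p)))     # delta-method standard error
    return lam, se

def cnv_ratio(pos_target, pos_reference, total, z=1.96):
    lt, st = lambda_hat(pos_target, total)
    lr, sr = lambda_hat(pos_reference, total)
    ratio = lt / lr
    # Approximate 95% confidence interval for the ratio on the log scale
    se_log = math.sqrt((st / lt) ** 2 + (sr / lr) ** 2)
    return ratio, (ratio * math.exp(-z * se_log), ratio * math.exp(z * se_log))

# Hypothetical counts on a 765-chamber panel: 600 target-positive chambers
# versus 400 reference-positive chambers.
ratio, (lo, hi) = cnv_ratio(600, 400, 765)
print(round(ratio, 2), round(lo, 2), round(hi, 2))
```

Note that the raw positive fractions (600/765 vs 400/765) understate the ratio; the Poisson correction for chambers holding multiple copies is what makes the estimate accurate at high occupancy.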
Learning an atlas of a cognitive process in its functional geometry
Proceedings of the 22nd International Conference, IPMI 2011, Kloster Irsee, Germany, July 3-8, 2011. In this paper we construct an atlas that captures functional characteristics of a cognitive process from a population of individuals. The functional connectivity is encoded in a low-dimensional embedding space derived from a diffusion process on a graph that represents correlations of fMRI time courses. The atlas is represented by a common prior distribution for the embedded fMRI signals of all subjects. The atlas is not directly coupled to the anatomical space, and can represent functional networks that are variable in their spatial distribution. We derive an algorithm for fitting this generative model to the observed data in a population. Our results in a language fMRI study demonstrate that the method identifies coherent and functionally equivalent regions across subjects. Supported by the National Science Foundation (U.S.) (IIS/CRCNS 0904625; CAREER grant 0642971), the National Institutes of Health (U.S.) (NCRR NAC P41-RR13218; U41RR019703; P01CA067165), the National Institute of Biomedical Imaging and Bioengineering (U.S.) (U54-EB005149), and the Seventh Framework Programme (European Commission) (n°257528, KHRESMOI).
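As a loose illustration of the diffusion-on-a-correlation-graph step (toy synthetic time courses, not fMRI data, and none of the paper's prior model), one can embed signals via the leading nontrivial eigenvector of a row-normalised correlation affinity:

```python
# Toy diffusion embedding of a correlation graph (illustrative only).
import numpy as np

rng = np.random.default_rng(3)
# Two groups of noisy "time courses", each driven by a shared signal.
t = rng.normal(size=(2, 200))
signals = np.vstack([t[0] + 0.3 * rng.normal(size=(10, 200)),
                     t[1] + 0.3 * rng.normal(size=(10, 200))])

W = np.abs(np.corrcoef(signals))            # affinity from correlations
np.fill_diagonal(W, 0.0)
P = W / W.sum(axis=1, keepdims=True)        # row-stochastic diffusion operator
vals, vecs = np.linalg.eig(P)
order = np.argsort(-vals.real)
phi = vecs.real[:, order[1]]                # first nontrivial coordinate

# Signals driven by the same source land on the same side of the embedding.
print(np.sign(phi[:10]), np.sign(phi[10:]))
```

The embedding coordinate is blind to where each signal "lives" spatially, which mirrors the paper's point that the atlas need not be coupled to anatomical space.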
k is the Magic Number -- Inferring the Number of Clusters Through Nonparametric Concentration Inequalities
Most convex and nonconvex clustering algorithms come with one crucial
parameter: the k in k-means. To this day, there is no generally accepted way to
accurately determine this parameter. Popular methods are simple yet
theoretically unfounded, such as searching for an elbow in the curve of a given
cost measure. In contrast, statistically founded methods often make strict
assumptions about the data distribution or come with their own optimization
scheme for the clustering objective. This limits either the set of applicable
datasets or of clustering algorithms. In this paper, we strive to determine the
number of clusters by answering a simple question: given two clusters, is it
likely that they jointly stem from a single distribution? To this end, we
propose a bound on the probability that two clusters originate from the
distribution of the unified cluster, specified only by the sample mean and
variance. Our method is applicable as a simple wrapper to the result of any
clustering method minimizing the objective of k-means, which includes Gaussian
mixtures and Spectral Clustering. Our experimental evaluation focuses on an
application to nonconvex clustering and demonstrates the suitability of our
theoretical results. Our SpecialK clustering algorithm automatically determines
the appropriate value for k, without requiring any data transformation or
projection, and without assumptions on the data distribution. Additionally, it
is capable of deciding that the data consists of only a single cluster, which
many existing algorithms cannot.
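The merge question can be mimicked with a deliberately crude stand-in: a Chebyshev-style bound computed only from sample means and variances. This is a hypothetical illustration of the wrapper idea, not the paper's nonparametric concentration inequality, and all names below are invented for the example.

```python
# Crude stand-in for the merge test: a Chebyshev-style bound from means and
# variances only (NOT the paper's actual concentration inequality).
import numpy as np

def merge_probability_bound(x, y):
    """Upper bound on seeing this mean gap if x and y shared one distribution."""
    nx, ny = len(x), len(y)
    gap = np.linalg.norm(x.mean(axis=0) - y.mean(axis=0))
    if gap == 0.0:
        return 1.0
    pooled = np.vstack([x, y])
    var = pooled.var(axis=0).sum()       # total variance of the merged cluster
    se2 = var * (1.0 / nx + 1.0 / ny)    # variance scale of the mean gap
    return min(1.0, se2 / gap**2)        # Chebyshev-type tail bound

rng = np.random.default_rng(2)
one = rng.normal(0, 1, size=(300, 2))
a, b = one[:150], one[150:]              # two halves of a single Gaussian
c = rng.normal(5, 1, size=(150, 2))      # a genuinely separate cluster
print(merge_probability_bound(a, b))     # halves of one distribution
print(merge_probability_bound(a, c))     # well-separated clusters
```

Used as a wrapper, one would run k-means for increasing k and stop merging once every cluster pair's bound falls below a chosen threshold, which is the spirit (though not the substance) of the abstract's method.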