Community detection and stochastic block models: recent developments
The stochastic block model (SBM) is a random graph model with planted
clusters. It is widely employed as a canonical model to study clustering and
community detection, and generally provides fertile ground to study the
statistical and computational tradeoffs that arise in network and data
sciences.
This note surveys the recent developments that establish the fundamental
limits for community detection in the SBM, with respect to both
information-theoretic and computational thresholds, and for various recovery
requirements such as exact, partial, and weak recovery (a.k.a. detection). The
main results discussed are the phase transitions for exact recovery at the
Chernoff-Hellinger threshold, the phase transition for weak recovery at the
Kesten-Stigum threshold, the optimal distortion-SNR tradeoff for partial
recovery, the learning of the SBM parameters, and the gap between
information-theoretic and computational thresholds.
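For the symmetric two-community case, the two thresholds have simple closed
forms (standard statements from this literature, reproduced here for
concreteness):

\[
\text{exact recovery in } \mathrm{SSBM}\big(n, 2, \tfrac{a\log n}{n}, \tfrac{b\log n}{n}\big) \text{ is solvable iff } (\sqrt{a}-\sqrt{b})^2 \ge 2,
\]
\[
\text{weak recovery in } \mathrm{SSBM}\big(n, 2, \tfrac{a}{n}, \tfrac{b}{n}\big) \text{ is solvable iff } (a-b)^2 > 2(a+b),
\]

the latter condition being the Kesten-Stigum threshold for two communities.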
The note also covers some of the algorithms developed in the quest to
achieve these limits, in particular two-round algorithms via graph-splitting,
semi-definite programming, linearized belief propagation, and classical and
nonbacktracking spectral methods. A few open problems are also discussed.
Consistency Thresholds for the Planted Bisection Model
The planted bisection model is a random graph model in which the nodes are
divided into two equal-sized communities and then edges are added randomly in a
way that depends on the community membership. We establish necessary and
sufficient conditions for the asymptotic recoverability of the planted
bisection in this model. When the bisection is asymptotically recoverable, we
give an efficient algorithm that successfully recovers it. We also show that
the planted bisection is recoverable asymptotically if and only if with high
probability every node belongs to the same community as the majority of its
neighbors.
Our algorithm for finding the planted bisection runs in time almost linear in
the number of edges. It has three stages: spectral clustering to compute an
initial guess, a "replica" stage to get almost every vertex correct, and then
some simple local moves to finish the job. An independent work by Abbe,
Bandeira, and Hall establishes similar (slightly weaker) results but only in
the case of logarithmic average degree.
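As a minimal sketch of the first (spectral) stage, one can split nodes by the
sign pattern of the second adjacency eigenvector; the function below is
illustrative only (the names are ours, and the paper's almost-linear-time
algorithm is considerably more careful):

import numpy as np
from scipy.sparse import csr_matrix
from scipy.sparse.linalg import eigsh

def spectral_bisection_guess(edges, n):
    """Initial guess for the planted bisection: split nodes by the sign of
    the second eigenvector of the adjacency matrix (illustrative sketch)."""
    rows = [u for u, v in edges] + [v for u, v in edges]
    cols = [v for u, v in edges] + [u for u, v in edges]
    A = csr_matrix((np.ones(len(rows)), (rows, cols)), shape=(n, n))
    vals, vecs = eigsh(A, k=2, which='LA')   # two largest eigenpairs
    second = vecs[:, np.argsort(vals)[0]]    # eigenvector of the second-largest eigenvalue
    return (second >= 0).astype(int)         # community labels in {0, 1}

This initial guess would then be refined by the "replica" stage and the local
moves described above.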
Recovering Structured Probability Matrices
We consider the problem of accurately recovering a matrix B of size M by M,
which represents a probability distribution over M^2 outcomes, given access to
an observed matrix of "counts" generated by taking independent samples from the
distribution B. How can structural properties of the underlying matrix B be
leveraged to yield computationally efficient and information theoretically
optimal reconstruction algorithms? When can accurate reconstruction be
accomplished in the sparse data regime? This basic problem lies at the core of
a number of questions that are currently being considered by different
communities, including building recommendation systems and collaborative
filtering in the sparse data regime, community detection in sparse random
graphs, learning structured models such as topic models or hidden Markov
models, and the efforts from the natural language processing community to
compute "word embeddings".
Our results apply to the setting where B has a low rank structure. For this
setting, we propose an efficient algorithm that accurately recovers the
underlying M by M matrix using Theta(M) samples. This result easily translates
to Theta(M)-sample algorithms for learning topic models and hidden Markov
models. These linear sample complexities are optimal, up to constant
factors, in an extremely strong sense: even testing basic properties of the
underlying matrix (such as whether it has rank 1 or 2) requires Omega(M)
samples. We provide an even stronger lower bound: distinguishing whether a
sequence of observations was drawn from the uniform distribution over M
observations or generated by an HMM with two hidden states requires
Omega(M) observations. This precludes sublinear-sample hypothesis tests for
basic properties, such as identity or uniformity, as well as sublinear-sample
estimators for quantities such as the entropy rate of HMMs.
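To make the setting concrete, here is a naive low-rank baseline (a hedged
sketch, not the paper's algorithm; in the sparse Theta(M)-sample regime a
plain truncated SVD of the empirical counts is known to be suboptimal, which
is exactly the difficulty addressed above):

import numpy as np

def naive_lowrank_estimate(counts, rank):
    """Rank-`rank` estimate of a probability matrix B from an M x M count
    matrix: truncated SVD of the empirical frequencies, projected back onto
    nonnegative, normalized matrices (hypothetical baseline)."""
    freq = counts / counts.sum()                         # empirical distribution over M^2 outcomes
    U, s, Vt = np.linalg.svd(freq, full_matrices=False)  # SVD of the M x M frequency matrix
    B = U[:, :rank] @ np.diag(s[:rank]) @ Vt[:rank, :]   # best rank-`rank` approximation
    B = np.clip(B, 0.0, None)                            # re-impose nonnegativity
    return B / B.sum()                                   # re-impose normalization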
Reconstructing pedigrees: some identifiability questions for a recombination-mutation model
Pedigrees are directed acyclic graphs that represent ancestral relationships
between individuals in a population. Based on a schematic recombination
process, we describe two simple Markov models for sequences evolving on
pedigrees: Model R (recombinations without mutations) and Model RM
(recombinations with mutations). For these models, we ask an identifiability
question: is it possible to construct a pedigree from the joint probability
distribution of extant sequences? We present partial identifiability results
for general pedigrees: we show that when the crossover probabilities are
sufficiently small, certain spanning subgraph sequences can be counted from the
joint distribution of extant sequences. We demonstrate how pedigrees that
previously seemed difficult to distinguish can be distinguished by counting
their spanning subgraph sequences.
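A minimal simulation of the recombination step in Model R conveys the
schematic process (the two-parent interface, the uniform choice of starting
parent, and all names here are our assumptions for illustration):

import random

def recombine(parent_a, parent_b, crossover_prob):
    """Model R child sequence: copy sites from one parent, switching to the
    other parent independently with probability `crossover_prob` between
    adjacent sites; no mutation."""
    assert len(parent_a) == len(parent_b)
    current = random.choice([parent_a, parent_b])  # start from a uniformly chosen parent
    child = []
    for site in range(len(parent_a)):
        child.append(current[site])
        if random.random() < crossover_prob:       # crossover event
            current = parent_b if current is parent_a else parent_a
    return child

For example, recombine(list("AAAAAA"), list("BBBBBB"), 0.1) yields a mosaic
sequence such as ['A', 'A', 'A', 'B', 'B', 'A'].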
Language as a Latent Variable: Discrete Generative Models for Sentence Compression
In this work we explore deep generative models of text in which the latent
representation of a document is itself drawn from a discrete language model
distribution. We formulate a variational auto-encoder for inference in this
model and apply it to the task of compressing sentences. In this application
the generative model first draws a latent summary sentence from a background
language model, and then draws the observed sentence conditioned
on this latent summary. In our empirical evaluation we show that generative
formulations of both abstractive and extractive compression yield
state-of-the-art results when trained on a large amount of supervised data.
Further, we explore semi-supervised compression scenarios where we show that it
is possible to achieve performance competitive with previously proposed
supervised models while training on a fraction of the supervised data.
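Concretely, such a model is trained by maximizing the standard variational
lower bound; writing s for the latent summary sentence and x for the observed
sentence (notation ours, for illustration):

\[
\log p(x) \;\ge\; \mathbb{E}_{q(s \mid x)}\big[\log p(x \mid s)\big] \;-\; \mathrm{KL}\big(q(s \mid x) \,\|\, p(s)\big),
\]

where p(s) is the background language model and q(s | x) is the inference
(compression) network. Since s is a discrete sentence, the expectation is
typically estimated with score-function (REINFORCE-style) gradients rather
than the reparameterization trick.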
Clustering from Sparse Pairwise Measurements
We consider the problem of grouping items into clusters based on few random
pairwise comparisons between the items. We introduce three closely related
algorithms for this task: a belief propagation algorithm approximating the
Bayes optimal solution, and two spectral algorithms based on the
non-backtracking and Bethe Hessian operators. For the case of two symmetric
clusters, we conjecture that these algorithms are asymptotically optimal in
that they detect the clusters as soon as it is information-theoretically
possible to do so. We substantiate this claim for one of the spectral
approaches we introduce.
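A hedged sketch of the Bethe Hessian variant for two clusters follows; the
operator below is the standard H(r) = (r^2 - 1)I - rA + D from the
community-detection literature, applied to an unweighted graph, whereas the
paper's measurement setting would call for a weighted analogue:

import numpy as np
from scipy.sparse import csr_matrix, identity, diags
from scipy.sparse.linalg import eigsh

def bethe_hessian_labels(edges, n):
    """Two-cluster labels from the Bethe Hessian H(r) = (r^2 - 1)I - rA + D,
    with r set to the square root of the average degree; informative
    structure shows up as eigenvectors with negative eigenvalues."""
    rows = [u for u, v in edges] + [v for u, v in edges]
    cols = [v for u, v in edges] + [u for u, v in edges]
    A = csr_matrix((np.ones(len(rows)), (rows, cols)), shape=(n, n))
    degrees = np.asarray(A.sum(axis=1)).ravel()
    r = np.sqrt(degrees.mean())
    H = (r**2 - 1) * identity(n) - r * A + diags(degrees)
    vals, vecs = eigsh(H, k=2, which='SA')     # two smallest (most negative) eigenpairs
    # The eigenvector of the second-smallest eigenvalue separates the clusters.
    return (vecs[:, np.argsort(vals)[1]] >= 0).astype(int)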