Clustering in Block Markov Chains
This paper considers cluster detection in Block Markov Chains (BMCs). These
Markov chains are characterized by a block structure in their transition
matrix. More precisely, the possible states are divided into a finite
number of groups or clusters, such that states in the same cluster exhibit
the same transition rates to other states. One observes a trajectory of the
Markov chain, and the objective is to recover, from this observation only, the
(initially unknown) clusters. In this paper we devise a clustering procedure
that accurately, efficiently, and provably detects the clusters. We first
derive a fundamental information-theoretical lower bound on the detection error
rate satisfied under any clustering algorithm. This bound identifies the
parameters of the BMC, and trajectory lengths, for which it is possible to
accurately detect the clusters. We next develop two clustering algorithms that
can together accurately recover the cluster structure from the shortest
possible trajectories, whenever the parameters allow detection. These
algorithms thus reach the fundamental detectability limit, and are optimal in
that sense.
Comment: 73 pages, 18 plots, second revision
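To make the block structure described in this abstract concrete, here is a minimal sketch of how a trajectory of such a chain could be simulated. It is not the paper's clustering procedure. The names `simulate_bmc`, `cluster_of`, and `p` are hypothetical, and the choice to draw the next state uniformly inside the target cluster is an assumption of this sketch.

```python
import random


def simulate_bmc(cluster_of, p, length, seed=0):
    """Simulate a trajectory of a block Markov chain (illustrative sketch).

    cluster_of[x] is the cluster index of state x (hypothetical name).
    p[a][b] is the probability of jumping from cluster a to cluster b.
    Assumption: within the target cluster, the next state is uniform.
    Every cluster index 0..len(p)-1 must contain at least one state.
    """
    rng = random.Random(seed)
    k = len(p)
    # Group the states by cluster once, so each step is cheap.
    members = [[x for x, c in enumerate(cluster_of) if c == a] for a in range(k)]
    state = rng.randrange(len(cluster_of))
    path = [state]
    for _ in range(length - 1):
        a = cluster_of[state]
        # The transition law depends on the current state only through its cluster.
        b = rng.choices(range(k), weights=p[a])[0]
        state = rng.choice(members[b])
        path.append(state)
    return path


if __name__ == "__main__":
    cluster_of = [0, 0, 0, 1, 1]
    p = [[0.2, 0.8], [0.6, 0.4]]
    print(simulate_bmc(cluster_of, p, length=20))
```

A clustering algorithm in this setting would receive only the returned `path` and would have to recover `cluster_of` up to a relabelling of the clusters.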
Streaming, Memory Limited Algorithms for Community Detection
In this paper, we consider sparse networks consisting of a finite number of
non-overlapping communities, i.e. disjoint clusters, so that there is higher
density within clusters than across clusters. Both the intra- and inter-cluster
edge densities vanish when the size of the graph grows large, making the
cluster reconstruction problem noisier and hence difficult to solve. We are
interested in scenarios where the network size is very large, so that the
adjacency matrix of the graph is hard to manipulate and store. The data stream
model in which columns of the adjacency matrix are revealed sequentially
constitutes a natural framework in this setting. For this model, we develop two
novel clustering algorithms that extract the clusters asymptotically
accurately. The first algorithm is offline, as it needs to store and keep
the assignments of nodes to clusters, and requires a memory that scales
linearly with the network size. The second algorithm is online, as it may
classify a node when the corresponding column is revealed and then discard this
information. This algorithm requires a memory growing sub-linearly with the
network size. To construct these efficient streaming memory-limited clustering
algorithms, we first address the problem of clustering with partial
information, where only a small proportion of the columns of the adjacency
matrix is observed, and develop, for this setting, a new spectral algorithm
which is of independent interest.
Comment: NIPS 201
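The data stream model in this abstract can be illustrated with a small sketch in which adjacency-matrix columns arrive one at a time and each node is classified on arrival, after which its column is discarded. This is not the spectral algorithm of the paper. The names `edge`, `stream_columns`, `classify_online`, and `reference` are hypothetical, and the classifier assumes that a small set of reference nodes with known clusters is already available, which is an assumption made only for illustration.

```python
import random


def edge(u, v, labels, p_in, p_out, seed):
    """Return 1 if nodes u and v are linked, else 0 (symmetric, no self-loops)."""
    if u == v:
        return 0
    a, b = min(u, v), max(u, v)
    n = len(labels)
    # Seeding by the unordered pair keeps the graph symmetric without storing it.
    rng = random.Random(seed * n * n + a * n + b)
    prob = p_in if labels[a] == labels[b] else p_out
    return 1 if rng.random() < prob else 0


def stream_columns(labels, p_in, p_out, seed=0):
    """Yield (node, column) pairs, one adjacency-matrix column at a time."""
    n = len(labels)
    for v in range(n):
        yield v, [edge(u, v, labels, p_in, p_out, seed) for u in range(n)]


def classify_online(stream, reference):
    """Classify each arriving node, then drop its column.

    reference maps a few node ids to known clusters (hypothetical input).
    Each node goes to the cluster whose reference nodes it links to most densely.
    """
    sizes = {}
    for c in reference.values():
        sizes[c] = sizes.get(c, 0) + 1
    for v, column in stream:
        counts = {c: 0 for c in sizes}
        for u, c in reference.items():
            counts[c] += column[u]
        yield v, max(counts, key=lambda c: counts[c] / sizes[c])


if __name__ == "__main__":
    labels = [0] * 10 + [1] * 10
    reference = {0: 0, 1: 0, 2: 0, 10: 1, 11: 1, 12: 1}
    stream = stream_columns(labels, p_in=0.8, p_out=0.1)
    print(list(classify_online(stream, reference)))
```

Only the `reference` dictionary and the running counts are held in memory at any time, which mirrors the memory constraint described above in a very simplified form.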