Scalable and Robust Community Detection with Randomized Sketching
This paper explores and analyzes the unsupervised clustering of large
partially observed graphs. We propose a scalable and provable randomized
framework for clustering graphs generated from the stochastic block model. The
clustering is first applied to a sub-matrix of the graph's adjacency matrix
associated with a reduced graph sketch constructed using random sampling. Then,
the clusters of the full graph are inferred based on the clusters extracted
from the sketch using a correlation-based retrieval step. Uniform random node
sampling is shown to improve the computational complexity over clustering of
the full graph when the cluster sizes are balanced. A new random degree-based
node sampling algorithm is presented which significantly improves upon the
performance of the clustering algorithm even when clusters are unbalanced. This
algorithm improves the phase transitions for matrix-decomposition-based
clustering with regard to computational complexity and minimum cluster size,
which are shown to be nearly dimension-free in the low inter-cluster
connectivity regime. A third sampling technique is shown to improve balance by
randomly sampling nodes based on spatial distribution. We provide analysis and
numerical results using a convex clustering algorithm based on matrix
completion.
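The pipeline described above (sample a sketch, cluster it, then assign the remaining nodes by a correlation-based retrieval step) can be illustrated on a toy stochastic block model. The sketch below is not the paper's convex, matrix-completion-based algorithm: it substitutes plain spectral clustering on the sampled sub-matrix, and the graph sizes and edge probabilities are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# --- Generate a small stochastic block model graph (2 balanced clusters) ---
n, k = 200, 2
labels = np.repeat(np.arange(k), n // k)
p_in, p_out = 0.6, 0.05                  # illustrative SBM parameters
P = np.where(labels[:, None] == labels[None, :], p_in, p_out)
A = (rng.random((n, n)) < P).astype(float)
A = np.triu(A, 1)
A = A + A.T                              # symmetric adjacency, no self-loops

# --- Step 1: uniform random node sampling to form the sketch ---
m = 60                                   # sketch size (assumed parameter)
S = rng.choice(n, size=m, replace=False)
A_sketch = A[np.ix_(S, S)]

# --- Step 2: cluster the sketch (spectral sign split as a stand-in
#     for the paper's convex clustering algorithm) ---
vals, vecs = np.linalg.eigh(A_sketch)    # eigenvalues in ascending order
v2 = vecs[:, -2]                         # second-largest eigenvector
sketch_labels = (v2 > 0).astype(int)

# --- Step 3: correlation-based retrieval for the full graph:
#     assign each node to the sketch cluster it connects to most densely ---
centroids = np.stack(
    [A[:, S[sketch_labels == c]].mean(axis=1) for c in range(k)], axis=1
)                                        # (n, k) mean connectivity per cluster
full_labels = centroids.argmax(axis=1)

# Agreement with ground truth, up to the label permutation
acc = max(np.mean(full_labels == labels), np.mean(full_labels == 1 - labels))
```

Only the m-by-m sketch is eigendecomposed; the retrieval step touches each node's adjacency row once, which is where the computational savings come from.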
When Should You Adjust Standard Errors for Clustering?
In empirical work in economics it is common to report standard errors that
account for clustering of units. Typically, the motivation given for the
clustering adjustments is that unobserved components in outcomes for units
within clusters are correlated. However, because correlation may occur across
more than one dimension, this motivation makes it difficult to justify why
researchers use clustering in some dimensions, such as geographic, but not
others, such as age cohorts or gender. It also makes it difficult to explain
why one should not cluster with data from a randomized experiment. In this
paper, we argue that clustering is in essence a design problem, either a
sampling design or an experimental design issue. It is a sampling design issue
if sampling follows a two-stage process where, in the first stage, a subset of
clusters is sampled randomly from a population of clusters, and, in the second
stage, units are sampled randomly from the sampled clusters. In this
case the clustering adjustment is justified by the fact that there are clusters
in the population that we do not see in the sample. Clustering is an
experimental design issue if the assignment is correlated within the clusters.
We take the view that this second perspective best fits the typical setting in
economics where clustering adjustments are used. This perspective allows us to
shed new light on three questions: (i) when should one adjust the standard
errors for clustering, (ii) when is the conventional adjustment for clustering
appropriate, and (iii) when does the conventional adjustment of the standard
errors matter.
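A minimal simulation can make the contrast concrete. The sketch below is not from the paper: it compares conventional i.i.d. standard errors with Liang–Zeger cluster-robust standard errors on simulated data whose error term has a cluster-level component, so outcomes are correlated within clusters.

```python
import numpy as np

rng = np.random.default_rng(1)

# --- Simulate clustered data: G clusters, cluster-level shock in the error ---
G, n_g = 50, 20
n = G * n_g
cluster = np.repeat(np.arange(G), n_g)
x = rng.normal(size=n)
u = rng.normal(size=G)[cluster] + rng.normal(size=n)  # within-cluster correlation
y = 1.0 + 2.0 * x + u

# --- OLS fit ---
X = np.column_stack([np.ones(n), x])
beta = np.linalg.solve(X.T @ X, X.T @ y)
resid = y - X @ beta
XtX_inv = np.linalg.inv(X.T @ X)

# --- Conventional (i.i.d.) standard errors ---
sigma2 = resid @ resid / (n - X.shape[1])
se_iid = np.sqrt(np.diag(sigma2 * XtX_inv))

# --- Cluster-robust (Liang-Zeger sandwich) standard errors ---
meat = np.zeros((2, 2))
for g in range(G):
    idx = cluster == g
    s = X[idx].T @ resid[idx]            # per-cluster score sum
    meat += np.outer(s, s)
se_cluster = np.sqrt(np.diag(XtX_inv @ meat @ XtX_inv))
```

Here the regressor `x` is i.i.d. across units, so the cluster adjustment barely moves the slope's standard error, but it substantially inflates the intercept's, which loads on the common within-cluster shock.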
Large Scale Spectral Clustering Using Approximate Commute Time Embedding
Spectral clustering is a clustering method that can detect data clusters with
complex shapes. However, it requires the eigendecomposition of the graph
Laplacian matrix, which costs O(n^3) time and is therefore not suitable for
large-scale systems. Recently, many methods have been proposed to accelerate
spectral clustering. These approximate methods usually rely on sampling
techniques, through which much of the information in the original data may be
lost. In this work, we propose a fast and accurate
spectral clustering approach using an approximate commute time embedding, which
is similar to the spectral embedding. The method requires neither sampling nor
the computation of any eigenvectors. Instead, it uses random projection and a
linear-time solver to find the approximate embedding. Experiments on several
synthetic and real datasets show that the proposed approach achieves better
clustering quality and is faster than state-of-the-art approximate spectral
clustering methods.
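The eigenvector-free idea can be sketched as follows: project the edge-incidence representation of the graph down with a random (Johnson–Lindenstrauss) matrix, then recover the embedding by solving a few Laplacian linear systems. This toy version, which is an assumption-laden stand-in rather than the paper's method, uses a dense least-squares solve where the paper relies on a near-linear-time Laplacian solver, and the graph parameters are illustrative.

```python
import numpy as np

rng = np.random.default_rng(2)

# --- Build a small unweighted graph with two clusters ---
n = 80
labels = np.repeat([0, 1], n // 2)
P = np.where(labels[:, None] == labels[None, :], 0.5, 0.01)
A = (rng.random((n, n)) < P).astype(float)
A = np.triu(A, 1)
A = A + A.T
L = np.diag(A.sum(axis=1)) - A           # graph Laplacian

# --- Signed edge-incidence matrix B (unit edge weights here) ---
src, dst = np.nonzero(np.triu(A, 1))
m = len(src)
B = np.zeros((m, n))
B[np.arange(m), src] = 1.0
B[np.arange(m), dst] = -1.0

# --- Random projection down to d dimensions ---
d = 32
Q = rng.choice([-1.0, 1.0], size=(d, m)) / np.sqrt(d)
Y = Q @ B                                # d x n

# --- Solve d Laplacian systems L z_j = y_j; a dense pseudoinverse solve
#     stands in for the fast solver on this toy graph ---
Z = np.linalg.lstsq(L, Y.T, rcond=None)[0].T   # d x n embedding

# Squared column distances approximate effective resistance (commute time
# up to a vol(G) factor); k-means on the columns would finish the clustering.
r = lambda i, j: np.sum((Z[:, i] - Z[:, j]) ** 2)
r_same = np.mean([r(i, i + 1) for i in range(0, 20, 2)])       # within cluster
r_cross = np.mean([r(i, n - 1 - i) for i in range(10)])        # across clusters
```

Same-cluster pairs should sit closer in the embedding than cross-cluster pairs, which is exactly the property the downstream k-means step exploits.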
Efficient computation of the Weighted Clustering Coefficient
The clustering coefficient of an unweighted network has been used extensively to quantify how tightly connected the neighborhood around a node is, and it has been widely adopted for assessing the quality of nodes in a social network. Computing the clustering coefficient is challenging because it requires counting the number of triangles in the graph. Several recent works have proposed efficient sampling, streaming, and MapReduce algorithms that overcome this computational bottleneck. The intensity of the interaction between nodes, usually represented with weights on the edges of the graph, is also an important measure of the statistical cohesiveness of a network. Various notions of weighted clustering coefficient have recently been proposed, but all of these techniques are hard to implement on large-scale graphs. In this work we show how standard sampling techniques can be used to obtain efficient estimators for the most commonly used measures of weighted clustering coefficient. Furthermore, we also propose a novel graph-theoretic notion of clustering coefficient in weighted networks. © 2016 Taylor & Francis Group, LLC.
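For one commonly used measure, the Onnela et al. weighted clustering coefficient, a generic wedge-sampling estimator illustrates the sampling idea: instead of enumerating all pairs of a node's neighbors, draw pairs uniformly and average the triangle weight term. This is a standard sampling illustration, not the specific estimators analyzed in the paper.

```python
import numpy as np

rng = np.random.default_rng(3)

# --- Random small weighted graph ---
n = 30
A = np.triu((rng.random((n, n)) < 0.4).astype(float), 1)
W = A * rng.uniform(0.1, 1.0, size=(n, n))
W = W + W.T
W_hat = W / W.max()                      # weights normalized to (0, 1]

def onnela_exact(i):
    """Exact Onnela et al. weighted clustering coefficient of node i."""
    nbrs = np.nonzero(W[i])[0]
    k = len(nbrs)
    if k < 2:
        return 0.0
    total = 0.0
    for a in range(k):
        for b in range(a + 1, k):
            j, h = nbrs[a], nbrs[b]
            if W[j, h] > 0:              # wedge closes into a triangle
                total += (W_hat[i, j] * W_hat[i, h] * W_hat[j, h]) ** (1 / 3)
    return 2 * total / (k * (k - 1))

def onnela_sampled(i, samples=2000):
    """Estimate the same quantity by uniform wedge sampling at node i."""
    nbrs = np.nonzero(W[i])[0]
    if len(nbrs) < 2:
        return 0.0
    acc = 0.0
    for _ in range(samples):
        j, h = rng.choice(nbrs, size=2, replace=False)
        if W[j, h] > 0:
            acc += (W_hat[i, j] * W_hat[i, h] * W_hat[j, h]) ** (1 / 3)
    return acc / samples

exact = onnela_exact(0)
approx = onnela_sampled(0)
```

Because each sampled wedge contributes an unbiased draw of the per-pair triangle term, the estimate concentrates around the exact value with a number of samples independent of the node's degree.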