Axioms for graph clustering quality functions
We investigate properties that intuitively ought to be satisfied by graph
clustering quality functions, that is, functions that assign a score to a
clustering of a graph. Graph clustering, also known as network community
detection, is often performed by optimizing such a function. Two axioms
tailored for graph clustering quality functions are introduced, and the four
axioms introduced in previous work on distance based clustering are
reformulated and generalized for the graph setting. We show that modularity, a
standard quality function for graph clustering, does not satisfy all of these
six properties. This motivates the derivation of a new family of quality
functions, adaptive scale modularity, which does satisfy the proposed axioms.
Adaptive scale modularity has two parameters, which give greater flexibility in
the kinds of clusterings that can be found. Standard graph clustering quality
functions, such as normalized cut and unnormalized cut, are obtained as special
cases of adaptive scale modularity.
In general, the results of our investigation indicate that the considered
axiomatic framework covers existing 'good' quality functions for graph
clustering, and can be used to derive an interesting new family of quality
functions.
Comment: 23 pages. Full text and sources available on:
http://www.cs.ru.nl/~T.vanLaarhoven/graph-clustering-axioms-2014
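The abstract above names modularity as a standard quality function but does not state it. As a minimal sketch, assuming the usual Newman-Girvan definition for an undirected, unweighted graph (the sum over clusters of the fraction of internal edges minus the squared fraction of total degree), a clustering can be scored as follows. The function name `modularity` and the example graph are illustrative and not taken from the paper; adaptive scale modularity is not shown because its formula is not given in this listing.

```python
from collections import defaultdict


def modularity(edges, cluster_of):
    """Newman-Girvan modularity of a clustering (assumed standard definition).

    edges: list of (u, v) pairs, each undirected edge listed once.
    cluster_of: dict mapping every node to a cluster label.
    """
    m = len(edges)
    if m == 0:
        raise ValueError("modularity is undefined for a graph without edges")
    internal = defaultdict(int)  # edges with both endpoints in the cluster
    volume = defaultdict(int)    # sum of the degrees of the cluster's nodes
    for u, v in edges:
        cu, cv = cluster_of[u], cluster_of[v]
        volume[cu] += 1
        volume[cv] += 1
        if cu == cv:
            internal[cu] += 1
    return sum(internal[c] / m - (volume[c] / (2 * m)) ** 2 for c in volume)


# Illustrative graph: two triangles joined by a single bridge edge.
edges = [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]
two_clusters = {0: "a", 1: "a", 2: "a", 3: "b", 4: "b", 5: "b"}
one_cluster = {node: "all" for node in range(6)}

q_two = modularity(edges, two_clusters)  # one cluster per triangle
q_one = modularity(edges, one_cluster)   # everything in a single cluster
```

Splitting the graph into its two triangles scores higher than the single-cluster solution, which is the behaviour a quality function of this kind is expected to show.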
Why multi-tracer surveys beat cosmic variance
Galaxy surveys that map multiple species of tracers of large-scale structure
can improve the constraints on some cosmological parameters far beyond the
limits imposed by a simplistic interpretation of cosmic variance. This
enhancement derives from comparing the relative clustering between different
tracers of large-scale structure. We present a simple but fully generic
expression for the Fisher information matrix of surveys with any (discrete)
number of tracers, and show that the enhancement of the constraints on
bias-sensitive parameters is a straightforward consequence of this
multi-tracer Fisher matrix. In fact, the relative clustering amplitudes between
tracers are eigenvectors of this multi-tracer Fisher matrix. The diagonalized
multi-tracer Fisher matrix clearly shows that while the effective volume is
bounded by the physical volume of the survey, the relational information
between species is unbounded. As an application, we study the expected
enhancements in the constraints of realistic surveys that aim at mapping
several different types of tracers of large-scale structure. The gain obtained
by combining multiple tracers is highest at low redshifts, and in one
particular scenario we analyzed, the enhancement can be as large as a factor of
~3 for the accuracy in the determination of the redshift distortion parameter,
and a factor ~5 for the local non-Gaussianity parameter. Radial and angular
distance determinations from the baryonic features in the power spectrum may
also benefit from the multi-tracer approach.
Comment: New references included; 9 pages, 9 figures
Partitioning Complex Networks via Size-constrained Clustering
The most commonly used method to tackle the graph partitioning problem in
practice is the multilevel approach. During a coarsening phase, a multilevel
graph partitioning algorithm reduces the graph size by iteratively contracting
nodes and edges until the graph is small enough to be partitioned by some other
algorithm. A partition of the input graph is then constructed by successively
transferring the solution to the next finer graph and applying a local search
algorithm to improve the current solution.
In this paper, we describe a novel approach to partition graphs effectively,
especially if the networks have a highly irregular structure. More precisely,
our algorithm provides graph coarsening by iteratively contracting
size-constrained clusterings that are computed using a label propagation
algorithm. The same algorithm that provides the size-constrained clusterings
can also be used during uncoarsening as a fast and simple local search
algorithm.
Depending on the algorithm's configuration, we are able to compute partitions
of very high quality outperforming all competitors, or partitions that are
comparable to the best competitor in terms of quality, hMetis, while being
nearly an order of magnitude faster on average. The fastest configuration
partitions the largest graph available to us with 3.3 billion edges using a
single machine in about ten minutes while cutting fewer than half as many edges
as the fastest competitor, kMetis.
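The abstract above describes clusterings computed by label propagation under a size constraint. The following is a minimal sketch of that idea only, not the authors' implementation: every node starts in its own cluster and repeatedly moves to the neighbouring cluster it shares the most edges with, provided that cluster still has room. The function name, the fixed visiting order, the tie-breaking rule, and the `rounds` default are all assumptions made for illustration.

```python
from collections import defaultdict


def size_constrained_label_propagation(adjacency, max_size, rounds=5):
    """Label propagation with an upper bound on cluster size (illustrative).

    adjacency: dict mapping each node to a list of its neighbours (undirected).
    max_size: maximum number of nodes allowed in one cluster.
    rounds: maximum number of passes over the nodes (hypothetical default).
    """
    label = {v: v for v in adjacency}  # every node starts in its own cluster
    size = defaultdict(int)
    for v in adjacency:
        size[v] = 1
    for _ in range(rounds):
        moved = False
        for v in sorted(adjacency):  # fixed order keeps the sketch deterministic
            # Count the edges from v into each neighbouring cluster.
            strength = defaultdict(int)
            for u in adjacency[v]:
                if u != v:
                    strength[label[u]] += 1
            current = label[v]
            best, best_strength = current, strength.get(current, 0)
            for candidate in sorted(strength):
                if candidate == current:
                    continue
                # Only clusters with room for one more node are eligible.
                if size[candidate] + 1 > max_size:
                    continue
                if strength[candidate] > best_strength:
                    best, best_strength = candidate, strength[candidate]
            if best != current:
                size[current] -= 1
                size[best] += 1
                label[v] = best
                moved = True
        if not moved:
            break  # no node changed cluster, so the labelling is stable
    return label


# Illustrative graph: two triangles joined by the bridge edge 2-3.
adjacency = {0: [1, 2], 1: [0, 2], 2: [0, 1, 3],
             3: [2, 4, 5], 4: [3, 5], 5: [3, 4]}
labels = size_constrained_label_propagation(adjacency, max_size=3)
```

In a multilevel scheme of the kind the abstract outlines, each resulting cluster would then be contracted into a single node of the next coarser graph; that contraction step is not part of this sketch.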
Distributed Graph Clustering using Modularity and Map Equation
We study large-scale, distributed graph clustering. Given an undirected
graph, our objective is to partition the nodes into disjoint sets called
clusters. A cluster should contain many internal edges while being sparsely
connected to other clusters. In the context of a social network, a cluster
could be a group of friends. Modularity and map equation are established
formalizations of this internally-dense-externally-sparse principle. We present
two versions of a simple distributed algorithm to optimize both measures. They
are based on Thrill, a distributed big data processing framework that
implements an extended MapReduce model. The algorithms for the two measures,
DSLM-Mod and DSLM-Map, differ only slightly. Adapting them for similar quality
measures is straightforward. We conduct an extensive experimental study on
real-world graphs and on synthetic benchmark graphs with up to 68 billion
edges. Our algorithms are fast while detecting clusterings similar to those
detected by other sequential, parallel and distributed clustering algorithms.
Compared to the distributed GossipMap algorithm, DSLM-Map needs less memory, is
up to an order of magnitude faster and achieves better quality.
Comment: 14 pages, 3 figures; v3: Camera ready for Euro-Par 2018, more
details, more results; v2: extended experiments to include comparison with
competing algorithms, shortened for submission to Euro-Par 201
Comparing clusterings and numbers of clusters by aggregation of calibrated clustering validity indexes
A key issue in cluster analysis is the choice of an appropriate clustering
method and the determination of the best number of clusters. Different
clusterings are optimal on the same data set according to different criteria,
and the choice of such criteria depends on the context and aim of clustering.
Therefore, researchers need to consider which data-analytic characteristics the
clusters they are aiming at are supposed to have, such as within-cluster
homogeneity, between-cluster separation, and stability. Here, a set of
internal clustering validity indexes measuring different aspects of clustering
quality is proposed, including some indexes from the literature. Users can
choose the indexes that are relevant in the application at hand. In order to
measure the overall quality of a clustering (for comparing clusterings from
different methods and/or different numbers of clusters), the index values are
calibrated for aggregation. Calibration is relative to a set of random
clusterings on the same data. Two specific aggregated indexes are proposed and
compared with existing indexes on simulated and real data.
Comment: 42 pages, 11 figures
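The abstract above says that index values are calibrated against random clusterings of the same data before being aggregated, but it does not give the calibration rule. The sketch below assumes one plausible reading: each index is standardised with the mean and standard deviation it takes over uniformly random label assignments, and the standardised values are combined by a weighted mean. The two toy indexes, the uniform random labelling, the z-score rule, and all names are assumptions for illustration, not the paper's definitions.

```python
import random
import statistics


def _group(points, labels):
    """Collect the points of each cluster into a list."""
    clusters = {}
    for x, c in zip(points, labels):
        clusters.setdefault(c, []).append(x)
    return clusters


def within_homogeneity(points, labels):
    """Negative mean absolute deviation from the cluster mean (larger is better)."""
    total = 0.0
    for members in _group(points, labels).values():
        centre = statistics.fmean(members)
        total += sum(abs(x - centre) for x in members)
    return -total / len(points)


def separation(points, labels):
    """Smallest gap between two cluster means (larger is better)."""
    centres = sorted(statistics.fmean(m) for m in _group(points, labels).values())
    if len(centres) < 2:
        return 0.0
    return min(b - a for a, b in zip(centres, centres[1:]))


def calibrated_aggregate(points, labels, indexes, weights, n_random=200, seed=0):
    """Weighted mean of indexes standardised against random clusterings."""
    rng = random.Random(seed)
    k = len(set(labels))
    # Assumption: random clusterings are uniform random label assignments.
    random_labelings = [[rng.randrange(k) for _ in points] for _ in range(n_random)]
    score = 0.0
    for index, weight in zip(indexes, weights):
        reference = [index(points, r) for r in random_labelings]
        mean = statistics.fmean(reference)
        spread = statistics.pstdev(reference)
        z = (index(points, labels) - mean) / spread if spread > 0 else 0.0
        score += weight * z
    return score / sum(weights)


# Illustrative one-dimensional data with two obvious groups.
points = [0.0, 0.1, 0.2, 5.0, 5.1, 5.2]
good = [0, 0, 0, 1, 1, 1]
bad = [0, 1, 0, 1, 0, 1]
indexes = [within_homogeneity, separation]
weights = [1.0, 1.0]

score_good = calibrated_aggregate(points, good, indexes, weights)
score_bad = calibrated_aggregate(points, bad, indexes, weights)
```

Standardising first puts indexes with different units and ranges on a common scale, so the weights express the user's priorities rather than the raw magnitudes of the indexes.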