Combining Multiple Clusterings via Crowd Agreement Estimation and Multi-Granularity Link Analysis
The clustering ensemble technique aims to combine multiple clusterings into a
probably better and more robust clustering and has received increasing
attention in recent years. Existing clustering ensemble approaches have two
main limitations. First, many approaches lack the
ability to weight the base clusterings without access to the original data and
can be affected significantly by low-quality or even ill clusterings.
Second, they generally focus on the instance level or cluster level in the
ensemble system and fail to integrate multi-granularity cues into a unified
model. To address these two limitations, this paper proposes to solve the
clustering ensemble problem via crowd agreement estimation and
multi-granularity link analysis. We present the normalized crowd agreement
index (NCAI) to evaluate the quality of base clusterings in an unsupervised
manner and thus weight the base clusterings in accordance with their clustering
validity. To explore the relationship between clusters, the source-aware
connected triple (SACT) similarity is introduced with regard to their common
neighbors and the source reliability. Based on NCAI and multi-granularity
information collected among base clusterings, clusters, and data instances, we
further propose two novel consensus functions, termed weighted evidence
accumulation clustering (WEAC) and graph partitioning with multi-granularity
link analysis (GP-MGLA), respectively. Experiments are conducted on eight
real-world datasets. The experimental results demonstrate the effectiveness and
robustness of the proposed methods.
Comment: The MATLAB source code of this work is available at:
https://www.researchgate.net/publication/28197031
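The weighting idea behind WEAC can be illustrated with a minimal sketch. It assumes that one reliability weight per base clustering is already available; the abstract does not give the NCAI formula, so the weights here are hypothetical inputs rather than computed NCAI values, and the function name `weighted_coassociation` is illustrative only. The sketch is in Python, not the authors' MATLAB code.

```python
def weighted_coassociation(clusterings, weights):
    """Build a weighted co-association matrix.

    clusterings: list of label lists, one per base clustering, all of length n.
    weights: one non-negative weight per base clustering (hypothetical
             stand-ins for a quality score such as NCAI).
    Entry (i, j) is the weight-normalized share of base clusterings that
    place points i and j in the same cluster.
    """
    n = len(clusterings[0])
    total = sum(weights)
    matrix = [[0.0] * n for _ in range(n)]
    for labels, w in zip(clusterings, weights):
        for i in range(n):
            for j in range(n):
                if labels[i] == labels[j]:
                    matrix[i][j] += w / total
    return matrix


# Three base clusterings of four points; the first is trusted twice as much.
base = [[0, 0, 1, 1], [0, 0, 0, 1], [0, 1, 1, 1]]
co = weighted_coassociation(base, weights=[2.0, 1.0, 1.0])
```

The resulting matrix can serve as a similarity measure for any standard clustering step, for example agglomerative clustering; which consensus step the paper uses beyond what the abstract states is not assumed here.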
Clustering is difficult only when it does not matter
Numerous papers ask how difficult it is to cluster data. We suggest that the
more relevant and interesting question is how difficult it is to cluster data
sets that can be clustered well. More generally, despite the ubiquity and
the great importance of clustering, we still do not have a satisfactory
mathematical theory of clustering. In order to properly understand clustering,
it is clearly necessary to develop a solid theoretical basis for the area. For
example, from the perspective of computational complexity theory the clustering
problem seems very hard. Numerous papers introduce various criteria and
numerical measures to quantify the quality of a given clustering. The resulting
conclusions are pessimistic, since it is computationally difficult to find an
optimal clustering of a given data set, if we go by any of these popular
criteria. In contrast, the practitioners' perspective is much more optimistic.
Our explanation for this disparity of opinions is that complexity theory
concentrates on the worst case, whereas in reality we only care about data sets
that can be clustered well.
We introduce a theoretical framework of clustering in metric spaces that
revolves around a notion of "good clustering". We show that if a good
clustering exists, then in many cases it can be efficiently found. Our
conclusion is that, contrary to popular belief, clustering should not be
considered a hard task.
Clustering and Validation of Microarray Data Using Consensus Clustering
Clustering is a popular method for gleaning useful information from microarray data. Unfortunately, the results obtained from common clustering algorithms are not consistent, and even with multiple runs of different algorithms a further validation step is required. Owing to the absence of well-defined class labels and the unknown number of clusters, the unsupervised learning problem of finding an optimal clustering is hard. Obtaining a consensus of judiciously obtained clusterings not only provides stable results but also lends a high level of confidence in the quality of the results. Several base algorithm runs are used to generate clusterings, and a co-association matrix of pairs of points is obtained using a configurable majority criterion. Using this consensus as a similarity measure, we generate a clustering using four algorithms. Synthetic as well as real-world datasets are used in the experiments, and the results obtained are compared using various internal and external validity measures. Results on real-world datasets showed a marked improvement over those obtained by other researchers with the same datasets.
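The co-association step with a majority criterion might look roughly like the sketch below. It rests on assumptions not stated in the abstract: two points are linked when they share a cluster in more than a `threshold` fraction of runs, and the consensus clusters are read off as connected components of the resulting link graph. The abstract mentions four consensus algorithms without naming them, so connected components is only a stand-in, and the function name `consensus_by_majority` is hypothetical.

```python
def consensus_by_majority(clusterings, threshold=0.5):
    """Link points that co-occur in more than `threshold` of the runs,
    then return connected components as consensus cluster labels."""
    n = len(clusterings[0])
    runs = len(clusterings)
    parent = list(range(n))  # union-find forest over the data points

    def find(x):
        # Follow parent pointers to the root, halving the path on the way.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for i in range(n):
        for j in range(i + 1, n):
            together = sum(1 for labels in clusterings if labels[i] == labels[j])
            if together / runs > threshold:
                parent[find(i)] = find(j)

    # Relabel component roots as 0, 1, 2, ... in order of first appearance.
    relabel = {}
    return [relabel.setdefault(find(i), len(relabel)) for i in range(n)]


runs = [[0, 0, 1, 1], [0, 0, 1, 1], [0, 1, 1, 1]]
majority = consensus_by_majority(runs, threshold=0.5)
strict = consensus_by_majority(runs, threshold=0.9)
```

Raising the threshold makes the criterion stricter: with these three runs, points 0 and 1 agree in only two of three runs, so they are merged at 0.5 but kept apart at 0.9.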
QCD-aware partonic jet clustering for truth-jet flavour labelling
We present an algorithm for deriving partonic flavour labels to be applied to
truth particle jets in Monte Carlo event simulations. The inputs to this
approach are final pre-hadronization partons, to remove dependence on
unphysical details such as the order of matrix element calculation and shower
generator frame recoil treatment. These are clustered using standard jet
algorithms, modified to restrict the allowed pseudojet combinations to those in
which tracked flavour labels are consistent with QCD and QED Feynman rules. The
resulting algorithm is shown to be portable between the major families of
shower generators, and largely insensitive to many possible systematic
variations: it hence offers significant advantages over existing ad hoc
labelling schemes. However, it is shown that contamination from multi-parton
scattering simulations can disrupt the labelling results. Suggestions are made
for further extension to incorporate more detailed QCD splitting function
kinematics, robustness improvements, and potential uses for truth-level physics
object definitions and tagging.