3,808 research outputs found

    Combining Multiple Clusterings via Crowd Agreement Estimation and Multi-Granularity Link Analysis

    The clustering ensemble technique aims to combine multiple clusterings into a probably better and more robust clustering, and it has received increasing attention in recent years. Existing clustering ensemble approaches have two main limitations. First, many approaches cannot weight the base clusterings without access to the original data and can be significantly affected by low-quality or even ill clusterings. Second, they generally focus on either the instance level or the cluster level of the ensemble system and fail to integrate multi-granularity cues into a unified model. To address these two limitations, this paper proposes to solve the clustering ensemble problem via crowd agreement estimation and multi-granularity link analysis. We present the normalized crowd agreement index (NCAI) to evaluate the quality of base clusterings in an unsupervised manner and thus weight the base clusterings in accordance with their clustering validity. To explore the relationship between clusters, the source aware connected triple (SACT) similarity is introduced with regard to their common neighbors and the source reliability. Based on NCAI and multi-granularity information collected among base clusterings, clusters, and data instances, we further propose two novel consensus functions, termed weighted evidence accumulation clustering (WEAC) and graph partitioning with multi-granularity link analysis (GP-MGLA), respectively. Experiments are conducted on eight real-world datasets, and the results demonstrate the effectiveness and robustness of the proposed methods. Comment: The MATLAB source code of this work is available at: https://www.researchgate.net/publication/28197031
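
The idea of weighting base clusterings by how well they agree with the rest of the ensemble, then accumulating weighted evidence, can be sketched roughly as follows. This is not the authors' NCAI or WEAC: the abstract does not give those formulas, so the sketch substitutes the mean pairwise Rand index as a stand-in agreement score, and the function names `rand_index`, `agreement_weights`, and `weighted_coassociation` are hypothetical.

```python
import numpy as np

def rand_index(a, b):
    """Fraction of point pairs that two labelings treat the same way
    (both together or both apart)."""
    a, b = np.asarray(a), np.asarray(b)
    same_a = a[:, None] == a[None, :]
    same_b = b[:, None] == b[None, :]
    upper = np.triu_indices(len(a), k=1)  # each unordered pair once
    return float(np.mean(same_a[upper] == same_b[upper]))

def agreement_weights(clusterings):
    """Assumed stand-in for NCAI: mean Rand index of each base clustering
    with all others, scaled so the best clustering has weight 1.
    Needs at least two base clusterings."""
    m = len(clusterings)
    scores = np.array([
        np.mean([rand_index(clusterings[i], clusterings[j])
                 for j in range(m) if j != i])
        for i in range(m)
    ])
    return scores / scores.max()

def weighted_coassociation(clusterings, weights):
    """Weighted fraction of base clusterings that put each pair of
    points in the same cluster. No access to the original data needed."""
    n = len(clusterings[0])
    co = np.zeros((n, n))
    for labels, w in zip(clusterings, weights):
        labels = np.asarray(labels)
        co += w * (labels[:, None] == labels[None, :])
    return co / np.sum(weights)

# Two base clusterings agree (up to label names); the third is an outlier.
base = [[0, 0, 1, 1, 2, 2],
        [1, 1, 0, 0, 2, 2],
        [0, 1, 0, 1, 0, 1]]
weights = agreement_weights(base)
co = weighted_coassociation(base, weights)
```

The resulting matrix `co` could then be passed as a similarity matrix to any agglomerative or graph-based method; down-weighting the outlier clustering reduces its influence on the pairwise evidence.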

    Clustering is difficult only when it does not matter

    Numerous papers ask how difficult it is to cluster data. We suggest that the more relevant and interesting question is how difficult it is to cluster data sets that can be clustered well. More generally, despite the ubiquity and the great importance of clustering, we still do not have a satisfactory mathematical theory of clustering. In order to properly understand clustering, it is clearly necessary to develop a solid theoretical basis for the area. For example, from the perspective of computational complexity theory the clustering problem seems very hard. Numerous papers introduce various criteria and numerical measures to quantify the quality of a given clustering. The resulting conclusions are pessimistic, since it is computationally difficult to find an optimal clustering of a given data set if we go by any of these popular criteria. In contrast, the practitioners' perspective is much more optimistic. Our explanation for this disparity of opinions is that complexity theory concentrates on the worst case, whereas in reality we only care about data sets that can be clustered well. We introduce a theoretical framework of clustering in metric spaces that revolves around a notion of "good clustering". We show that if a good clustering exists, then in many cases it can be efficiently found. Our conclusion is that, contrary to popular belief, clustering should not be considered a hard task.

    Clustering and Validation of Microarray Data Using Consensus Clustering

    Clustering is a popular method to glean useful information from microarray data. Unfortunately, the results obtained from the common clustering algorithms are not consistent, and even with multiple runs of different algorithms a further validation step is required. Due to the absence of well-defined class labels and the unknown number of clusters, the unsupervised learning problem of finding an optimal clustering is hard. Obtaining a consensus of judiciously obtained clusterings not only provides stable results but also lends a high level of confidence in the quality of results. Several base algorithm runs are used to generate clusterings, and a co-association matrix of pairs of points is obtained using a configurable majority criterion. Using this consensus as a similarity measure, we generate a clustering using four algorithms. Synthetic as well as real-world datasets are used in the experiments, and the results obtained are compared using various internal and external validity measures. Results on real-world datasets showed a marked improvement over those obtained by other researchers with the same datasets.
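
A co-association matrix with a majority criterion can be illustrated with a minimal sketch. The abstract does not name the four algorithms applied to the consensus, so the final step here (linking pairs whose co-association exceeds a threshold and taking connected components) is only an assumed stand-in, and the names `coassociation`, `majority_consensus`, and `threshold` are hypothetical.

```python
import numpy as np

def coassociation(clusterings):
    """Fraction of base runs in which each pair of points shares a cluster."""
    labels = np.asarray(clusterings)                 # shape (runs, n_points)
    same = labels[:, :, None] == labels[:, None, :]  # shape (runs, n, n)
    return same.mean(axis=0)

def majority_consensus(clusterings, threshold=0.5):
    """Link two points when more than `threshold` of the runs co-cluster
    them, then return connected components as consensus labels."""
    co = coassociation(clusterings)
    n = co.shape[0]
    parent = list(range(n))

    def find(i):
        # Union-find root lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i in range(n):
        for j in range(i + 1, n):
            if co[i, j] > threshold:
                parent[find(i)] = find(j)

    # Relabel roots as 0, 1, 2, ... in order of first appearance.
    relabel = {}
    return [relabel.setdefault(find(i), len(relabel)) for i in range(n)]

runs = [[0, 0, 1, 1, 2, 2],
        [1, 1, 0, 0, 2, 2],
        [0, 0, 0, 1, 1, 1]]
consensus = majority_consensus(runs, threshold=0.5)
```

Raising `threshold` makes the criterion stricter and tends to split the data into more, smaller consensus clusters; lowering it merges them.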

    QCD-aware partonic jet clustering for truth-jet flavour labelling

    We present an algorithm for deriving partonic flavour labels to be applied to truth particle jets in Monte Carlo event simulations. The inputs to this approach are final pre-hadronization partons, to remove dependence on unphysical details such as the order of matrix element calculation and shower generator frame recoil treatment. These are clustered using standard jet algorithms, modified to restrict the allowed pseudojet combinations to those in which tracked flavour labels are consistent with QCD and QED Feynman rules. The resulting algorithm is shown to be portable between the major families of shower generators, and largely insensitive to many possible systematic variations: it hence offers significant advantages over existing ad hoc labelling schemes. However, it is shown that contamination from multi-parton scattering simulations can disrupt the labelling results. Suggestions are made for further extension to incorporate more detailed QCD splitting function kinematics, robustness improvements, and potential uses for truth-level physics object definitions and tagging.
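
The restriction on pseudojet combinations can be pictured as a flavour check applied before two pseudojets are merged. The sketch below is an illustration only, not the authors' implementation: it covers quarks, gluons, and photons using PDG particle IDs, ignores leptons, the distance measure, and all kinematics, and the name `combine_flavour` is hypothetical. For a quark and its own antiquark, QED would also allow a photon; the sketch arbitrarily returns a gluon.

```python
GLUON, PHOTON = 21, 22  # PDG particle IDs

def is_quark(pid):
    """Quarks carry PDG IDs 1..6; antiquarks carry the negated ID."""
    return 1 <= abs(pid) <= 6

def combine_flavour(a, b):
    """Return the flavour label of the merged pseudojet, or None when the
    pair does not correspond to a basic QCD/QED splitting vertex."""
    if a == GLUON and b == GLUON:
        return GLUON                      # g -> g g
    for x, y in ((a, b), (b, a)):
        if is_quark(x) and y in (GLUON, PHOTON):
            return x                      # q -> q g  or  q -> q gamma
    if is_quark(a) and b == -a:
        return GLUON                      # g -> q qbar (same flavour only)
    return None                           # e.g. u + d, u + dbar, gamma + gamma

# A clustering loop would skip any candidate pair for which this is None.
allowed = combine_flavour(2, GLUON)       # up quark absorbs a gluon
forbidden = combine_flavour(1, 2)         # down + up has no vertex
```

In a full algorithm this check would sit inside the pairwise-distance step of a standard sequential recombination scheme, so that forbidden pairs are never considered for merging.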