1,239 research outputs found

    Macrostate Data Clustering

    Full text link
    We develop an effective nonhierarchical data clustering method using an analogy to the dynamic coarse graining of a stochastic system. Analyzing the eigensystem of an interitem transition matrix identifies fuzzy clusters corresponding to the metastable macroscopic states (macrostates) of a diffusive system. A "minimum uncertainty criterion" determines the linear transformation from eigenvectors to cluster-defining window functions. Eigenspectrum gap and cluster certainty conditions identify the proper number of clusters. The physically motivated fuzzy representation and associated uncertainty analysis distinguishes macrostate clustering from spectral partitioning methods. Macrostate data clustering solves a variety of test cases that challenge other methods.Comment: keywords: cluster analysis, clustering, pattern recognition, spectral graph theory, dynamic eigenvectors, machine learning, macrostates, classificatio

    LinkCluE: A MATLAB Package for Link-Based Cluster Ensembles

    Get PDF
    Cluster ensembles have emerged as a powerful meta-learning paradigm that provides improved accuracy and robustness by aggregating several input data clusterings. In particular, link-based similarity methods have recently been introduced with superior performance to the conventional co-association approach. This paper presents a MATLAB package, LinkCluE, that implements the link-based cluster ensemble framework. A variety of functional methods for evaluating clustering results, based on both internal and external criteria, are also provided. Additionally, the underlying algorithms together with the sample uses of the package with interesting real and synthetic datasets are demonstrated herein.

    Merging KK-means with hierarchical clustering for identifying general-shaped groups

    Get PDF
    Clustering partitions a dataset such that observations placed together in a group are similar but different from those in other groups. Hierarchical and KK-means clustering are two approaches but have different strengths and weaknesses. For instance, hierarchical clustering identifies groups in a tree-like structure but suffers from computational complexity in large datasets while KK-means clustering is efficient but designed to identify homogeneous spherically-shaped clusters. We present a hybrid non-parametric clustering approach that amalgamates the two methods to identify general-shaped clusters and that can be applied to larger datasets. Specifically, we first partition the dataset into spherical groups using KK-means. We next merge these groups using hierarchical methods with a data-driven distance measure as a stopping criterion. Our proposal has the potential to reveal groups with general shapes and structure in a dataset. We demonstrate good performance on several simulated and real datasets.Comment: 16 pages, 1 table, 9 figures; accepted for publication in Sta

    EnsCat: clustering of categorical data via ensembling

    Get PDF
    Background: Clustering is a widely used collection of unsupervised learning techniques for identifying natural classes within a data set. It is often used in bioinformatics to infer population substructure. Genomic data are often categorical and high dimensional, e.g., long sequences of nucleotides. This makes inference challenging: The distance metric is often not well-defined on categorical data; running time for computations using high dimensional data can be considerable; and the Curse of Dimensionality often impedes the interpretation of the results. Up to the present, however, the literature and software addressing clustering for categorical data has not yet led to a standard approach. Results: We present software for an ensemble method that performs well in comparison with other methods regardless of the dimensionality of the data. In an ensemble method a variety of instantiations of a statistical object are found and then combined into a consensus value. It has been known for decades that ensembling generally outperforms the components that comprise it in many settings. Here, we apply this ensembling principle to clustering. We begin by generating many hierarchical clusterings with different clustering sizes. When the dimension of the data is high, we also randomly select subspaces also of variable size, to generate clusterings. Then, we combine these clusterings into a single membership matrix and use this to obtain a new, ensembled dissimilarity matrix using Hamming distance. Conclusions: Ensemble clustering, as implemented in R and called EnsCat, gives more clearly separated clusters than other clustering techniques for categorical data. The latest version with manual and examples is available at https://github.com/jlp2duke/EnsCat

    A highly efficient multi-core algorithm for clustering extremely large datasets

    Get PDF
    <p>Abstract</p> <p>Background</p> <p>In recent years, the demand for computational power in computational biology has increased due to rapidly growing data sets from microarray and other high-throughput technologies. This demand is likely to increase. Standard algorithms for analyzing data, such as cluster algorithms, need to be parallelized for fast processing. Unfortunately, most approaches for parallelizing algorithms largely rely on network communication protocols connecting and requiring multiple computers. One answer to this problem is to utilize the intrinsic capabilities in current multi-core hardware to distribute the tasks among the different cores of one computer.</p> <p>Results</p> <p>We introduce a multi-core parallelization of the k-means and k-modes cluster algorithms based on the design principles of transactional memory for clustering gene expression microarray type data and categorial SNP data. Our new shared memory parallel algorithms show to be highly efficient. We demonstrate their computational power and show their utility in cluster stability and sensitivity analysis employing repeated runs with slightly changed parameters. Computation speed of our Java based algorithm was increased by a factor of 10 for large data sets while preserving computational accuracy compared to single-core implementations and a recently published network based parallelization.</p> <p>Conclusions</p> <p>Most desktop computers and even notebooks provide at least dual-core processors. Our multi-core algorithms show that using modern algorithmic concepts, parallelization makes it possible to perform even such laborious tasks as cluster sensitivity and cluster number estimation on the laboratory computer.</p

    Ensemble clustering via heuristic optimisation

    Get PDF
    This thesis was submitted for the degree of Doctor of Philosophy and was awarded by Brunel UniversityTraditional clustering algorithms have different criteria and biases, and there is no single algorithm that can be the best solution for a wide range of data sets. This problem often presents a significant obstacle to analysts in revealing meaningful information buried among the huge amount of data. Ensemble Clustering has been proposed as a way to avoid the biases and improve the accuracy of clustering. The difficulty in developing Ensemble Clustering methods is to combine external information (provided by input clusterings) with internal information (i.e. characteristics of given data) effectively to improve the accuracy of clustering. The work presented in this thesis focuses on enhancing the clustering accuracy of Ensemble Clustering by employing heuristic optimisation techniques to achieve a robust combination of relevant information during the consensus clustering stage. Two novel heuristic optimisation-based Ensemble Clustering methods, Multi-Optimisation Consensus Clustering (MOCC) and K-Ants Consensus Clustering (KACC), are developed and introduced in this thesis. These methods utilise two heuristic optimisation algorithms (Simulated Annealing and Ant Colony Optimisation) for their Ensemble Clustering frameworks, and have been proved to outperform other methods in the area. The extensive experimental results, together with a detailed analysis, will be presented in this thesis
    corecore