
    An efficient k-means-type algorithm for clustering datasets with incomplete records

    The k-means algorithm is arguably the most popular nonparametric clustering method but cannot generally be applied to datasets with incomplete records. The usual practice then is to either impute missing values under an assumed missing-completely-at-random mechanism or to ignore the incomplete records, and apply the algorithm on the resulting dataset. We develop an efficient version of the k-means algorithm that allows for clustering in the presence of incomplete records. Our extension is called k_m-means and reduces to the k-means algorithm when all records are complete. We also provide initialization strategies for our algorithm and methods to estimate the number of groups in the dataset. Illustrations and simulations demonstrate the efficacy of our approach in a variety of settings and patterns of missing data. Our methods are also applied to the analysis of activation images obtained from a functional Magnetic Resonance Imaging experiment. Comment: 21 pages, 12 figures, 3 tables, in press, Statistical Analysis and Data Mining -- The ASA Data Science Journal, 201
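
    A minimal sketch of the general idea, in which point-to-centroid distances are computed over observed coordinates only, is given below. It is an illustration of k-means-style clustering with incomplete records, not the authors' exact k_m-means; the NaN encoding of missing values, the rescaling by the fraction of observed coordinates, and the zero-filled initialization are my own simplifying assumptions, and every record is assumed to have at least one observed coordinate.

    import numpy as np

    def kmeans_incomplete(X, k, n_iter=100, seed=0):
        """k-means-style clustering where missing entries of X are NaN."""
        rng = np.random.default_rng(seed)
        obs = ~np.isnan(X)                                # observed-entry mask
        # crude initialization: k random records, missing entries zero-filled
        centroids = np.nan_to_num(X[rng.choice(len(X), size=k, replace=False)])
        for _ in range(n_iter):
            # squared partial distances, rescaled by the fraction observed
            diff = np.where(obs[:, None, :], X[:, None, :] - centroids[None], 0.0)
            d2 = (diff ** 2).sum(-1) * X.shape[1] / obs.sum(1, keepdims=True)
            labels = d2.argmin(1)
            for j in range(k):                            # update from observed values
                members = labels == j
                if members.any():
                    counts = obs[members].sum(0)
                    sums = np.nansum(X[members], axis=0)
                    centroids[j] = np.where(counts > 0,
                                            sums / np.maximum(counts, 1),
                                            centroids[j])
        return labels, centroids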

    K-nearest neighbor intervals based AP clustering algorithm for incomplete data

    The Affinity Propagation (AP) algorithm is an effective algorithm for cluster analysis, but it is not directly applicable to incomplete data. Given the prevalence of missing data and the uncertainty of missing attributes, we put forward a modified AP clustering algorithm based on K-nearest neighbor intervals (KNNI) for incomplete data. Building on an Improved Partial Data Strategy, the proposed algorithm estimates a KNNI representation of the missing attributes from the attribute distribution of the available data, and the similarity function is modified to handle the resulting interval data, so that the improved AP algorithm can be applied to incomplete data. Experiments on several UCI datasets show that the proposed algorithm achieves impressive clustering results.
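
    The interval-estimation step can be illustrated roughly as follows; this is my own reading of the KNNI idea, not the paper's Improved Partial Data Strategy. For each missing cell, the K nearest donors among records observed on that attribute are found with a partial distance over the attributes both records share, and the missing value is represented by the [min, max] interval of the donors' values. It assumes every attribute has at least K observed donors.

    import numpy as np

    def knni_intervals(X, K=3):
        """Replace each NaN cell of X by K-nearest-neighbor interval bounds."""
        obs = ~np.isnan(X)
        lo, hi = X.copy(), X.copy()
        for i, j in zip(*np.where(~obs)):            # every missing cell (i, j)
            donors = np.where(obs[:, j])[0]          # rows observed on attribute j
            shared = obs[i] & obs[donors]            # attributes observed by both
            diff = np.where(shared, X[i] - X[donors], 0.0)
            d2 = (diff ** 2).sum(1) / np.maximum(shared.sum(1), 1)
            nearest = donors[np.argsort(d2)[:K]]
            lo[i, j], hi[i, j] = X[nearest, j].min(), X[nearest, j].max()
        return lo, hi                                # interval bounds per cell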

    Multi-Source Multi-View Clustering via Discrepancy Penalty

    With the advance of technology, entities can be observed in multiple views, and the different types of features those views contain can be used for clustering. Although multi-view clustering has been successfully applied in many applications, previous methods usually assume a complete instance mapping between different views. In many real-world applications, information is gathered from multiple sources, each of which can contain multiple views that are more cohesive for learning. The views under the same source are usually fully mapped, but they can be very heterogeneous. Moreover, the mappings between different sources are usually incomplete and only partially observed, which makes it more difficult to integrate all the views across different sources. In this paper, we propose MMC (Multi-source Multi-view Clustering), a framework based on collective spectral clustering with a discrepancy penalty across sources, to tackle these challenges. MMC has several advantages over existing methods. First, it can deal with incomplete mappings between sources. Second, it considers the disagreements between sources while treating the views within a source as a cohesive set. Third, it infers instance similarities across sources to enhance the clustering performance. Extensive experiments conducted on real-world data demonstrate the effectiveness of the proposed approach.
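
    The flavour of a cross-source discrepancy penalty can be conveyed with a generic two-view co-regularized spectral clustering sketch. This is not the paper's MMC algorithm (which additionally handles incomplete mappings and cohesive within-source view sets); it assumes two fully mapped symmetric similarity matrices S1 and S2 over the same instances, with lam playing the role of the discrepancy penalty.

    import numpy as np
    from scipy.sparse.csgraph import laplacian
    from sklearn.cluster import KMeans

    def top_eigvecs(M, k):
        """Eigenvectors of the k largest eigenvalues of a symmetric matrix."""
        _, vecs = np.linalg.eigh(M)
        return vecs[:, -k:]

    def co_reg_spectral(S1, S2, k, lam=0.5, n_iter=10):
        # negated normalized Laplacians: top eigenvectors = spectral embedding
        L1 = -laplacian(S1, normed=True)
        L2 = -laplacian(S2, normed=True)
        U1, U2 = top_eigvecs(L1, k), top_eigvecs(L2, k)
        for _ in range(n_iter):
            # penalize disagreement by pulling each embedding toward the other
            U1 = top_eigvecs(L1 + lam * (U2 @ U2.T), k)
            U2 = top_eigvecs(L2 + lam * (U1 @ U1.T), k)
        return KMeans(n_clusters=k, n_init=10).fit_predict(np.hstack([U1, U2]))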

    Constrained Optimization for a Subset of the Gaussian Parsimonious Clustering Models

    The expectation-maximization (EM) algorithm is an iterative method for finding maximum likelihood estimates when data are incomplete or are treated as being incomplete. The EM algorithm and its variants are commonly used for parameter estimation in applications of mixture models for clustering and classification, despite the fact that even the Gaussian mixture model likelihood surface contains many local maxima and is riddled with singularities. Previous work has focused on circumventing this problem by constraining the smallest eigenvalue of the component covariance matrices. In this paper, we consider constraining the smallest eigenvalue, the largest eigenvalue, and both together, within the family setting. Specifically, a subset of the GPCM family is considered for model-based clustering, where we use a re-parameterized version of the well-known eigenvalue decomposition of the component covariance matrices. Our approach is illustrated using various experiments with simulated and real data.
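
    The core constraint step, as described in the abstract, amounts to bounding the eigenvalues of each component covariance estimate; a small sketch (my own plain eigenvalue clipping, not the authors' re-parameterized GPCM procedure) is shown below.

    import numpy as np

    def constrain_covariance(S, lam_min=1e-3, lam_max=1e3):
        """Clip the eigenvalues of a symmetric PSD matrix to [lam_min, lam_max]."""
        vals, vecs = np.linalg.eigh(S)
        vals = np.clip(vals, lam_min, lam_max)   # enforce both eigenvalue bounds
        return (vecs * vals) @ vecs.T            # reassemble V diag(vals) V^T

    # Example: a rank-deficient sample covariance is pushed off the singularity.
    rng = np.random.default_rng(0)
    X = rng.normal(size=(50, 3))
    X[:, 2] = X[:, 0]                            # perfectly collinear column
    print(np.linalg.eigvalsh(constrain_covariance(np.cov(X, rowvar=False))))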

    Kernel Spectral Clustering and applications

    In this chapter we review the main literature related to kernel spectral clustering (KSC), an approach to clustering cast within a kernel-based optimization setting. KSC represents a least-squares support vector machine formulation of spectral clustering described by a weighted kernel PCA objective. Just as in the classifier case, the binary clustering model is expressed by a hyperplane in a high-dimensional space induced by a kernel. In addition, multi-way clustering can be obtained by combining a set of binary decision functions via an Error Correcting Output Codes (ECOC) encoding scheme. Because of its model-based nature, the KSC method encompasses three main steps: training, validation, and testing. In the validation stage, model selection is performed to obtain tuning parameters such as the number of clusters present in the data. This is a major advantage over classical spectral clustering, where the determination of the clustering parameters is unclear and relies on heuristics. Once a KSC model is trained on a small subset of the entire data, it is able to generalize well to unseen test points. Beyond the basic formulation, sparse KSC algorithms based on the Incomplete Cholesky Decomposition (ICD) and L_0, L_1, L_0 + L_1, and Group Lasso regularization are reviewed, and we show how they make it possible to handle large-scale data. Two possible ways to perform hierarchical clustering and a soft clustering method are also presented. Finally, real-world applications such as image segmentation, power load time-series clustering, document clustering, and big data learning are considered. Comment: chapter contribution to the book "Unsupervised Learning Algorithms".
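
    A loose numerical sketch of the pipeline's shape is given below: train on a subsample via the eigenvectors of D^{-1} K (the weighted kernel PCA step), score unseen points out-of-sample, and group them by the sign pattern of their score variables in an ECOC-like fashion. The chapter's actual formulation includes bias terms, validation-stage model selection, and matching against the k most frequent training codewords, all of which are simplified away here.

    import numpy as np
    from scipy.linalg import eig
    from sklearn.metrics.pairwise import rbf_kernel

    def ksc_sketch(X_train, X_test, k, gamma=1.0):
        K = rbf_kernel(X_train, X_train, gamma=gamma)
        d_inv = 1.0 / K.sum(1)                     # inverse degrees
        vals, vecs = eig(d_inv[:, None] * K)       # eigenproblem of D^{-1} K
        order = np.argsort(-vals.real)[1:k]        # k-1 informative eigenvectors
        alpha = vecs.real[:, order]                # dual coefficients
        # out-of-sample scores e(x) = sum_i alpha_i k(x, x_i)
        scores = rbf_kernel(X_test, X_train, gamma=gamma) @ alpha
        codes = (scores > 0).astype(int)           # ECOC-style sign encoding
        _, labels = np.unique(codes, axis=0, return_inverse=True)
        return labels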

    Two-stage clustering in genotype-by-environment analyses with missing data

    Cluster analysis has been commonly used in genotype-by-environment (G x E) analyses, but current methods are inadequate when the data matrix is incomplete. This paper proposes a new method, referred to as two-stage clustering, which relies on a partitioning of the squared Euclidean distance into two independent components: the G x E interaction and the genotype main effect. These components are used in the first and second stages of clustering, respectively. Two-stage clustering forms the basis for imputing missing values in the G x E matrix so that a more complete data array is available for other G x E analyses. Imputation for a given genotype uses information from genotypes with similar interaction profiles. This imputation method is shown to improve on an existing nearest-cluster method that confounds the G x E interaction and the genotype main effect.
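
    The distance partitioning that underpins the two stages can be checked numerically; the sketch below reflects my reading of the abstract, using complete response vectors for simplicity. Centering each genotype's vector by its own mean isolates the interaction profile, and the cross term vanishes, so the two components sum exactly to the full squared Euclidean distance.

    import numpy as np

    def gxe_distance_parts(xi, xj):
        """Split squared Euclidean distance between two genotype rows."""
        p = len(xi)                                     # number of environments
        ri, rj = xi - xi.mean(), xj - xj.mean()         # interaction profiles
        interaction = ((ri - rj) ** 2).sum()            # G x E part (stage 1)
        main = p * (xi.mean() - xj.mean()) ** 2         # main effect (stage 2)
        return interaction, main

    rng = np.random.default_rng(1)
    xi, xj = rng.normal(size=8), rng.normal(size=8)
    gxe, main = gxe_distance_parts(xi, xj)
    assert np.isclose(gxe + main, ((xi - xj) ** 2).sum())  # parts sum to the whole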

    Implementation of Self-Organizing Fuzzy Maps on Incomplete Data for Nutritional Clustering of Foodstuffs

    Incomplete data is a problem that can affect clustering results in many tasks involving pattern recognition or clustering. It is a weakness of clustering in general, since nearly all clustering methods work only on complete data, which motivates the search for methods that handle incomplete data directly. One such problem arises in clustering foodstuffs by their nutritional content, because food nutrient data are typically incomplete. This research therefore applies Self-Organizing Fuzzy Maps to cluster incomplete food nutrient data. Implementation and analysis of the results show that Self-Organizing Fuzzy Maps can cluster incomplete food nutrient data by nutritional content, demonstrating that it is a clustering algorithm that can work on incomplete data.
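
    The key adaptation can be illustrated with a plain self-organizing map modified for missing values (an illustration only, not the thesis' exact Self-Organizing Fuzzy Maps): the best-matching unit is found with a partial distance over observed attributes, and each weight vector is updated only on those attributes. NaN encodes a missing attribute, and each record is assumed to have at least one observed attribute.

    import numpy as np

    def som_incomplete(X, n_units=4, n_epochs=50, lr=0.5, sigma=1.0, seed=0):
        rng = np.random.default_rng(seed)
        W = rng.normal(size=(n_units, X.shape[1]))      # 1-D map of weight vectors
        grid = np.arange(n_units)
        for t in range(n_epochs):
            decay = np.exp(-t / n_epochs)               # shrink rate and radius
            for x in X:
                obs = ~np.isnan(x)
                d2 = ((x[obs] - W[:, obs]) ** 2).mean(1)  # partial distance
                bmu = d2.argmin()                         # best-matching unit
                h = np.exp(-((grid - bmu) ** 2) / (2 * (sigma * decay) ** 2))
                W[:, obs] += lr * decay * h[:, None] * (x[obs] - W[:, obs])
        return W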

    Clustering with missing data: which equivalent for Rubin's rules?

    Multiple imputation (MI) is a popular method for dealing with missing values. However, the suitable way to apply clustering after MI remains unclear: how should partitions be pooled, and how should clustering instability be assessed when data are incomplete? By answering both questions, this paper proposes a complete view of clustering with missing data using MI. The problem of pooling partitions is addressed using consensus clustering, while, based on bootstrap theory, we explain how to assess the instability related to observed and missing data. The new rules for pooling partitions and assessing instability are theoretically argued and extensively studied by simulation. Pooling partitions improves accuracy, while measuring instability with missing data enlarges the data analysis possibilities: it allows assessment of the dependence of the clustering on the imputation model, as well as a convenient way to choose the number of clusters when data are incomplete, as illustrated on a real data set. Comment: 39 pages
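
    The pooling side of the pipeline can be sketched as below (the imputation model and the instability measures are out of scope here, and k-means plus average-linkage consensus are my own stand-ins for whatever base clusterer the paper uses): cluster each imputed dataset separately, accumulate a co-association consensus matrix, and extract the pooled partition from it.

    import numpy as np
    from sklearn.cluster import AgglomerativeClustering, KMeans

    def pool_partitions(imputed_datasets, k):
        """Pool one partition per imputed dataset via consensus clustering."""
        n = len(imputed_datasets[0])
        consensus = np.zeros((n, n))
        for X in imputed_datasets:                  # one partition per imputation
            labels = KMeans(n_clusters=k, n_init=10).fit_predict(X)
            consensus += labels[:, None] == labels[None, :]
        consensus /= len(imputed_datasets)          # co-clustering frequencies
        # pooled partition: cluster the dissimilarity 1 - consensus
        return AgglomerativeClustering(
            n_clusters=k, metric="precomputed", linkage="average"
        ).fit_predict(1 - consensus)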