    A Comparative Study of Efficient Initialization Methods for the K-Means Clustering Algorithm

    K-means is undoubtedly the most widely used partitional clustering algorithm. Unfortunately, due to its gradient descent nature, this algorithm is highly sensitive to the initial placement of the cluster centers. Numerous initialization methods have been proposed to address this problem. In this paper, we first present an overview of these methods with an emphasis on their computational efficiency. We then compare eight commonly used linear time complexity initialization methods on a large and diverse collection of data sets using various performance criteria. Finally, we analyze the experimental results using non-parametric statistical tests and provide recommendations for practitioners. We demonstrate that popular initialization methods often perform poorly and that there are in fact strong alternatives to these methods. Comment: 17 pages, 1 figure, 7 tables
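The sensitivity to initial centers mentioned in the abstract above can be illustrated with a minimal sketch. It is not one of the eight methods the paper compares; it is a plain Lloyd-style k-means on hypothetical one-dimensional data, run from two hand-picked starting points. The function name `kmeans`, the data, and both initializations are assumptions made for illustration.

```python
def kmeans(points, centers, max_iter=100):
    """Plain Lloyd-style k-means on 1-D data; returns (centers, SSE)."""
    centers = list(centers)
    for _ in range(max_iter):
        # Assignment step: each point goes to its nearest center.
        clusters = [[] for _ in centers]
        for p in points:
            nearest = min(range(len(centers)), key=lambda j: (p - centers[j]) ** 2)
            clusters[nearest].append(p)
        # Update step: each center moves to the mean of its cluster
        # (an empty cluster keeps its previous center).
        new_centers = [
            sum(c) / len(c) if c else centers[j] for j, c in enumerate(clusters)
        ]
        if new_centers == centers:
            break
        centers = new_centers
    # Sum of squared errors to the nearest center.
    sse = sum(min((p - c) ** 2 for c in centers) for p in points)
    return centers, sse


# Hypothetical data: three well-separated groups of three points each.
points = [0, 1, 2, 10, 11, 12, 20, 21, 22]

# One center per group versus all three centers inside the first group.
good_centers, good_sse = kmeans(points, [1, 11, 21])
bad_centers, bad_sse = kmeans(points, [0, 1, 2])

print(good_centers, good_sse)
print(bad_centers, bad_sse)
```

Both runs converge, but the second one stops in a local minimum with a much larger sum of squared errors, which is the behavior that motivates careful initialization.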

    An overview of clustering methods with guidelines for application in mental health research

    Cluster analyses have been widely used in mental health research to decompose inter-individual heterogeneity by identifying more homogeneous subgroups of individuals. However, despite advances in new algorithms and increasing popularity, there is little guidance on model choice, analytical framework and reporting requirements. In this paper, we aimed to address this gap by introducing the philosophy, design, advantages/disadvantages and implementation of major algorithms that are particularly relevant in mental health research. Extensions of basic models, such as kernel methods, deep learning, semi-supervised clustering, and clustering ensembles are subsequently introduced. How to choose algorithms to address common issues as well as methods for pre-clustering data processing, clustering evaluation and validation are then discussed. Importantly, we also provide general guidance on clustering workflow and reporting requirements. To facilitate the implementation of different algorithms, we provide information on R functions and libraries

    Document Clustering

    In a world flooded with information, document clustering is an important tool that can help categorize and extract insight from text collections. It works by grouping similar documents, while simultaneously discriminating between groups. In this article, we provide a brief overview of the principal techniques used to cluster documents, and introduce a series of novel deep-learning based methods recently designed for the document clustering task. In our overview, we point the reader to salient works that can provide a deeper understanding of the topics discussed

    Median topographic maps for biomedical data sets

    Median clustering extends popular neural data analysis methods such as the self-organizing map or neural gas to general data structures given by a dissimilarity matrix only. This offers flexible and robust global data inspection methods which are particularly suited to the variety of data that occurs in biomedical domains. In this chapter, we give an overview of median clustering and its properties and extensions, with a particular focus on efficient implementations adapted to large-scale data analysis.

    A Comparative Study Of Fuzzy C-Means And K-Means Clustering Techniques

    Clustering analysis is considered a useful means of identifying patterns in datasets. The aim of this paper is to present a comparative study of two well-known clustering algorithms, namely fuzzy c-means (FCM) and k-means. First, we present an overview of both methods with emphasis on the implementation of each algorithm. Then, we use six datasets to measure the quality of the clustering results based on the similarity measure used in the algorithm and its representation of the clustering result. Next, we optimize the fuzzification variable m in the FCM algorithm in order to improve clustering performance. Finally, we compare the experimental performance of both methods.
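To make the role of the fuzzification variable m concrete, here is a minimal sketch of the standard FCM membership update for a single point, given its distances to the cluster centers. It is not the paper's implementation; the function name `fcm_memberships` and the example distances are assumptions, and the sketch assumes all distances are strictly positive.

```python
def fcm_memberships(distances, m=2.0):
    """Standard FCM membership of one point in each cluster.

    distances: strictly positive distances from the point to each center
               (assumption: the point does not coincide with any center).
    m:         fuzzification variable, must be greater than 1.
    """
    if m <= 1.0:
        raise ValueError("the fuzzifier m must be greater than 1")
    exponent = 2.0 / (m - 1.0)
    # u_i = 1 / sum_k (d_i / d_k)^(2 / (m - 1))
    return [
        1.0 / sum((d_i / d_k) ** exponent for d_k in distances)
        for d_i in distances
    ]


# Hypothetical point that is 1 unit from the first center and 3 from the second.
memberships_m2 = fcm_memberships([1.0, 3.0], m=2.0)
memberships_m5 = fcm_memberships([1.0, 3.0], m=5.0)

print(memberships_m2)
print(memberships_m5)
```

Memberships always sum to 1 for a point. A larger m spreads membership more evenly across clusters, while m close to 1 approaches the hard assignment of k-means, which is why tuning m can change clustering quality.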

    Multidimensional clustering approaches for pareto-frontiers

    In data mining, large and growing data sets are becoming increasingly common. To keep an overview of such data sets, preference queries are a popular way to reduce large quantities of data to highly relevant information. Combined with clustering methods like k-means, confusing sets of objects can be grouped and presented more clearly. In this report we present, on the one hand, Pareto dominance as a suitable and promising approach for clustering objects over better-than relationships. To meet a user's preferences, the final results can be tipped toward the more favored dimension when no decision for allocating an object is otherwise possible. On the other hand, we introduce an advanced clustering approach, based on Pareto dominance, that exploits the Borda social choice voting rule to handle distances over different domains with equal weights during the clustering process.
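The better-than relationship underlying Pareto dominance can be sketched in a few lines. This covers only the dominance test and the resulting Pareto frontier, not the report's clustering or Borda-based approach. The function names, the sample objects, and the convention that lower values are better in every dimension are assumptions made for illustration.

```python
def dominates(a, b):
    """True if a is at least as good as b in every dimension and strictly
    better in at least one (assumption: lower values are better)."""
    return all(x <= y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))


def pareto_front(objects):
    """Objects that are not dominated by any other object."""
    return [o for o in objects if not any(dominates(other, o) for other in objects)]


# Hypothetical two-dimensional objects, e.g. (price, distance).
objects = [(1, 5), (2, 2), (5, 1), (3, 3), (4, 4)]
front = pareto_front(objects)

print(front)
```

Here (3, 3) and (4, 4) are dominated by (2, 2), while the remaining three objects are mutually incomparable and form the frontier.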

    Utility-driven assessment of anonymized data via clustering

    In this study, clustering is conceived as an auxiliary tool to identify groups of special interest. This approach was applied to a real dataset concerning an entire Portuguese cohort of higher education Law students. Several anonymized clustering scenarios were compared against the original cluster solution. The clustering techniques were explored as data utility models in the context of data anonymization, using k-anonymity and (ε, δ)-differential privacy as privacy models. The purpose was to assess anonymized data utility by standard metrics, by the characteristics of the groups obtained, and by the relative risk (a relevant metric in social sciences research). For the sake of self-containment, we present an overview of anonymization and clustering methods. We used a partitional clustering algorithm and analyzed several clustering validity indices to understand to what extent the data structure is preserved, or not, after data anonymization. The results suggest that for low dimensionality/cardinality datasets the anonymization procedure easily jeopardizes the clustering endeavor. In addition, there is evidence that relevant field-of-study estimates obtained from anonymized data are biased.