977 research outputs found

    The k-means algorithm: A comprehensive survey and performance evaluation

    © 2020 by the authors. Licensee MDPI, Basel, Switzerland. The k-means clustering algorithm is considered one of the most powerful and popular data mining algorithms in the research community. However, despite its popularity, the algorithm has certain limitations, including problems associated with random initialization of the centroids, which leads to unexpected convergence. Additionally, such a clustering algorithm requires the number of clusters to be defined beforehand, which is responsible for different cluster shapes and outlier effects. A fundamental problem of the k-means algorithm is its inability to handle various data types. This paper provides a structured and synoptic overview of research conducted on the k-means algorithm to overcome such shortcomings. Variants of the k-means algorithm, including their recent developments, are discussed, and their effectiveness is investigated through experimental analysis on a variety of datasets. The detailed experimental analysis, along with a thorough comparison among different k-means clustering algorithms, differentiates our work from other existing survey papers. Furthermore, it outlines a clear and thorough understanding of the k-means algorithm along with its different research directions.
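The initialization problem mentioned in this abstract can be illustrated with a minimal sketch of Lloyd's k-means iteration in plain Python. This is not code from the surveyed paper: the function names, the four-point dataset, and the two sets of starting centroids are assumptions chosen only to show that different starting centroids can converge to different solutions.

```python
def squared_distance(p, q):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def lloyd_kmeans(points, initial_centroids, max_iter=100):
    """Run Lloyd's iteration from the given starting centroids.

    Returns (centroids, labels, inertia), where inertia is the sum of
    squared distances from each point to its assigned centroid.
    """
    centroids = [tuple(c) for c in initial_centroids]
    labels = []
    for _ in range(max_iter):
        # Assignment step: each point goes to its nearest centroid.
        labels = [
            min(range(len(centroids)),
                key=lambda j: squared_distance(p, centroids[j]))
            for p in points
        ]
        # Update step: each centroid moves to the mean of its members.
        new_centroids = []
        for j, old in enumerate(centroids):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                mean = tuple(sum(xs) / len(members) for xs in zip(*members))
                new_centroids.append(mean)
            else:
                # Keep an empty cluster's centroid where it was.
                new_centroids.append(old)
        if new_centroids == centroids:
            break  # converged: no centroid moved
        centroids = new_centroids
    inertia = sum(squared_distance(p, centroids[lab])
                  for p, lab in zip(points, labels))
    return centroids, labels, inertia


# Hypothetical data: the corners of a wide rectangle.
POINTS = [(0, 0), (0, 1), (4, 0), (4, 1)]

# Starting near the left and right edges recovers the two tight pairs.
good = lloyd_kmeans(POINTS, [(0, 0.5), (4, 0.5)])

# Starting on the bottom and top edges is already a fixed point, so the
# iteration stops at a much worse (locally optimal) partition.
bad = lloyd_kmeans(POINTS, [(2, 0), (2, 1)])
```

In this toy case the second run groups the points into bottom and top rows and never leaves that partition, which is the kind of behaviour that motivates the improved initialization schemes such surveys discuss.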

    Clustering Algorithm Based on Sparse Feature Vector without Specifying Parameter

    Parameter setting is an essential factor affecting algorithm performance in data mining techniques. CABOSFV is an efficient clustering algorithm that can cluster binary data with sparse features, but specifying its threshold parameter is challenging. To address this difficulty, a clustering algorithm based on sparse feature vector without specifying parameter (CASP) is proposed in this paper. A method for calculating an upper limit of the threshold is first defined to determine the threshold range. Furthermore, we use the sparseness index to sort the data and conduct the clustering process based on the adjusted sparse feature vector after data sorting. An interval search strategy is adopted to find a suitable threshold within the defined threshold range, and the clustering result obtained with the selected threshold is the outcome. Experiments on 7 UCI datasets demonstrate that the clustering results of the CASP algorithm are superior to those of other baselines in terms of both effectiveness and efficiency. CASP not only simplifies the parameter decision process, but also obtains desirable clustering results quickly and stably, which shows the practicability of the algorithm.

    Clustering heterogeneous categorical data using enhanced mini batch K-means with entropy distance measure

    Clustering methods in data mining aim to group a set of patterns based on their similarity. In survey data, heterogeneous information arises from various types of data scales, such as nominal, ordinal, binary, and Likert scales. Failing to treat heterogeneous data appropriately leads to loss of information and poor decision-making. Although many similarity measures have been established, solutions for heterogeneous data in clustering are still lacking. The recent entropy distance measure seems to provide good results for heterogeneous categorical data. However, it requires many experiments and evaluations. This article proposes a framework for heterogeneous categorical data using mini batch k-means with an entropy measure (MBKEM), in order to investigate the effectiveness of the similarity measure when clustering heterogeneous categorical data. Secondary data from a public survey was used. The findings demonstrate that the proposed framework improved clustering quality. MBKEM outperformed other clustering algorithms, with an accuracy of 0.88, v-measure (VM) of 0.82, adjusted Rand index (ARI) of 0.87, and Fowlkes-Mallows index (FMI) of 0.94. The observed average minimum elapsed time for cluster generation was 0.26 s. In the future, the proposed solution would be beneficial for improving the quality of clustering for heterogeneous categorical data problems in many domains.

    EnsCat: clustering of categorical data via ensembling

    Background: Clustering is a widely used collection of unsupervised learning techniques for identifying natural classes within a data set. It is often used in bioinformatics to infer population substructure. Genomic data are often categorical and high dimensional, e.g., long sequences of nucleotides. This makes inference challenging: the distance metric is often not well-defined on categorical data; running time for computations using high dimensional data can be considerable; and the Curse of Dimensionality often impedes the interpretation of the results. Up to the present, however, the literature and software addressing clustering for categorical data have not yet led to a standard approach. Results: We present software for an ensemble method that performs well in comparison with other methods regardless of the dimensionality of the data. In an ensemble method, a variety of instantiations of a statistical object are found and then combined into a consensus value. It has been known for decades that ensembling generally outperforms the components that comprise it in many settings. Here, we apply this ensembling principle to clustering. We begin by generating many hierarchical clusterings with different clustering sizes. When the dimension of the data is high, we also randomly select subspaces of variable size to generate clusterings. Then, we combine these clusterings into a single membership matrix and use this to obtain a new, ensembled dissimilarity matrix using Hamming distance. Conclusions: Ensemble clustering, as implemented in R and called EnsCat, gives more clearly separated clusters than other clustering techniques for categorical data. The latest version with manual and examples is available at https://github.com/jlp2duke/EnsCat
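The combination step described in this abstract (stacking base clusterings into a membership matrix and taking Hamming distances between observations) can be sketched in a few lines of Python. This is an illustrative sketch only, not the EnsCat R implementation: the function names are hypothetical, and the three base clusterings are hand-written labels standing in for the hierarchical clusterings the authors generate.

```python
def membership_matrix(clusterings):
    """Stack base clusterings so that rows are observations and
    columns are the label each base clustering gave that observation."""
    return [list(row) for row in zip(*clusterings)]


def hamming_dissimilarity(membership):
    """Pairwise Hamming distance between rows of the membership matrix:
    the fraction of base clusterings that put two observations in
    different clusters."""
    n = len(membership)
    m = len(membership[0])
    dist = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            disagreements = sum(
                a != b for a, b in zip(membership[i], membership[j])
            )
            dist[i][j] = dist[j][i] = disagreements / m
    return dist


# Hypothetical base clusterings of four observations. Labels are only
# compared within one clustering, so label names need not match across them.
BASE_CLUSTERINGS = [
    [0, 0, 1, 1],
    [0, 0, 0, 1],
    [0, 1, 2, 2],
]

M = membership_matrix(BASE_CLUSTERINGS)
D = hamming_dissimilarity(M)
```

The resulting matrix `D` could then be passed to any clustering routine that accepts a precomputed dissimilarity matrix to produce the final consensus partition; that last step is left out of this sketch.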

    New methods for discovering local behaviour in mixed databases

    Clustering techniques are widely used. In many applications, the goal is to automatically find groups or hidden information in a data set. One such application is building a model of a system by integrating several local models. A local model can have many structures; however, a linear structure is the most common one because of its simplicity. This work seeks improvements in several areas, all of which are applied to the task of finding a set of local models in a database. On the one hand, a way of encoding categorical information as numerical values has been designed, so that a numerical algorithm can be applied to the whole data set. On the other hand, a cost index has been developed, which is optimized globally to find the parameters of the local clusters that best define the output of the process. Each of the techniques has been applied in several experiments, and the results show improvements over current techniques. Barceló Rico, F. (2009). New methods for discovering local behaviour in mixed databases. http://hdl.handle.net/10251/12739

    Statistical Data Modeling and Machine Learning with Applications

    The modeling and processing of empirical data is one of the main subjects and goals of statistics. Nowadays, with the development of computer science, the extraction of useful and often hidden information and patterns from data sets of different volumes and complex data sets in warehouses has been added to these goals. New and powerful statistical techniques with machine learning (ML) and data mining paradigms have been developed. To one degree or another, all of these techniques and algorithms originate from a rigorous mathematical basis, including probability theory and mathematical statistics, operational research, mathematical analysis, numerical methods, etc. Popular ML methods, such as artificial neural networks (ANN), support vector machines (SVM), decision trees, random forest (RF), among others, have generated models that can be considered as straightforward applications of optimization theory and statistical estimation. The wide arsenal of classical statistical approaches combined with powerful ML techniques allows many challenging and practical problems to be solved. This Special Issue belongs to the section “Mathematics and Computer Science”. Its aim is to establish a brief collection of carefully selected papers presenting new and original methods, data analyses, case studies, comparative studies, and other research on the topic of statistical data modeling and ML as well as their applications. Particular attention is given, but is not limited, to theories and applications in diverse areas such as computer science, medicine, engineering, banking, education, sociology, economics, among others. The resulting palette of methods, algorithms, and applications for statistical modeling and ML presented in this Special Issue is expected to contribute to the further development of research in this area. We also believe that the new knowledge acquired here as well as the applied results are attractive and useful for young scientists, doctoral students, and researchers from various scientific specialties.