A Comparative Study of Efficient Initialization Methods for the K-Means Clustering Algorithm
K-means is undoubtedly the most widely used partitional clustering algorithm.
Unfortunately, due to its gradient descent nature, this algorithm is highly
sensitive to the initial placement of the cluster centers. Numerous
initialization methods have been proposed to address this problem. In this
paper, we first present an overview of these methods with an emphasis on their
computational efficiency. We then compare eight commonly used linear time
complexity initialization methods on a large and diverse collection of data
sets using various performance criteria. Finally, we analyze the experimental
results using non-parametric statistical tests and provide recommendations for
practitioners. We demonstrate that popular initialization methods often perform
poorly and that there are in fact strong alternatives to these methods. Comment: 17 pages, 1 figure, 7 tables
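
To make the sensitivity to initial centers concrete, here is a minimal sketch of one well-known seeding scheme, D²-weighted sampling in the style of k-means++. The abstract does not name the eight methods it compares, so this choice is an assumption made for illustration only, and the function names `sq_dist` and `kmeanspp_init` are hypothetical.

```python
import random


def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeanspp_init(points, k, seed=0):
    """Pick k initial centers by D^2-weighted sampling (k-means++ style).

    Illustrative sketch only: not taken from the paper summarized above.
    """
    rng = random.Random(seed)
    # First center: chosen uniformly at random.
    centers = [rng.choice(points)]
    while len(centers) < k:
        # Weight each point by its squared distance to the nearest center.
        weights = [min(sq_dist(p, c) for c in centers) for p in points]
        if sum(weights) == 0:
            # All points coincide with existing centers; fall back to uniform.
            centers.append(rng.choice(points))
        else:
            centers.append(rng.choices(points, weights=weights, k=1)[0])
    return centers


points = [(0.0, 0.0), (0.1, 0.0), (10.0, 10.0), (10.1, 10.0)]
centers = kmeanspp_init(points, k=2, seed=42)
```

Each new center costs one pass over the data, so seeding k centers takes time linear in the number of points, which is the efficiency class the abstract focuses on.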
An overview of clustering methods with guidelines for application in mental health research
Cluster analyses have been widely used in mental health research to decompose inter-individual heterogeneity
by identifying more homogeneous subgroups of individuals. However, despite advances in new algorithms and
increasing popularity, there is little guidance on model choice, analytical framework and reporting requirements.
In this paper, we aimed to address this gap by introducing the philosophy, design, advantages/disadvantages and
implementation of major algorithms that are particularly relevant in mental health research. Extensions of basic
models, such as kernel methods, deep learning, semi-supervised clustering, and clustering ensembles are subsequently
introduced. How to choose algorithms to address common issues as well as methods for pre-clustering
data processing, clustering evaluation and validation are then discussed. Importantly, we also provide general
guidance on clustering workflow and reporting requirements. To facilitate the implementation of different algorithms,
we provide information on R functions and libraries.
Document Clustering
In a world flooded with information, document clustering is an important tool that can help categorize and extract insight from text collections. It works by grouping similar documents, while simultaneously discriminating between groups. In this article, we provide a brief overview of the principal techniques used to cluster documents, and introduce a series of novel deep-learning based methods recently designed for the document clustering task. In our overview, we point the reader to salient works that can provide a deeper understanding of the topics discussed.
Median topographic maps for biomedical data sets
Median clustering extends popular neural data analysis methods such as the
self-organizing map or neural gas to general data structures given by a
dissimilarity matrix only. This offers flexible and robust global data
inspection methods which are particularly suited for a variety of data as
occurs in biomedical domains. In this chapter, we give an overview of median
clustering and its properties and extensions, with a particular focus on
efficient implementations adapted to large-scale data analysis.
A Comparative Study Of Fuzzy C-Means And K-Means Clustering Techniques
Clustering analysis has been considered a useful means of identifying patterns in a dataset. The aim of this paper is to present a comparative study of two well-known clustering algorithms, namely fuzzy c-means (FCM) and k-means. First, we present an overview of both methods with emphasis on the implementation of the algorithms. Then, we use six datasets to measure the quality of the clustering results based on the similarity measure used in each algorithm and its representation of the clustering result. Next, we also optimize the fuzzification parameter m of the FCM algorithm in order to improve the clustering performance. Finally, we compare the performance of the experimental results for both methods.
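
The role of the fuzzification parameter m can be illustrated with the standard FCM membership formula, u_i = 1 / Σ_k (d_i / d_k)^(2/(m-1)), where d_i is the distance from a point to center i. The sketch below assumes Euclidean distance and uses the hypothetical function name `fcm_memberships`; it is not the implementation from the paper summarized above.

```python
import math


def fcm_memberships(point, centers, m=2.0):
    """Fuzzy c-means membership of one point in each cluster.

    Assumes Euclidean distance and a fuzzifier m > 1 (illustrative sketch).
    """
    if m <= 1.0:
        raise ValueError("the fuzzifier m must be greater than 1")
    dists = [math.dist(point, c) for c in centers]
    # A point lying exactly on one or more centers belongs fully to them.
    zeros = [i for i, d in enumerate(dists) if d == 0.0]
    if zeros:
        return [1.0 / len(zeros) if i in zeros else 0.0
                for i in range(len(centers))]
    exponent = 2.0 / (m - 1.0)
    return [1.0 / sum((d_i / d_k) ** exponent for d_k in dists)
            for d_i in dists]


# A point at distance 1 from the first center and 3 from the second.
u = fcm_memberships((0.0,), [(1.0,), (3.0,)], m=2.0)
```

As m approaches 1 the memberships approach the hard 0/1 assignment of k-means, and larger values of m spread membership more evenly across clusters, which is why tuning m affects clustering quality.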
Multidimensional clustering approaches for pareto-frontiers
In data mining, large and growing data sets are becoming more and more common. To avoid losing the overview of these data sets, preference queries are a very popular method for reducing large quantities of data to highly relevant information. Combined with clustering methods such as k-means, confusing sets of objects can be grouped and presented more clearly in order to give a better overview. In this report we present, on the one hand, Pareto dominance as a very suitable and promising approach for clustering objects over better-than relationships. To meet a user's preferences, the final results can be tipped toward the more favored dimension when no decision on allocating an object is otherwise possible. On the other hand, building on Pareto dominance, we introduce an advanced clustering approach that exploits the Borda social choice voting rule to handle distances from different domains with equal weights during the clustering process.
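
The better-than relationship behind Pareto dominance can be sketched in a few lines. The example assumes that larger values are preferred in every dimension, and the function names `dominates` and `pareto_front` are hypothetical; it shows only the dominance test and the resulting frontier, not the report's clustering or Borda-based approach.

```python
def dominates(a, b):
    """True if a is at least as good as b everywhere and strictly better somewhere.

    Assumes larger values are preferred in every dimension.
    """
    return (all(x >= y for x, y in zip(a, b))
            and any(x > y for x, y in zip(a, b)))


def pareto_front(items):
    """Return the items that no other item dominates."""
    return [a for a in items if not any(dominates(b, a) for b in items)]


items = [(1, 5), (3, 3), (2, 2), (5, 1), (1, 1)]
front = pareto_front(items)
```

Here (2, 2) and (1, 1) are dominated by (3, 3), while the three remaining objects are mutually incomparable, which is the situation where a tie-breaking preference between dimensions becomes relevant.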
Utility-driven assessment of anonymized data via clustering
In this study, clustering is conceived as an auxiliary tool to identify groups of special interest. This
approach was applied to a real dataset concerning an entire Portuguese cohort of higher education Law
students. Several anonymized clustering scenarios were compared against the original cluster solution.
The clustering techniques were explored as data utility models in the context of data anonymization,
using k-anonymity and (ε, δ)-differential privacy as privacy models. The purpose was to assess anonymized
data utility by standard metrics, by the characteristics of the groups obtained, and the relative risk (a
relevant metric in social sciences research). For the sake of self-containment, we present an overview
of anonymization and clustering methods. We used a partitional clustering algorithm and analyzed
several clustering validity indices to understand to what extent the data structure is preserved, or not,
after data anonymization. The results suggest that for low dimensionality/cardinality datasets the
anonymization procedure easily jeopardizes the clustering endeavor. In addition, there is evidence that
relevant field-of-study estimates obtained from anonymized data are biased.
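
As a reminder of what k-anonymity requires, the sketch below checks that every combination of quasi-identifier values occurs in at least k records. The record fields and the function name `is_k_anonymous` are hypothetical, and the example data is invented for illustration; it does not reproduce the study's dataset or anonymization procedure.

```python
from collections import Counter


def is_k_anonymous(records, quasi_identifiers, k):
    """True if each quasi-identifier combination appears in at least k records."""
    groups = Counter(tuple(r[q] for q in quasi_identifiers) for r in records)
    return all(count >= k for count in groups.values())


# Invented toy records with generalized age bands and regions.
records = [
    {"age_band": "20-29", "region": "North", "grade": 14},
    {"age_band": "20-29", "region": "North", "grade": 11},
    {"age_band": "30-39", "region": "South", "grade": 16},
    {"age_band": "30-39", "region": "South", "grade": 12},
]
ok = is_k_anonymous(records, ["age_band", "region"], k=2)
```

Reaching a given k usually means generalizing or suppressing values, which coarsens the data that a clustering algorithm later sees.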