234 research outputs found
Linear, Deterministic, and Order-Invariant Initialization Methods for the K-Means Clustering Algorithm
Over the past five decades, k-means has become the clustering algorithm of
choice in many application domains primarily due to its simplicity, time/space
efficiency, and invariance to the ordering of the data points. Unfortunately,
the algorithm's sensitivity to the initial selection of the cluster centers
remains to be its most serious drawback. Numerous initialization methods have
been proposed to address this drawback. Many of these methods, however, have
time complexity superlinear in the number of data points, which makes them
impractical for large data sets. On the other hand, linear methods are often
random and/or sensitive to the ordering of the data points. These methods are
generally unreliable in that the quality of their results is unpredictable.
Therefore, it is common practice to perform multiple runs of such methods and
take the output of the run that produces the best results. Such a practice,
however, greatly increases the computational requirements of the otherwise
highly efficient k-means algorithm. In this chapter, we investigate the
empirical performance of six linear, deterministic (non-random), and
order-invariant k-means initialization methods on a large and diverse
collection of data sets from the UCI Machine Learning Repository. The results
demonstrate that two relatively unknown hierarchical initialization methods due
to Su and Dy outperform the remaining four methods with respect to two
objective effectiveness criteria. In addition, a recent method due to Erisoglu
et al. performs surprisingly poorly.Comment: 21 pages, 2 figures, 5 tables, Partitional Clustering Algorithms
(Springer, 2014). arXiv admin note: substantial text overlap with
arXiv:1304.7465, arXiv:1209.196
Adaptive firefly algorithm for hierarchical text clustering
Text clustering is essentially used by search engines to increase the recall and precision in information retrieval. As search engine operates on Internet content that is constantly being updated, there is a need for a clustering algorithm that offers automatic grouping of items without prior knowledge on the collection. Existing clustering methods have problems in determining optimal number of clusters and producing compact clusters. In this research, an adaptive hierarchical text clustering
algorithm is proposed based on Firefly Algorithm. The proposed Adaptive Firefly Algorithm (AFA) consists of three components: document clustering, cluster refining, and cluster merging. The first component introduces Weight-based Firefly Algorithm (WFA) that automatically identifies initial centers and their clusters for any given text collection. In order to refine the obtained clusters, a second algorithm, termed as Weight-based Firefly Algorithm with Relocate (WFAR), is proposed. Such an approach allows the relocation of a pre-assigned document into a newly created cluster. The third component, Weight-based Firefly Algorithm with Relocate and
Merging (WFARM), aims to reduce the number of produced clusters by merging nonpure clusters into the pure ones. Experiments were conducted to compare the proposed algorithms against seven existing methods. The percentage of success in
obtaining optimal number of clusters by AFA is 100% with purity and f-measure of 83% higher than the benchmarked methods. As for entropy measure, the AFA produced the lowest value (0.78) when compared to existing methods. The result indicates that Adaptive Firefly Algorithm can produce compact clusters. This research contributes to the text mining domain as hierarchical text clustering
facilitates the indexing of documents and information retrieval processes
Development of a R package to facilitate the learning of clustering techniques
This project explores the development of a tool, in the form of a R package, to ease the process of
learning clustering techniques, how they work and what their pros and cons are. This tool should provide
implementations for several different clustering techniques with explanations in order to allow the student
to get familiar with the characteristics of each algorithm by testing them against several different datasets
while deepening their understanding of them through the explanations. Additionally, these explanations
should adapt to the input data, making the tool not only adept for self-regulated learning but for teaching
too.Grado en IngenierĂa Informátic
Hybrid Clustering with Application to Web Pages
This project explains the process of clustering web pages. With the immense increase in the number of web pages available on the internet, it has become difficult to search for web pages. The clustering of web pages will improve the presentation of web pages to the user and saves the time spent on searching web pages. Various clustering techniques have been proposed by various research scientists to cluster the web pages, but all the techniques suggested have some drawbacks. Since there is lot of scope for further improvement in the field of clustering, the system proposed in this report takes the clustering of web pages a step ahead. The proposed system use the queries from the user and get the results from search engine, then processes the results and provides the final result clusters to users
- …