
    Parallelization of Partitioning Around Medoids (PAM) in K-Medoids Clustering on GPU

    K-medoids clustering is categorized as partitional clustering. K-medoids offers better results when dealing with outliers and arbitrary distance metrics, and in situations where the mean or median does not exist within the data. However, k-medoids suffers from high computational complexity. Partitioning Around Medoids (PAM) was developed to improve k-medoids clustering; it consists of build and swap steps and uses the entire dataset to find the best potential medoids, so PAM produces better medoids than other algorithms. This research proposes a parallelization of PAM k-medoids clustering on the GPU to reduce the computational time of PAM's swap step. The parallelization scheme utilizes shared memory, a reduction algorithm, and tuning of the thread-block configuration to maximize occupancy. Experimental results show that the proposed parallelized PAM k-medoids is faster than the CPU and Matlab implementations and is efficient for large datasets.
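The swap step that dominates PAM's cost can be sketched in a few lines. The NumPy sketch below (function names are ours, not the paper's GPU kernels) shows the exhaustive (medoid, non-medoid) exchange search that the paper offloads to the GPU:

```python
import numpy as np

def pam_swap_cost(D, medoids, m_out, x_in):
    """Total cost if medoid `m_out` were replaced by point `x_in`.

    D is a precomputed n-by-n pairwise distance matrix and `medoids`
    a list of current medoid indices.
    """
    candidate = [m for m in medoids if m != m_out] + [x_in]
    # every point is assigned to its nearest medoid
    return np.min(D[:, candidate], axis=1).sum()

def pam_swap_step(D, medoids):
    """One pass of PAM's SWAP: try every (medoid, non-medoid) exchange
    and keep the single swap that lowers the total cost the most."""
    n = D.shape[0]
    best_cost = np.min(D[:, medoids], axis=1).sum()
    best = None
    for m_out in medoids:
        for x_in in range(n):
            if x_in in medoids:
                continue
            c = pam_swap_cost(D, medoids, m_out, x_in)
            if c < best_cost:
                best_cost, best = c, (m_out, x_in)
    return best, best_cost
```

The doubly nested candidate loop is what makes SWAP expensive, and it is embarrassingly parallel: each (m_out, x_in) cost can be evaluated by an independent GPU thread, with a reduction to pick the minimum.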

    BanditPAM++: Faster k-medoids Clustering

    Clustering is a fundamental task in data science with wide-ranging applications. In k-medoids clustering, cluster centers must be actual datapoints and arbitrary distance metrics may be used; these features allow for greater interpretability of the cluster centers and the clustering of exotic objects, respectively. k-medoids clustering has recently grown in popularity due to the discovery of more efficient k-medoids algorithms. In particular, recent research has proposed BanditPAM, a randomized k-medoids algorithm with state-of-the-art complexity and clustering accuracy. In this paper, we present BanditPAM++, which accelerates BanditPAM via two algorithmic improvements, and is O(k) faster than BanditPAM in complexity and substantially faster than BanditPAM in wall-clock runtime. First, we demonstrate that BanditPAM has a special structure that allows the reuse of clustering information within each iteration. Second, we demonstrate that BanditPAM has additional structure that permits the reuse of information across different iterations. These observations inspire our proposed algorithm, BanditPAM++, which returns the same clustering solutions as BanditPAM but is often several times faster. For example, on the CIFAR10 dataset, BanditPAM++ returns the same results as BanditPAM but runs over 10× faster. Finally, we provide a high-performance C++ implementation of BanditPAM++, callable from Python and R, that may be of interest to practitioners, at https://github.com/motiwari/BanditPAM. Auxiliary code to reproduce all of our experiments via a one-line script is available at https://github.com/ThrunGroup/BanditPAM_plusplus_experiments. Comment: NeurIPS 202
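BanditPAM++'s actual speedups come from reusing bandit statistics within and across iterations, which the abstract only describes at a high level. As a toy illustration of why reuse pays off in medoid search, the sketch below caches an arbitrary metric so that repeated (point, medoid) pairs across candidate evaluations are computed only once (all names and numbers here are ours, not the paper's):

```python
from functools import lru_cache

def kmedoids_cost(points, medoids, dist):
    """Cost of assigning each point to its nearest medoid under `dist`."""
    return sum(min(dist(i, m) for m in medoids) for i in range(len(points)))

points = [0.0, 1.0, 9.0, 10.0, 20.0]
calls = [0]

@lru_cache(maxsize=None)
def dist(i, j):
    calls[0] += 1                      # count raw metric evaluations
    return abs(points[i] - points[j])  # any metric works; L1 here

# Evaluating several candidate medoid sets re-queries many of the
# same (point, medoid) pairs; the cache pays for each pair only once.
costs = [kmedoids_cost(points, m, dist) for m in ([0, 2], [0, 3], [1, 3])]
```

The three evaluations issue 30 distance lookups but only 20 distinct pairs, so a third of the metric work is saved even in this tiny example; BanditPAM++ exploits far richer structure than plain memoization.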

    Optimal interval clustering: Application to Bregman clustering and statistical mixture learning

    We present a generic dynamic programming method to compute the optimal clustering of n scalar elements into k pairwise disjoint intervals. This case includes 1D Euclidean k-means, k-medoids, k-medians, k-centers, etc. We extend the method to incorporate cluster size constraints and show how to choose the appropriate k by model selection. Finally, we illustrate and refine the method on two case studies: Bregman clustering and statistical mixture learning maximizing the complete likelihood. Comment: 10 pages, 3 figures
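The generic dynamic program the abstract describes can be sketched as follows; `interval_cost` is the pluggable per-interval objective (our naming), and the recurrence is the textbook O(k n^2) one, not necessarily the paper's refined version:

```python
def optimal_interval_clustering(xs, k, interval_cost):
    """Optimal partition of sorted xs into k contiguous intervals.

    Returns (total cost, list of (start, end) index bounds).
    interval_cost(i, j) scores the slice xs[i:j]; plugging in SSE gives
    1D k-means, sum of |x - median| gives 1D k-medians, max-min gives
    k-centers, etc.
    """
    n = len(xs)
    INF = float("inf")
    # dp[m][j] = best cost of splitting the first j points into m intervals
    dp = [[INF] * (n + 1) for _ in range(k + 1)]
    cut = [[0] * (n + 1) for _ in range(k + 1)]
    dp[0][0] = 0.0
    for m in range(1, k + 1):
        for j in range(m, n + 1):
            for i in range(m - 1, j):
                c = dp[m - 1][i] + interval_cost(i, j)
                if c < dp[m][j]:
                    dp[m][j], cut[m][j] = c, i
    # backtrack the interval boundaries
    bounds, j = [], n
    for m in range(k, 0, -1):
        i = cut[m][j]
        bounds.append((i, j))
        j = i
    return dp[k][n], bounds[::-1]

def sse(xs):
    """SSE of a slice: the 1D k-means per-interval objective."""
    mu = sum(xs) / len(xs)
    return sum((x - mu) ** 2 for x in xs)
```

Because the data are scalar and sorted, every cluster of an optimal solution is a contiguous interval, which is what makes the problem exactly solvable here while general k-means is NP-hard.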

    Partition Around Medoids Clustering on the Intel Xeon Phi Many-Core Coprocessor

    The paper touches upon the problem of implementing the Partition Around Medoids (PAM) clustering algorithm for the Intel Many Integrated Core architecture. PAM is a form of the well-known k-medoids clustering algorithm and is applied in various subject domains, e.g. bioinformatics, text analysis, intelligent transportation systems, etc. An optimized version of PAM for the Intel Xeon Phi coprocessor is introduced, in which OpenMP parallelization, loop vectorization, a tiling technique, and efficient distance-matrix computation for the Euclidean metric are used. Experimental results on different datasets confirm the efficiency of the proposed algorithm.
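Efficient Euclidean distance-matrix computation typically exploits the expansion ||a - b||^2 = ||a||^2 + ||b||^2 - 2 a·b, so that the bulk of the work becomes one matrix multiply that vectorizes and tiles well. A NumPy sketch of that idea (not the paper's actual Xeon Phi code) is:

```python
import numpy as np

def euclidean_distance_matrix(X):
    """All-pairs Euclidean distances for the rows of X, computed as
    ||a - b||^2 = ||a||^2 + ||b||^2 - 2 a.b so the dominant cost is a
    single matrix multiply (which maps well to SIMD/tiled hardware)."""
    sq = np.einsum("ij,ij->i", X, X)            # squared row norms
    D2 = sq[:, None] + sq[None, :] - 2.0 * (X @ X.T)
    np.maximum(D2, 0.0, out=D2)                 # clamp tiny negatives from rounding
    return np.sqrt(D2)
```

The clamp matters in practice: floating-point cancellation can leave small negative entries on the diagonal that would otherwise turn into NaNs under the square root.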

    Unsupervised clustering approach for network anomaly detection

    This paper describes the advantages of using the anomaly detection approach over the misuse detection technique in detecting unknown network intrusions or attacks. It also investigates the performance of various clustering algorithms when applied to anomaly detection. Five different clustering algorithms are used: k-Means, improved k-Means, k-Medoids, EM clustering and distance-based outlier detection. Our experiment shows that misuse detection techniques, implemented with four different classifiers (naïve Bayes, rule induction, decision tree and nearest neighbour), failed to detect network traffic containing a large number of unknown intrusions, where the highest accuracy was only 63.97% and the lowest false positive rate was 17.90%. On the other hand, the anomaly detection module showed promising results, where the distance-based outlier detection algorithm outperformed the other algorithms with an accuracy of 80.15%. The accuracy for EM clustering was 78.06%, for k-Medoids it was 76.71%, for improved k-Means it was 65.40% and for k-Means it was 57.81%. Unfortunately, our anomaly detection module produces a high false positive rate (more than 20%) for all four clustering algorithms. Therefore, our future work will focus more on reducing the false positive rate and improving the accuracy using more advanced machine learning techniques.
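The abstract does not spell out which distance-based outlier detector was used. As one common variant, the sketch below scores each point by the distance to its k-th nearest neighbour and flags the largest scores; the parameter names and the quantile threshold are our assumptions, not the paper's settings:

```python
import numpy as np

def knn_outlier_scores(X, k=3):
    """Score each row by its distance to its k-th nearest neighbour:
    points far from everything else get large scores (likely outliers)."""
    diff = X[:, None, :] - X[None, :, :]
    D = np.sqrt((diff ** 2).sum(axis=2))
    D_sorted = np.sort(D, axis=1)       # column 0 is the self-distance (0)
    return D_sorted[:, k]               # distance to the k-th neighbour

def flag_outliers(X, k=3, frac=0.1):
    """Flag the top `frac` fraction of scores as anomalies."""
    scores = knn_outlier_scores(X, k)
    cutoff = np.quantile(scores, 1.0 - frac)
    return scores >= cutoff
```

This is the unsupervised setting the paper targets: no attack labels are needed, which is why such detectors can catch intrusions that signature-based misuse detection misses.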

    Performance analysis in text clustering using k-means and k-medoids algorithms for Malay crime documents

    Few studies on text clustering for the Malay language have been conducted due to some limitations that need to be addressed. The purpose of this article is to compare the two clustering algorithms of k-means and k-medoids, using Euclidean distance similarity, to determine which method is best for clustering documents. Both algorithms are applied to 1000 documents pertaining to housebreaking crimes involving a variety of different modus operandi. Comparative results indicate that the k-means algorithm performed best at clustering the relevant documents, with a 78% accuracy rate. K-means clustering also achieves the best performance on cluster evaluation when comparing the average within-cluster distance to the k-medoids algorithm. However, k-medoids performs exceptionally well on the Davies-Bouldin index (DBI). Furthermore, the accuracy of k-means depends on the number of initial clusters, where the appropriate cluster number can be determined using the elbow method.
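The elbow method mentioned above can be sketched with a tiny Lloyd's k-means. The deterministic farthest-first seeding and the elbow criterion used here (largest drop-off ratio between successive inertia gains) are our simplifications, not the paper's exact procedure:

```python
import numpy as np

def farthest_first_init(X, k):
    """Deterministic farthest-first seeding (stands in for k-means++ here)."""
    C = [X[0]]
    for _ in range(k - 1):
        d = np.min([((X - c) ** 2).sum(axis=1) for c in C], axis=0)
        C.append(X[int(d.argmax())])
    return np.array(C)

def lloyd(X, k, iters=50):
    """Plain Lloyd's k-means; returns (labels, within-cluster SSE)."""
    C = farthest_first_init(X, k)
    for _ in range(iters):
        d = ((X[:, None, :] - C[None, :, :]) ** 2).sum(axis=2)
        lab = d.argmin(axis=1)
        C = np.array([X[lab == j].mean(axis=0) if (lab == j).any() else C[j]
                      for j in range(k)])
    d = ((X[:, None, :] - C[None, :, :]) ** 2).sum(axis=2)
    return lab, float(d.min(axis=1).sum())

def elbow_k(X, kmax=4):
    """Pick the k after which adding a cluster stops paying off."""
    inertias = [lloyd(X, k)[1] for k in range(1, kmax + 1)]
    gains = -np.diff(inertias)              # SSE saved by each extra cluster
    ratios = gains[:-1] / (gains[1:] + 1e-12)
    return int(np.argmax(ratios)) + 2, inertias
```

On data with clearly separated groups the inertia curve drops steeply up to the true cluster count and then flattens, which is the bend the elbow method looks for.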

    COMPARISON OF THE K-MEANS AND K-MEDOIDS ALGORITHMS FOR CLUSTERING THE AVERAGE INCREASE IN COVID-19 CASES BY CITY/REGENCY IN SOUTH SUMATRA PROVINCE

    The wide and rapid spread of Covid-19 in South Sumatra has negatively affected all sectors, such as health, employment and the economy. Given the government policy that groups Covid-19 handling regions into 4 zones, it needs to be evaluated whether this regional grouping is appropriate, using data-mining clustering techniques with the K-Means and K-Medoids algorithms. Test results show that the K-Means algorithm gives its best DBI value of 0.078 at K=2, while the K-Medoids algorithm gives its best DBI value of 0.250 at K=3. The conclusion is that the Covid-19 handling regions in South Sumatra province should be divided into 2 clusters (Palembang City and outside Palembang City) or into 3 clusters (Palembang City, near Palembang City, and far from Palembang City). Keywords: Covid-19, K-Means, K-Medoids, Clustering, DBI
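Both of the comparisons above rank clusterings by the Davies-Bouldin index (DBI), where lower is better. A minimal NumPy implementation of the standard definition looks like this:

```python
import numpy as np

def davies_bouldin(X, labels):
    """Davies-Bouldin index: average over clusters of the worst ratio
    (s_i + s_j) / d(c_i, c_j), where s_i is the mean distance of cluster
    i's points to its centroid and d(c_i, c_j) the centroid distance.
    Lower values mean tighter, better-separated clusters."""
    ks = np.unique(labels)
    cents = np.array([X[labels == k].mean(axis=0) for k in ks])
    scatter = np.array([
        np.linalg.norm(X[labels == k] - c, axis=1).mean()
        for k, c in zip(ks, cents)
    ])
    ratios = []
    for i in range(len(ks)):
        worst = max(
            (scatter[i] + scatter[j]) / np.linalg.norm(cents[i] - cents[j])
            for j in range(len(ks)) if j != i
        )
        ratios.append(worst)
    return float(np.mean(ratios))
```

Because the index trades off within-cluster scatter against between-centroid separation, it gives each candidate K a single comparable score, which is how the studies above choose between K=2 and K=3.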

    Comparison of Hybrid Genetic K-Means++ and Hybrid Genetic K-Medoids for Clustering the EEG Eyestate Dataset

    K-Means++ and K-Medoids are data clustering methods. Clustering speed is determined by the iteration count: the lower the iteration count, the faster the clustering completes. Clustering performance can be optimized to obtain more optimal clustering results. One algorithm that can optimize clustering speed is the Genetic Algorithm (GA). The dataset used in this study is the EEG Eyestate dataset. After hybridization with GA, the average iteration count for K-Means++ decreased from 11.6 to 5.15, and for K-Medoids it decreased from 5.9 to 5.2. Based on this comparison of GA K-Means++ and GA K-Medoids iterations, it can be concluded that GA K-Means++ is better.