2 research outputs found

    A common framework of partition-based clustering for large scale dataset using sampling and its MapReduce implementation

    Clustering is one of the significant tasks in data mining, and partition-based clustering algorithms such as k-means are among the popular solutions. However, with the growth of cloud computing and big data, large-scale datasets have become a major challenge for clustering: executing a clustering algorithm is too time-consuming, optimizing its parameters is difficult, and the quality of the resulting clusters is poor. To address this, we propose a common framework for partition-based clustering algorithms such as k-means and design its MapReduce implementation. Specifically, to deal with the representation of a large-scale dataset, we employ a sampling technique. Then, inspired by the k-means algorithm, we propose a common clustering procedure and provide a k-means-based implementation. Furthermore, we implement the proposed framework using the MapReduce programming model. Experiments show that our method is efficient for large-scale datasets.
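
The abstract above gives no algorithmic detail, so the following is only a minimal single-machine sketch of the general idea it names: cluster a random sample with k-means, then assign every point in the full dataset to its nearest centroid. It is not the authors' framework or their MapReduce implementation. The function names (`kmeans`, `sample_then_cluster`), the uniform random sampling, and the parameters `sample_size`, `iterations`, and `seed` are all assumptions made for illustration.

```python
import random


def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def nearest(point, centroids):
    """Index of the centroid closest to `point`."""
    return min(range(len(centroids)),
               key=lambda i: squared_distance(point, centroids[i]))


def kmeans(points, k, iterations=20, seed=0):
    """Plain Lloyd's k-means on a list of tuples (illustrative, not optimized)."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # initial centroids drawn from the data
    for _ in range(iterations):
        groups = [[] for _ in range(k)]
        for p in points:
            groups[nearest(p, centroids)].append(p)
        # Recompute each centroid as the mean of its group;
        # an empty group keeps its previous centroid.
        centroids = [
            tuple(sum(col) / len(g) for col in zip(*g)) if g else centroids[i]
            for i, g in enumerate(groups)
        ]
    return centroids


def sample_then_cluster(points, k, sample_size, seed=0):
    """Hypothetical helper: fit centroids on a sample, then label all points.

    Uniform sampling is an assumption here; the abstract does not say
    which sampling technique the paper uses.
    """
    rng = random.Random(seed)
    sample = rng.sample(points, min(sample_size, len(points)))
    centroids = kmeans(sample, k, seed=seed)
    # This per-point assignment is independent for each record, which is
    # the kind of step a map phase could perform in a MapReduce setting.
    labels = [nearest(p, centroids) for p in points]
    return centroids, labels
```

Calling `sample_then_cluster(points, k=2, sample_size=20)` on a list of coordinate tuples returns the fitted centroids and one cluster label per input point. Only the sample passes through the iterative step, so the full dataset is scanned once for the final assignment.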

    A Fuzzy Relational Clustering Algorithm based on a Dissimilarity Measure Extracted from Data

    No full text
    One of the critical aspects of clustering algorithms is the correct identification of the dissimilarity measure used to drive the partitioning of the data set. The dissimilarity measure induces the cluster shape and therefore determines the success of clustering algorithms. As cluster shapes change from one data set to another, dissimilarity measures should be extracted from the data. To this end, we exploit some pairs of points with known dissimilarity values to teach a dissimilarity relation to a feed-forward neural network. Then, we use the neural dissimilarity measure to guide an unsupervised relational clustering algorithm. Experiments on synthetic data sets and on the Iris data set show that the relational clustering algorithm based on the neural dissimilarity outperforms some popular clustering algorithms (with possible partial supervision) based on spatial dissimilarity.