2 research outputs found

A common framework of partition-based clustering for large scale dataset using sampling and its MapReduce implementation

Authors: Chunhai Kou, Ran Jin, Ruijuan Liu, Tao Guo
Publication venue: 'Mechanical Engineering Faculty in Slavonski Brod'
Publication date: 01/01/2016

Clustering is one of the significant tasks in data mining, and partition-based clustering algorithms such as k-means are among the popular solutions. However, with the growth of cloud computing and big data, large-scale datasets have become a major challenge for clustering: the execution of clustering algorithms is too time-consuming, parameter optimization is difficult, and cluster quality is poor. To this end, in this paper we propose a common framework for partition-based clustering algorithms such as k-means, and design its MapReduce implementation. Specifically, to deal with the representation of large-scale datasets, we propose employing a sampling technique. Then, inspired by the k-means algorithm, we propose a common clustering procedure and provide a k-means-based implementation. Furthermore, we implement the proposed framework using the MapReduce programming model. Experiments show that our method is efficient for large-scale datasets.
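The abstract does not spell out the sampling scheme or how the MapReduce jobs are laid out, so the following is only a minimal single-process sketch of the general idea, not the authors' implementation. It assumes uniform random sampling, fits k-means on the sample with an assignment step written as a map phase and a centroid update written as a reduce phase, and then labels the full dataset in one pass. All function names (`map_phase`, `reduce_phase`, `sampled_kmeans`) and parameters are hypothetical.

```python
import random
from collections import defaultdict


def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def nearest(point, centroids):
    """Index of the centroid closest to `point`."""
    return min(range(len(centroids)), key=lambda i: sq_dist(point, centroids[i]))


def map_phase(points, centroids):
    """Map step: emit (cluster index, point) for every input point."""
    for p in points:
        yield nearest(p, centroids), p


def reduce_phase(pairs, centroids):
    """Reduce step: average the points emitted under each cluster index."""
    groups = defaultdict(list)
    for idx, p in pairs:
        groups[idx].append(p)
    updated = list(centroids)  # a cluster that received no points keeps its centroid
    for idx, members in groups.items():
        updated[idx] = tuple(sum(c) / len(members) for c in zip(*members))
    return updated


def sampled_kmeans(points, k, sample_size, iterations=20, seed=0):
    """Fit k-means on a uniform random sample, then label every point.

    Uniform sampling is an assumption of this sketch, not a detail
    taken from the paper.
    """
    rng = random.Random(seed)
    sample = rng.sample(points, min(sample_size, len(points)))
    centroids = rng.sample(sample, k)
    for _ in range(iterations):
        updated = reduce_phase(map_phase(sample, centroids), centroids)
        if updated == centroids:  # assignments stopped changing
            break
        centroids = updated
    # A single pass over the full dataset assigns each point to a centroid.
    labels = [nearest(p, centroids) for p in points]
    return centroids, labels
```

In a real MapReduce deployment the map and reduce functions would run across workers and only the centroids would be broadcast between iterations; here they are plain Python functions so the control flow stays visible.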

HRČAK - Portal of Croatian Scientific and Professional Journals

Hrčak - Portal of scientific journals of Croatia

A Fuzzy Relational Clustering Algorithm based on a Dissimilarity Measure Extracted from Data

Authors: Corsini P, Lazzerini B, Marcelloni F
Publication venue: 'Institute of Electrical and Electronics Engineers (IEEE)'
Publication date: 01/01/2004

One of the critical aspects of clustering algorithms is the correct identification of the dissimilarity measure used to drive the partitioning of the data set. The dissimilarity measure induces the cluster shape and therefore determines the success of clustering algorithms. As cluster shapes change from one data set to another, dissimilarity measures should be extracted from the data. To this end, we use some pairs of points with known dissimilarity values to teach a dissimilarity relation to a feed-forward neural network. Then, we use the neural dissimilarity measure to guide an unsupervised relational clustering algorithm. Experiments on synthetic data sets and on the Iris data set show that the relational clustering algorithm based on the neural dissimilarity outperforms some popular clustering algorithms (with possible partial supervision) based on spatial dissimilarity.

Archivio della Ricerca - Università di Pisa