Dynamic load balancing in parallel KD-tree k-means
One of the most influential and popular data mining methods is the k-Means algorithm for cluster analysis. Techniques for improving the efficiency of k-Means have been explored largely in two main directions. The amount of computation can be significantly reduced by adopting geometrical constraints and an efficient data structure, notably a multidimensional binary search tree (KD-Tree); these techniques reduce the number of distance computations the algorithm performs at each iteration. A second direction is parallel processing, where data and computation loads are distributed over many processing nodes. However, little work has been done to provide a parallel formulation of the efficient sequential techniques based on KD-Trees. Such approaches are expected to have an irregular distribution of computation load and can suffer from load imbalance. This issue has so far limited the adoption of these efficient k-Means variants in parallel computing environments. In this work, we provide a parallel formulation of the KD-Tree based k-Means algorithm for distributed memory systems and address its load balancing issue. Three solutions have been developed and tested: two approaches are based on a static partitioning of the data set, and a third incorporates a dynamic load balancing policy.
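The KD-Tree technique the abstract refers to can be illustrated, in much simplified form, by a k-Means step that builds a KD-tree over the current centroids so that each point's nearest centre is found by a tree query instead of explicit distance computations against all k centroids. This is only an illustrative sequential sketch, not the authors' implementation; the function name `kdtree_kmeans` and the use of SciPy's `cKDTree` are our assumptions:

```python
import numpy as np
from scipy.spatial import cKDTree

def kdtree_kmeans(points, k, n_iter=20, seed=0):
    """Sequential k-Means sketch using a KD-tree over the centroids.

    Hypothetical illustration of the KD-tree idea from the abstract;
    the paper's contribution is its parallel, load-balanced formulation.
    """
    rng = np.random.default_rng(seed)
    centroids = points[rng.choice(len(points), k, replace=False)].astype(float)
    labels = np.zeros(len(points), dtype=int)
    for _ in range(n_iter):
        # KD-tree over the centroids: each point's nearest centre is
        # found by a tree query rather than k explicit distance checks
        tree = cKDTree(centroids)
        _, labels = tree.query(points)
        for j in range(k):
            members = points[labels == j]
            if len(members):
                centroids[j] = members.mean(axis=0)
    return centroids, labels
```

In the sequential setting the tree prunes distance computations; the load-imbalance problem the abstract discusses arises once the tree traversal work is split across processing nodes.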
A Parallel Fuzzy C-Mean algorithm for Image Segmentation
This paper proposes a parallel Fuzzy C-Mean (FCM) algorithm for image segmentation. The sequential FCM algorithm is computationally intensive and has significant memory requirements. For many applications that deal with large images, such as medical image segmentation and geographical image analysis, sequential FCM is very slow. In our parallel FCM algorithm, dividing the computations among the processors and minimizing the need for accessing secondary storage enhance the performance and efficiency of the image segmentation task as compared to the sequential algorithm.
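For reference, the sequential FCM iteration that such a parallelization decomposes alternates between recomputing cluster centres from fuzzy memberships and recomputing memberships from distances. A minimal sequential sketch follows; the function name and NumPy vectorization are our assumptions, and the paper's actual contribution (the parallel decomposition across processors) is not shown:

```python
import numpy as np

def fuzzy_c_means(X, c, m=2.0, n_iter=50, eps=1e-9, seed=0):
    """Minimal sequential Fuzzy C-Means sketch.

    X is an (n, d) data matrix, c the number of clusters, m > 1 the
    fuzzifier. Returns the centres and the (c, n) membership matrix U,
    whose columns sum to 1.
    """
    rng = np.random.default_rng(seed)
    U = rng.random((c, len(X)))
    U /= U.sum(axis=0)  # memberships of each point sum to 1
    for _ in range(n_iter):
        Um = U ** m
        # centres are membership-weighted means of the data
        centres = (Um @ X) / Um.sum(axis=1, keepdims=True)
        # distance of every point to every centre, shape (c, n)
        d = np.linalg.norm(X[None, :, :] - centres[:, None, :], axis=2) + eps
        # standard FCM membership update: u_ij proportional to d_ij^(-2/(m-1))
        inv = d ** (-2.0 / (m - 1))
        U = inv / inv.sum(axis=0)
    return centres, U
```

The two inner steps (weighted means and the membership update) are both point-parallel, which is what makes distributing the rows of `X` across processors natural.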
Clustering Without Knowing How To: Application and Evaluation
Crowdsourcing allows running simple human intelligence tasks on a large crowd of workers, enabling the solution of problems for which it is difficult to formulate an algorithm or train a machine learning model in reasonable time. One such problem is data clustering by an under-specified criterion that is simple for humans but difficult for machines. In this demonstration paper, we build a crowdsourced system for image clustering and release its code under a free license at https://github.com/Toloka/crowdclustering. Our experiments on two different image datasets, dresses from Zalando's FEIDEGGER and shoes from the Toloka Shoes Dataset, confirm that one can obtain meaningful clusters with no machine learning algorithms, purely with crowdsourcing. Comment: accepted at ECIR 2023 Demonstration Track.
The Simulation Model Partitioning Problem: an Adaptive Solution Based on Self-Clustering (Extended Version)
This paper is about partitioning in parallel and distributed simulation, that is, decomposing the simulation model into a number of components and properly allocating them to the execution units. An adaptive solution based on self-clustering, which considers both communication reduction and computational load balancing, is proposed. The implementation of the proposed mechanism is tested using a simulation model that is challenging in terms of both structure and dynamicity. Various configurations of the simulation model and the execution environment have been considered, and the obtained performance results are analyzed using a reference cost model. The results demonstrate that the proposed approach is promising and can reduce the simulation execution time on both parallel and distributed architectures.
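The partitioning problem itself (assign model components to execution units so that computational load stays balanced while inter-unit communication stays low) can be illustrated with a deliberately simple greedy heuristic. This is not the paper's adaptive self-clustering mechanism, only a hypothetical static baseline; the inputs `loads` (per-component computational cost) and `comm` (pairwise message volume) are assumed for the sake of the sketch:

```python
def greedy_partition(components, loads, comm, n_units):
    """Greedy static partitioning sketch (hypothetical baseline).

    components: list of component ids
    loads:      dict id -> computational load
    comm:       dict (id_a, id_b) -> message volume between a and b
    Places the heaviest components first, charging each candidate unit
    the resulting load plus the communication to already-placed
    neighbours that sit on other units.
    """
    unit_load = [0.0] * n_units
    placement = {}
    for comp in sorted(components, key=lambda c: -loads[c]):
        best, best_cost = None, None
        for u in range(n_units):
            # communication cost to neighbours already placed elsewhere;
            # unplaced neighbours are optimistically treated as local
            remote = sum(v for (a, b), v in comm.items()
                         if (a == comp and placement.get(b, u) != u)
                         or (b == comp and placement.get(a, u) != u))
            cost = unit_load[u] + loads[comp] + remote
            if best_cost is None or cost < best_cost:
                best, best_cost = u, cost
        placement[comp] = best
        unit_load[best] += loads[comp]
    return placement
```

A static heuristic like this cannot react to the dynamicity the abstract highlights; the paper's adaptive mechanism instead re-clusters components at runtime.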
Scalability of efficient parallel K-Means
Clustering is defined as the grouping of similar items in a set and is an important process within the field of data mining. As the amount of data for various applications continues to increase, in terms of both size and dimensionality, efficient clustering methods are necessary. A popular clustering algorithm is K-Means, which adopts a greedy approach to produce a set of K clusters with associated centres of mass, and uses a squared-error distortion measure to determine convergence. Methods for improving the efficiency of K-Means have been explored largely in two main directions. The amount of computation can be significantly reduced by adopting a more efficient data structure, notably a multi-dimensional binary search tree (KD-Tree), to store either centroids or data points. A second direction is parallel processing, where data and computation loads are distributed over many processing nodes. However, little work has been done to provide a parallel formulation of the efficient sequential techniques based on KD-Trees. Such approaches are expected to have an irregular distribution of computation load and can suffer from load imbalance. This issue has so far limited the adoption of these efficient K-Means techniques in parallel computational environments. In this work, we provide a parallel formulation for the KD-Tree based K-Means algorithm and address its load balancing issues.
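The squared-error distortion convergence test mentioned in the abstract (stop once the total squared distance from points to their assigned centres stops decreasing) can be sketched as follows. The function name and tolerance are illustrative assumptions, not the authors' code:

```python
import numpy as np

def kmeans(X, k, tol=1e-6, seed=0):
    """Plain K-Means sketch with a squared-error distortion stopping rule."""
    rng = np.random.default_rng(seed)
    centres = X[rng.choice(len(X), k, replace=False)].astype(float)
    prev = np.inf
    while True:
        # squared distance of every point to every centre, shape (n, k)
        d2 = ((X[:, None, :] - centres[None, :, :]) ** 2).sum(axis=2)
        labels = d2.argmin(axis=1)
        # squared-error distortion: sum of each point's distance
        # to its nearest centre
        distortion = d2[np.arange(len(X)), labels].sum()
        if prev - distortion < tol:  # converged: distortion stopped improving
            return centres, labels, distortion
        prev = distortion
        for j in range(k):
            members = X[labels == j]
            if len(members):
                centres[j] = members.mean(axis=0)
```

Because the distortion is non-increasing across K-Means iterations, this loop is guaranteed to terminate; the KD-Tree variants discussed above compute the same assignments while pruning most of the pairwise distance work.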