Dynamic load balancing in parallel KD-tree k-means
One of the most influential and popular data mining methods is the k-Means algorithm for cluster analysis. Techniques for improving the efficiency of k-Means have been explored largely in two main directions. The amount of computation can be significantly reduced by adopting geometrical constraints and an efficient data structure, notably a multidimensional binary search tree (KD-Tree); these techniques reduce the number of distance computations the algorithm performs at each iteration. A second direction is parallel processing, where data and computation loads are distributed over many processing nodes. However, little work has been done to provide a parallel formulation of the efficient sequential techniques based on KD-Trees. Such approaches are expected to have an irregular distribution of computation load and can suffer from load imbalance. This issue has so far limited the adoption of these efficient k-Means variants in parallel computing environments. In this work, we provide a parallel formulation of the KD-Tree based k-Means algorithm for distributed memory systems and address its load balancing issue. Three solutions have been developed and tested: two approaches are based on a static partitioning of the data set, and a third incorporates a dynamic load balancing policy.
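The KD-Tree technique the abstract refers to can be illustrated, in much simplified form, by a k-Means step that builds a KD-tree over the current centroids so that each point's nearest centre is found by a tree query instead of explicit distance computations against all k centroids. This is only an illustrative sequential sketch, not the authors' implementation; the function name `kdtree_kmeans` and the use of SciPy's `cKDTree` are our assumptions:

```python
import numpy as np
from scipy.spatial import cKDTree

def kdtree_kmeans(points, k, n_iter=20, seed=0):
    """Sequential k-Means sketch using a KD-tree over the centroids.

    Hypothetical illustration of the KD-tree idea from the abstract;
    the paper's contribution is its parallel, load-balanced formulation.
    """
    rng = np.random.default_rng(seed)
    centroids = points[rng.choice(len(points), k, replace=False)].astype(float)
    labels = np.zeros(len(points), dtype=int)
    for _ in range(n_iter):
        # KD-tree over the centroids: each point's nearest centre is
        # found by a tree query rather than k explicit distance checks
        tree = cKDTree(centroids)
        _, labels = tree.query(points)
        for j in range(k):
            members = points[labels == j]
            if len(members):
                centroids[j] = members.mean(axis=0)
    return centroids, labels
```

In the sequential setting the tree prunes distance computations; the load-imbalance problem the abstract discusses arises once the tree traversal work is split across processing nodes.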
A Parallel Fuzzy C-Mean algorithm for Image Segmentation
This paper proposes a parallel Fuzzy C-Mean (FCM) algorithm for image segmentation. The sequential FCM algorithm is computationally intensive and has significant memory requirements. For many applications that deal with large images, such as medical image segmentation and geographical image analysis, sequential FCM is very slow. In our parallel FCM algorithm, dividing the computations among the processors and minimizing the need for accessing secondary storage enhance the performance and efficiency of the image segmentation task as compared to the sequential algorithm.
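For reference, the sequential FCM iteration that such a parallelization decomposes alternates between recomputing cluster centres from fuzzy memberships and recomputing memberships from distances. A minimal sequential sketch follows; the function name and NumPy vectorization are our assumptions, and the paper's actual contribution (the parallel decomposition across processors) is not shown:

```python
import numpy as np

def fuzzy_c_means(X, c, m=2.0, n_iter=50, eps=1e-9, seed=0):
    """Minimal sequential Fuzzy C-Means sketch.

    X is an (n, d) data matrix, c the number of clusters, m > 1 the
    fuzzifier. Returns the centres and the (c, n) membership matrix U,
    whose columns sum to 1.
    """
    rng = np.random.default_rng(seed)
    U = rng.random((c, len(X)))
    U /= U.sum(axis=0)  # memberships of each point sum to 1
    for _ in range(n_iter):
        Um = U ** m
        # centres are membership-weighted means of the data
        centres = (Um @ X) / Um.sum(axis=1, keepdims=True)
        # distance of every point to every centre, shape (c, n)
        d = np.linalg.norm(X[None, :, :] - centres[:, None, :], axis=2) + eps
        # standard FCM membership update: u_ij proportional to d_ij^(-2/(m-1))
        inv = d ** (-2.0 / (m - 1))
        U = inv / inv.sum(axis=0)
    return centres, U
```

The two inner steps (weighted means and the membership update) are both point-parallel, which is what makes distributing the rows of `X` across processors natural.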
Clustering Without Knowing How To: Application and Evaluation
Crowdsourcing allows running simple human intelligence tasks on a large crowd of workers, enabling the solution of problems for which it is difficult to formulate an algorithm or train a machine learning model in reasonable time. One such problem is data clustering by an under-specified criterion that is simple for humans but difficult for machines. In this demonstration paper, we build a crowdsourced system for image clustering and release its code under a free license at https://github.com/Toloka/crowdclustering. Our experiments on two different image datasets, dresses from Zalando's FEIDEGGER and shoes from the Toloka Shoes Dataset, confirm that one can obtain meaningful clusters with no machine learning algorithms, purely with crowdsourcing. Comment: accepted at ECIR 2023 Demonstration Track.
The Simulation Model Partitioning Problem: an Adaptive Solution Based on Self-Clustering (Extended Version)
This paper is about partitioning in parallel and distributed simulation, that is, decomposing the simulation model into a number of components and properly allocating them to the execution units. An adaptive solution based on self-clustering, which considers both communication reduction and computational load balancing, is proposed. The implementation of the proposed mechanism is tested using a simulation model that is challenging in terms of both structure and dynamicity. Various configurations of the simulation model and the execution environment have been considered, and the obtained performance results are analyzed using a reference cost model. The results demonstrate that the proposed approach is promising and can reduce the simulation execution time on both parallel and distributed architectures.
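The partitioning problem itself (assign model components to execution units so that computational load stays balanced while inter-unit communication stays low) can be illustrated with a deliberately simple greedy heuristic. This is not the paper's adaptive self-clustering mechanism, only a hypothetical static baseline; the inputs `loads` (per-component computational cost) and `comm` (pairwise message volume) are assumed for the sake of the sketch:

```python
def greedy_partition(components, loads, comm, n_units):
    """Greedy static partitioning sketch (hypothetical baseline).

    components: list of component ids
    loads:      dict id -> computational load
    comm:       dict (id_a, id_b) -> message volume between a and b
    Places the heaviest components first, charging each candidate unit
    the resulting load plus the communication to already-placed
    neighbours that sit on other units.
    """
    unit_load = [0.0] * n_units
    placement = {}
    for comp in sorted(components, key=lambda c: -loads[c]):
        best, best_cost = None, None
        for u in range(n_units):
            # communication cost to neighbours already placed elsewhere;
            # unplaced neighbours are optimistically treated as local
            remote = sum(v for (a, b), v in comm.items()
                         if (a == comp and placement.get(b, u) != u)
                         or (b == comp and placement.get(a, u) != u))
            cost = unit_load[u] + loads[comp] + remote
            if best_cost is None or cost < best_cost:
                best, best_cost = u, cost
        placement[comp] = best
        unit_load[best] += loads[comp]
    return placement
```

A static heuristic like this cannot react to the dynamicity the abstract highlights; the paper's adaptive mechanism instead re-clusters components at runtime.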
Scalability of efficient parallel K-Means
Clustering is defined as the grouping of similar items in a set and is an important process within the field of data mining. As the amount of data for various applications continues to increase, in terms of both size and dimensionality, efficient clustering methods are necessary. A popular clustering algorithm is K-Means, which adopts a greedy approach to produce a set of K clusters with associated centres of mass, and uses a squared-error distortion measure to determine convergence. Methods for improving the efficiency of K-Means have been explored largely in two main directions. The amount of computation can be significantly reduced by adopting a more efficient data structure, notably a multi-dimensional binary search tree (KD-Tree), to store either centroids or data points. A second direction is parallel processing, where data and computation loads are distributed over many processing nodes. However, little work has been done to provide a parallel formulation of the efficient sequential techniques based on KD-Trees. Such approaches are expected to have an irregular distribution of computation load and can suffer from load imbalance. This issue has so far limited the adoption of these efficient K-Means techniques in parallel computational environments. In this work, we provide a parallel formulation for the KD-Tree based K-Means algorithm and address its load balancing issues.
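The squared-error distortion convergence test mentioned in the abstract (stop once the total squared distance from points to their assigned centres stops decreasing) can be sketched as follows. The function name and tolerance are illustrative assumptions, not the authors' code:

```python
import numpy as np

def kmeans(X, k, tol=1e-6, seed=0):
    """Plain K-Means sketch with a squared-error distortion stopping rule."""
    rng = np.random.default_rng(seed)
    centres = X[rng.choice(len(X), k, replace=False)].astype(float)
    prev = np.inf
    while True:
        # squared distance of every point to every centre, shape (n, k)
        d2 = ((X[:, None, :] - centres[None, :, :]) ** 2).sum(axis=2)
        labels = d2.argmin(axis=1)
        # squared-error distortion: sum of each point's distance
        # to its nearest centre
        distortion = d2[np.arange(len(X)), labels].sum()
        if prev - distortion < tol:  # converged: distortion stopped improving
            return centres, labels, distortion
        prev = distortion
        for j in range(k):
            members = X[labels == j]
            if len(members):
                centres[j] = members.mean(axis=0)
```

Because the distortion is non-increasing across K-Means iterations, this loop is guaranteed to terminate; the KD-Tree variants discussed above compute the same assignments while pruning most of the pairwise distance work.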