A Distributed and Approximated Nearest Neighbors Algorithm for an Efficient Large Scale Mean Shift Clustering

Authors: Azzag Hanane; Beck Gaël; Cérin Christophe; Duong Tarn; Lebbah Mustapha
Publication date: 11/02/2019

In this paper we target the class of modal clustering methods, in which clusters are defined in terms of the local modes of the probability density function that generates the data. The best-known modal clustering method is k-means clustering. Mean Shift clustering is a generalization of k-means clustering that computes arbitrarily shaped clusters, defined as the basins of attraction to the local modes created by the density gradient ascent paths. Despite its potential, the Mean Shift approach is a computationally expensive method for unsupervised learning. We therefore introduce two contributions that aim to provide clustering algorithms with linear time complexity, as opposed to the quadratic time complexity of exact Mean Shift clustering. First, we propose a scalable procedure to approximate the density gradient ascent. Second, we present a scalable cluster labeling technique. Both are based on Locality Sensitive Hashing (LSH) to approximate nearest neighbors. These two techniques may be used for moderate-sized datasets. Furthermore, we show that using our proposed approximations of the density gradient ascent as a pre-processing step in other clustering methods can also improve dedicated classification metrics. For the latter, we propose a distributed implementation written for the Spark/Scala ecosystem. For all the clustering methods considered, we present experimental results illustrating their labeling accuracy and their potential to solve concrete problems.

Comment: Algorithms are available at https://github.com/Clustering4Ever/Clustering4Eve
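
To make the idea in this abstract concrete, here is a minimal Python sketch of a mean-shift-style gradient ascent in which neighbors are searched only inside a hash bucket. It is not the authors' Spark/Scala implementation: the single random-projection hash, the k-nearest-neighbor mean update, and all names (`lsh_buckets`, `approx_mean_shift_step`, `approx_gradient_ascent`, `k`, `n_buckets`) are assumptions chosen for illustration.

```python
import math
import random
from collections import defaultdict


def lsh_buckets(points, n_buckets, rng):
    """Group point indices by a scalar random-projection hash (illustrative LSH)."""
    dim = len(points[0])
    direction = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    proj = [sum(d * x for d, x in zip(direction, p)) for p in points]
    lo, hi = min(proj), max(proj)
    # Fall back to width 1.0 when every projection is identical.
    width = (hi - lo) / n_buckets or 1.0
    buckets = defaultdict(list)
    for i, v in enumerate(proj):
        buckets[min(int((v - lo) / width), n_buckets - 1)].append(i)
    return buckets


def approx_mean_shift_step(points, k, n_buckets, rng):
    """Move each point to the mean of its k nearest neighbors inside its bucket."""
    shifted = list(points)
    for members in lsh_buckets(points, n_buckets, rng).values():
        for i in members:
            # Neighbor search is restricted to the bucket (the point itself included).
            nearest = sorted(members, key=lambda j: math.dist(points[i], points[j]))[:k]
            dim = len(points[i])
            shifted[i] = tuple(
                sum(points[j][d] for j in nearest) / len(nearest) for d in range(dim)
            )
    return shifted


def approx_gradient_ascent(points, k=5, n_buckets=2, n_iter=10, seed=0):
    """Repeat the approximate shift, drawing a fresh hash direction at every iteration."""
    rng = random.Random(seed)
    current = [tuple(p) for p in points]
    for _ in range(n_iter):
        current = approx_mean_shift_step(current, k, n_buckets, rng)
    return current
```

Because each point is compared only with the members of its own bucket, the per-point cost depends on the bucket size rather than on the size of the full dataset. A cluster-labeling step on the shifted points would still be needed and is left out of this sketch.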

arXiv.org e-Print Archive

HAL-Paris 13

Multiway clustering of 3-order tensor via affinity matrix

Authors: Andriantsiory Dina Faneva; Geloun Joseph Ben; Lebbah Mustapha
Publication date: 14/03/2023

We propose a new method of multiway clustering for 3-order tensors via an affinity matrix (MCAM). Based on a notion of similarity between the tensor slices and the spread of information of each slice, our model builds an affinity/similarity matrix on which we apply advanced clustering methods. The combination of all clusters of the three modes delivers the desired multiway clustering. Finally, MCAM achieves competitive results compared with other known algorithms on synthetic and real datasets.
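
As a rough illustration of building a slice-by-slice affinity matrix for one mode of a 3-order tensor, the Python sketch below flattens each slice and compares slices pairwise. The abstract does not specify the similarity measure, so cosine similarity is used here purely as an assumed stand-in; the "spread of information" term is not modelled, and the names `mode_slices`, `cosine`, and `slice_affinity` are hypothetical.

```python
import math


def mode_slices(tensor, mode):
    """Return the flattened slices of a nested-list 3-order tensor along mode 0, 1 or 2."""
    n0, n1, n2 = len(tensor), len(tensor[0]), len(tensor[0][0])
    if mode == 0:
        return [[tensor[i][j][k] for j in range(n1) for k in range(n2)] for i in range(n0)]
    if mode == 1:
        return [[tensor[i][j][k] for i in range(n0) for k in range(n2)] for j in range(n1)]
    if mode == 2:
        return [[tensor[i][j][k] for i in range(n0) for j in range(n1)] for k in range(n2)]
    raise ValueError("mode must be 0, 1 or 2")


def cosine(u, v):
    """Cosine similarity of two flat vectors; 0.0 if either vector is all zeros."""
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return sum(a * b for a, b in zip(u, v)) / norm if norm else 0.0


def slice_affinity(tensor, mode):
    """Square matrix whose (p, q) entry compares slice p with slice q of the given mode."""
    slices = mode_slices(tensor, mode)
    return [[cosine(u, v) for v in slices] for u in slices]
```

Calling `slice_affinity` once per mode gives three affinity matrices, one for each mode. Any clustering method that accepts a similarity matrix could then be run on each, and the three label sets combined into a multiway clustering.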

arXiv.org e-Print Archive

Relational Analysis for Clustering Consensus

Authors: Hamid Benhadda; Mustapha Lebbah; Nistor Grozavu; Younes Bennani
Publication venue: 'IntechOpen'
Publication date: 01/02/2010

IntechOpen

Crossref

A Distributed Rough Set Theory based Algorithm for an Efficient Big Data Pre-processing under the Spark Framework

Authors: Beck Gaël; Chelly Dagdia Zaineb; Lebbah Mustapha; Zarges Christine
Publication venue: IEEE Press
Publication date: 01/12/2017

Big Data reduction is a main point of interest across a wide variety of fields. This domain was further investigated when the difficulty in quickly acquiring the most useful information from the huge amount of data at hand was encountered. To achieve the task of data reduction, specifically feature selection, several state-of-the-art methods were proposed. However, most of them require additional information about the given data for thresholding, require noise levels to be specified, or even need a feature ranking procedure. Thus, it seems necessary to think about a more adequate feature selection technique which can extract features using information contained within the dataset alone. Rough Set Theory (RST) can be used as such a technique to discover data dependencies and to reduce the number of features contained in a dataset using the data alone, requiring no additional information. However, despite being a powerful feature selection technique, RST is computationally expensive and only practical for small datasets. Therefore, in this paper, we present a novel efficient distributed Rough Set Theory based algorithm for large-scale data pre-processing under the Spark framework. Our experimental results show the efficient applicability of our RST solution to Big Data without any significant information loss.
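
For readers unfamiliar with rough-set feature selection, the sketch below shows the classical dependency degree and a QuickReduct-style greedy search on a small in-memory table. It is a single-machine illustration only, not the distributed Spark algorithm of the paper, and the function names (`partition`, `dependency`, `quick_reduct`) and the dict-per-row data layout are assumptions.

```python
from collections import defaultdict


def partition(rows, attrs):
    """Group row indices into equivalence classes: rows that agree on every attribute in attrs."""
    classes = defaultdict(list)
    for idx, row in enumerate(rows):
        classes[tuple(row[a] for a in attrs)].append(idx)
    return list(classes.values())


def dependency(rows, attrs, decision):
    """Fraction of rows lying in classes with a single decision value (positive region / n)."""
    positive = sum(
        len(block)
        for block in partition(rows, attrs)
        if len({rows[i][decision] for i in block}) == 1
    )
    return positive / len(rows)


def quick_reduct(rows, features, decision):
    """Greedily add features until the subset matches the dependency of the full feature set."""
    target = dependency(rows, features, decision)
    selected = []
    while dependency(rows, selected, decision) < target:
        best = max(
            (f for f in features if f not in selected),
            key=lambda f: dependency(rows, selected + [f], decision),
        )
        selected.append(best)
    return selected
```

The search uses only the values in the table: no threshold, noise level, or external ranking is supplied. Each candidate evaluation re-partitions the whole table, which hints at why the approach becomes expensive as the number of rows and attributes grows.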

Crossref

Aberystwyth Research Portal

A Scalable and Effective Rough Set Theory based Approach for Big Data Pre-processing

Authors: Beck Gael; Chelly Dagdia Zaineb; Lebbah Mustapha; Zarges Christine
Publication date: 02/05/2020

A big challenge in the knowledge discovery process is to perform data pre-processing, specifically feature selection, on a large amount of data with a high-dimensional attribute set. A variety of techniques have been proposed in the literature to deal with this challenge, with varying degrees of success, as most of these techniques need further information about the given input data for thresholding, need noise levels to be specified, or use some feature ranking procedure. To overcome these limitations, rough set theory (RST) can be used to discover the dependencies within the data and to reduce the number of attributes in an input data set while using the data alone and requiring no supplementary information. However, when it comes to massive data sets, RST reaches its limits as it is highly computationally expensive. In this paper, we propose a scalable and effective rough set theory-based approach for large-scale data pre-processing, specifically for feature selection, under the Spark framework. In our detailed experiments, data sets with up to 10,000 attributes have been considered, revealing that our proposed solution achieves a good speedup and performs its feature selection task well without sacrificing performance, making it relevant to big data.

Crossref

Aberystwyth Research Portal

INRIA a CCSD electronic archive server

HAL-Paris 13

A Distributed Rough Set Theory Algorithm based on Locality Sensitive Hashing for an Efficient Big Data Pre-processing

Authors: Azzag Hanene; Beck Gaël; Chelly Dagdia Zaineb; Lebbah Mustapha; Zarges Christine
Publication venue: IEEE Press
Publication date: 01/01/2018

Aberystwyth Research Portal