98 research outputs found
A Distributed and Approximated Nearest Neighbors Algorithm for an Efficient Large Scale Mean Shift Clustering
In this paper we target the class of modal clustering methods where clusters
are defined in terms of the local modes of the probability density function
which generates the data. The most well-known modal clustering method is the
k-means clustering. Mean Shift clustering is a generalization of the k-means
clustering which computes arbitrarily shaped clusters, defined as the basins
of attraction to the local modes created by the density gradient ascent paths.
Despite its potential, the Mean Shift approach is a computationally expensive
method for unsupervised learning. Thus, we introduce two contributions aiming
to provide clustering algorithms with a linear time complexity, as opposed to
the quadratic time complexity for the exact Mean Shift clustering. First, we
propose a scalable procedure to approximate the density gradient ascent.
Second, we present a scalable cluster labeling technique. Both
propositions are based on Locality Sensitive Hashing (LSH) to approximate
nearest neighbors. These two techniques may be used for moderate-sized
datasets. Furthermore, we show that using our proposed approximations of the
density gradient ascent as a pre-processing step in other clustering methods
can also improve dedicated classification metrics. For the latter, a
distributed implementation, written for the Spark/Scala ecosystem, is proposed.
For all these considered clustering methods, we present experimental results
illustrating their labeling accuracy and their potential to solve concrete
problems.
Comment: Algorithms are available at
https://github.com/Clustering4Ever/Clustering4Eve
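To make the idea of an LSH-approximated gradient ascent concrete, here is a minimal Python sketch. It is not the authors' implementation (theirs targets Spark/Scala): it assumes a simple random-projection hash and, as a stand-in for approximate nearest neighbors, treats all points that fall in the same hash bucket as neighbors. The names `lsh_key` and `approx_mean_shift` and the parameters `n_hashes` and `width` are hypothetical. Each pass hashes every point once and averages each bucket once, so the cost per pass is linear in the number of points.

```python
import math
import random
from collections import defaultdict


def lsh_key(point, projections, offsets, width):
    # Random-projection hash: nearby points tend to share the same key.
    return tuple(
        math.floor((sum(w * x for w, x in zip(proj, point)) + b) / width)
        for proj, b in zip(projections, offsets)
    )


def approx_mean_shift(points, n_hashes=2, width=4.0, iterations=5, seed=0):
    """Illustrative sketch: shift each point to the mean of its LSH bucket."""
    rng = random.Random(seed)
    dim = len(points[0])
    current = [list(p) for p in points]
    for _ in range(iterations):
        # Draw fresh hash functions each pass (an assumption of this sketch),
        # so that points near a bucket boundary can still meet later.
        projections = [[rng.gauss(0, 1) for _ in range(dim)]
                       for _ in range(n_hashes)]
        offsets = [rng.uniform(0, width) for _ in range(n_hashes)]
        keys = [lsh_key(p, projections, offsets, width) for p in current]
        buckets = defaultdict(list)
        for key, p in zip(keys, current):
            buckets[key].append(p)
        # One mean per bucket, computed once and shared by its members.
        means = {
            key: [sum(col) / len(members) for col in zip(*members)]
            for key, members in buckets.items()
        }
        current = [list(means[key]) for key in keys]
    return current
```

Calling `approx_mean_shift(points)` returns one shifted position per input point; points that end up at the same position can then be given the same cluster label. The bucket width plays a role similar to a bandwidth, and a width that is too large relative to the gap between clusters will merge them.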
Multiway clustering of 3-order tensor via affinity matrix
We propose a new method of multiway clustering for 3-order tensors via
affinity matrix (MCAM). Based on a notion of similarity between the tensor
slices and the spread of information of each slice, our model builds an
affinity/similarity matrix on which we apply advanced clustering methods. The
combination of all clusters of the three modes delivers the desired multiway
clustering. Finally, MCAM achieves competitive results compared with other
known algorithms on synthetic and real datasets.
A Distributed Rough Set Theory based Algorithm for an Efficient Big Data Pre-processing under the Spark Framework
Big Data reduction is a main point of interest across a wide variety of fields. This domain was further investigated when the difficulty in quickly acquiring the most useful information from the huge amount of data at hand was encountered. To achieve the task of data reduction, specifically feature selection, several state-of-the-art methods were proposed. However, most of them require additional information about the given data for thresholding, need noise levels to be specified, or even need a feature ranking procedure. Thus, it seems necessary to think about a more adequate feature selection technique which can extract features using information contained within the dataset alone. Rough Set Theory (RST) can be used as such a technique to discover data dependencies and to reduce the number of features contained in a dataset using the data alone, requiring no additional information. However, despite being a powerful feature selection technique, RST is computationally expensive and only practical for small datasets. Therefore, in this paper, we present a novel efficient distributed Rough Set Theory based algorithm for large-scale data pre-processing under the Spark framework. Our experimental results show the efficient applicability of our RST solution to Big Data without any significant information loss.
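As a small illustration of how RST measures data dependencies using the data alone, the sketch below computes the textbook dependency degree, i.e. the fraction of rows whose decision value is fully determined by a chosen feature subset (the size of the positive region divided by the size of the dataset). This is a single-machine toy in Python, not the distributed Spark algorithm described above; the function name `dependency_degree` and the toy columns are hypothetical.

```python
from collections import defaultdict


def dependency_degree(rows, features, decision):
    """Fraction of rows whose decision is determined by `features`."""
    # Group rows into equivalence classes: rows that agree on every
    # selected feature are indiscernible from one another.
    classes = defaultdict(list)
    for row in rows:
        classes[tuple(row[f] for f in features)].append(row[decision])
    # A class belongs to the positive region when all of its rows
    # carry the same decision value.
    positive = sum(
        len(labels) for labels in classes.values() if len(set(labels)) == 1
    )
    return positive / len(rows)


# Toy decision table (hypothetical data).
rows = [
    {"outlook": "sunny", "windy": "no", "play": "yes"},
    {"outlook": "sunny", "windy": "yes", "play": "no"},
    {"outlook": "rainy", "windy": "no", "play": "yes"},
    {"outlook": "rainy", "windy": "yes", "play": "no"},
]
```

On this table, `windy` alone reaches the same dependency degree as both features together, while `outlook` alone determines nothing, so a feature selection step could keep `windy` and drop `outlook`. No threshold or noise level has to be supplied.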
A Scalable and Effective Rough Set Theory based Approach for Big Data Pre-processing
A big challenge in the knowledge discovery process is to perform data pre-processing, specifically feature selection, on a large amount of data and high dimensional attribute set. A variety of techniques have been proposed in the literature to deal with this challenge with different degrees of success as most of these techniques need further information about the given input data for thresholding, need to specify noise levels or use some feature ranking procedures. To overcome these limitations, rough set theory (RST) can be used to discover the dependency within the data and reduce the number of attributes enclosed in an input data set while using the data alone and requiring no supplementary information. However, when it comes to massive data sets, RST reaches its limits as it is highly computationally expensive. In this paper, we propose a scalable and effective rough set theory-based approach for large-scale data pre-processing, specifically for feature selection, under the Spark framework. In our detailed experiments, data sets with up to 10,000 attributes have been considered, revealing that our proposed solution achieves a good speedup and performs its feature selection task well without sacrificing performance, thus making it relevant to big data.