Efficient clustering techniques for big data
Clustering is an essential data mining technique that divides observations into
groups where each group contains similar observations. K-Means is one of the
most popular clustering algorithms and has been in widespread use for over
fifty years. The majority of the running time in the original K-Means algorithm
(known as Lloyd’s algorithm) is spent on computing distances from each data
point to all cluster centres to find the closest centre to each data point. Due to
the current exponential growth of data, it has become necessary to improve K-Means
further so that it can cope with large-scale datasets, known as Big Data. Hence,
the main aim of this thesis is to improve the efficiency and scalability of Lloyd’s
K-Means.
One of the most effective techniques to accelerate K-Means is to use the triangle
inequality. Implementing such techniques on a reliable distributed model
creates a powerful combination. This combination can lead to an efficient and
highly scalable parallel version of K-Means that offers a practical solution to the
problem of clustering Big Data.
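As a rough illustration of the triangle-inequality idea (a minimal NumPy sketch of the standard pruning lemma used by variants such as Elkan's and Hamerly's algorithms, not the thesis's own implementation; all names are illustrative): if the distance between the currently best centre and another centre is at least twice the distance from the point to the best centre, the other centre cannot be closer and its distance need not be computed.

```python
import numpy as np

def assign_with_triangle_inequality(points, centres):
    """Assign each point to its closest centre, skipping distance computations
    that the triangle inequality proves unnecessary.

    Pruning lemma: if d(c_a, c_b) >= 2 * d(x, c_a), then d(x, c_b) >= d(x, c_a),
    so centre c_b cannot be closer to x than c_a."""
    # Pairwise centre-to-centre distances, computed once per iteration.
    centre_dist = np.linalg.norm(centres[:, None, :] - centres[None, :, :], axis=2)
    labels = np.empty(len(points), dtype=int)
    for i, x in enumerate(points):
        best = 0
        best_dist = np.linalg.norm(x - centres[0])
        for j in range(1, len(centres)):
            # Prune: centre j provably cannot beat the current best centre.
            if centre_dist[best, j] >= 2.0 * best_dist:
                continue
            d = np.linalg.norm(x - centres[j])
            if d < best_dist:
                best, best_dist = j, d
        labels[i] = best
    return labels
```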
MapReduce, and its popular open-source implementation known as Hadoop,
provides a distributed computing framework that efficiently stores, manages, and
processes large-scale datasets over a large cluster of commodity machines. Many
studies introduced a parallel implementation of Lloyd’s K-Means on Hadoop in
order to improve the algorithm’s scalability. This research examines methods
based on triangle inequality to achieve further improvements on the efficiency of
the parallel Lloyd’s K-Means on Hadoop.
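For orientation, one Lloyd's iteration maps naturally onto MapReduce roughly as follows (a plain-Python sketch of the map and reduce roles, not Hadoop's Java API; it assumes the current centres are broadcast to every mapper, e.g. via the distributed cache):

```python
def map_point(point, centres):
    """Mapper: emit (closest centre id, (point, 1)) for one data point."""
    distances = [sum((p - c) ** 2 for p, c in zip(point, centre)) for centre in centres]
    closest = distances.index(min(distances))
    return closest, (point, 1)

def reduce_centre(centre_id, values):
    """Reducer: average the points assigned to one centre to obtain its new position."""
    total, count = None, 0
    for point, n in values:
        total = list(point) if total is None else [t + p for t, p in zip(total, point)]
        count += n
    return centre_id, [t / count for t in total]
```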
Variants of K-Means that use the triangle inequality usually require extra information,
such as distance bounds and cluster assignments, from the previous iteration
to work efficiently. This is a challenging task to achieve on Hadoop for two reasons:
1) Hadoop does not directly support iterative algorithms; and 2) Hadoop does not
allow information to be exchanged between two consecutive iterations. Hence, two
techniques are proposed to give Hadoop the ability to pass information from one
iteration to the next. The first technique uses a data structure referred to as an
Extended Vector (EV), which appends the extra information to the original data
vector. The second technique stores the extra information in files, where each file
is referred to as a Bounds File (BF).
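A minimal sketch of what an Extended Vector record might carry is shown below; the field names and serialization format are illustrative assumptions, not the thesis's exact layout, but they convey how the previous iteration's state can travel with the data vector through a MapReduce job.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class ExtendedVector:
    """Sketch of an Extended Vector (EV): the original data vector plus the state
    a triangle-inequality K-Means variant needs from the previous iteration."""
    features: List[float]                 # the original data vector
    assignment: int = -1                  # cluster assigned in the previous iteration
    upper_bound: float = float("inf")     # upper bound on distance to the assigned centre
    lower_bounds: List[float] = field(default_factory=list)  # lower bounds to other centres

    def serialize(self) -> str:
        """Flatten to one text line so the record can flow through a MapReduce job."""
        parts = self.features + [self.assignment, self.upper_bound] + self.lower_bounds
        return ",".join(str(p) for p in parts)
```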
To evaluate the two proposed techniques, two K-Means variants are implemented
on Hadoop using each of them. Each variant is tested against a variable
number of clusters, dimensions, data points, and mappers. Furthermore, the
performance of various implementations of K-Means on Hadoop and Spark is investigated.
The results show a significant improvement in the efficiency of the
new implementations compared to Lloyd's K-Means on Hadoop on both real and
artificial datasets.
Fast Color Quantization Using Weighted Sort-Means Clustering
Color quantization is an important operation with numerous applications in
graphics and image processing. Most quantization methods are essentially based
on data clustering algorithms. However, despite its popularity as a general-purpose
clustering algorithm, k-means has not received much attention in the
color quantization literature because of its high computational requirements
and sensitivity to initialization. In this paper, a fast color quantization
method based on k-means is presented. The method involves several modifications
to the conventional (batch) k-means algorithm including data reduction, sample
weighting, and the use of triangle inequality to speed up the nearest neighbor
search. Experiments on a diverse set of images demonstrate that, with the
proposed modifications, k-means becomes very competitive with state-of-the-art
color quantization methods in terms of both effectiveness and efficiency.
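To make the sample-weighting modification concrete, the sketch below shows one weighted batch k-means step in NumPy; it assumes (as an illustration, not the paper's exact algorithm) that each unique colour carries a weight equal to its pixel count after the data-reduction pass.

```python
import numpy as np

def weighted_kmeans_step(colours, weights, centres):
    """One weighted (batch) k-means step: assign each unique colour to its
    nearest centre, then recompute centres as weight-averaged means."""
    # Assignment step: distance of every colour to every centre.
    dists = np.linalg.norm(colours[:, None, :] - centres[None, :, :], axis=2)
    labels = dists.argmin(axis=1)
    # Update step: weighted mean of the colours assigned to each centre.
    new_centres = centres.copy()
    for k in range(len(centres)):
        mask = labels == k
        if mask.any():
            w = weights[mask][:, None]
            new_centres[k] = (w * colours[mask]).sum(axis=0) / w.sum()
    return labels, new_centres
```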
Entropy-scaling search of massive biological data
Many datasets exhibit a well-defined structure that can be exploited to
design faster search tools, but it is not always clear when such acceleration
is possible. Here, we introduce a framework for similarity search based on
characterizing a dataset's entropy and fractal dimension. We prove that
searching scales in time with metric entropy (number of covering hyperspheres),
if the fractal dimension of the dataset is low, and scales in space with the
sum of metric entropy and information-theoretic entropy (randomness of the
data). Using these ideas, we present accelerated versions of standard tools,
with no loss in specificity and little loss in sensitivity, for use in three
domains---high-throughput drug screening (Ammolite, 150x speedup), metagenomics
(MICA, 3.5x speedup of DIAMOND [3,700x BLASTX]), and protein structure search
(esFragBag, 10x speedup of FragBag). Our framework can be used to achieve
"compressive omics," and the general theory can be readily applied to data
science problems outside of biology.
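The coarse-to-fine search behind this entropy scaling can be sketched as follows (a generic illustration of searching over covering spheres with a triangle-inequality bound, not the Ammolite/MICA/esFragBag code; all names are assumptions):

```python
def entropy_scaling_search(query, clusters, dist, radius, cluster_radius):
    """Range search over a pre-clustered dataset. Each cluster is a pair
    (representative, members); `cluster_radius` bounds the distance from any
    member to its representative."""
    hits = []
    for representative, members in clusters:
        # Triangle inequality: a member can only lie within `radius` of the query
        # if its representative lies within `radius + cluster_radius`.
        if dist(query, representative) <= radius + cluster_radius:
            hits.extend(m for m in members if dist(query, m) <= radius)
    return hits
```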
Faster K-Means Cluster Estimation
There has been considerable work on improving the popular clustering algorithm
K-means in terms of both mean squared error (MSE) and speed. However, most
k-means variants compute the distance from each data point to every
cluster centroid in every iteration. We propose a fast heuristic to overcome
this bottleneck with only marginal increase in MSE. We observe that across all
iterations of K-means, a data point changes its membership only among a small
subset of clusters. Our heuristic predicts such clusters for each data point by
looking at nearby clusters after the first iteration of k-means. We augment
well-known variants of k-means with our heuristic to demonstrate its
effectiveness. For various synthetic and real-world datasets, our heuristic
achieves a speed-up of up to 3 times when compared to efficient variants of
k-means.
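A minimal sketch of this candidate-set heuristic (NumPy, illustrative parameter names, not the authors' code): after the first full iteration, each point remembers its few nearest centroids, and later assignment steps only measure distances to those candidates.

```python
import numpy as np

def build_candidate_sets(points, centres, num_candidates):
    """After the first full iteration, keep each point's `num_candidates`
    nearest centres as the only clusters it may move between later."""
    dists = np.linalg.norm(points[:, None, :] - centres[None, :, :], axis=2)
    return np.argsort(dists, axis=1)[:, :num_candidates]

def assign_restricted(points, centres, candidates):
    """Assignment step that only measures distances to each point's candidate centres."""
    labels = np.empty(len(points), dtype=int)
    for i, x in enumerate(points):
        cand = candidates[i]
        d = np.linalg.norm(centres[cand] - x, axis=1)
        labels[i] = cand[d.argmin()]
    return labels
```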
Indexing Metric Spaces for Exact Similarity Search
With the continued digitalization of societal processes, we are seeing an
explosion in available data. This is referred to as big data. In a research
setting, three aspects of the data are often viewed as the main sources of
challenges when attempting to enable value creation from big data: volume,
velocity and variety. Many studies address volume or velocity, while far fewer
concern variety. A metric space is ideal for addressing variety
because it can accommodate any type of data as long as its associated distance
notion satisfies the triangle inequality. To accelerate search in metric spaces,
a collection of indexing techniques for metric data has been proposed.
However, existing surveys each offer only narrow coverage, and no
comprehensive empirical study of these techniques exists. We offer a survey of
all existing metric indexes that support exact similarity search by i)
summarizing the partitioning, pruning, and validation techniques
used by metric indexes, ii) providing time and storage complexity analyses
of index construction, and iii) reporting on a comprehensive empirical
comparison of their similarity query processing performance. Empirical
comparisons are used to evaluate index performance during search because
complexity analysis alone reveals few differences in similarity query
processing, and because query performance depends on pruning and validation
abilities that are tied to the data distribution. This article aims at revealing
the strengths and weaknesses of different indexing techniques in order to
offer guidance on selecting an appropriate indexing technique for a given
setting, and at directing future research on metric indexes.
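The pruning common to many such indexes can be illustrated with a simple pivot filter (a generic sketch of triangle-inequality filtering for exact range search, not any specific index from the survey; all names are assumptions):

```python
def pivot_filter_search(query, radius, objects, pivot_dists, pivot, dist):
    """Exact range search with pivot-based pruning in a metric space.
    `pivot_dists[i]` holds the precomputed distance d(objects[i], pivot)."""
    d_qp = dist(query, pivot)
    results = []
    for obj, d_op in zip(objects, pivot_dists):
        # Lower bound from the triangle inequality: |d(q, p) - d(o, p)| <= d(q, o).
        # If the bound already exceeds the radius, the object cannot qualify.
        if abs(d_qp - d_op) > radius:
            continue
        if dist(query, obj) <= radius:
            results.append(obj)
    return results
```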