Clustering is an essential data mining technique that divides observations into
groups where each group contains similar observations. K-Means is one of the
most popular and widely used clustering algorithms that has been used for over
fifty years. The majority of the running time in the original K-Means algorithm
(known as Lloyd’s algorithm) is spent on computing distances from each data
point to all cluster centres to find the closest centre to each data point. Due to
the current exponential growth of the data, it became a necessity to improve KMeans
even further to cope with large-scale datasets, known as Big Data. Hence,
the main aim of this thesis is to improve the efficiency and scalability of Lloyd’s
K-Means.
One of the most efficient techniques to accelerate K-Means is to use triangle
inequality. Implementing such efficient techniques on a reliable distributed model
creates a powerful combination. This combination can lead to an efficient and
highly scalable parallel version of K-Means that offers a practical solution to the
problem of clustering Big Data.
MapReduce, and its popular open-source implementation known as Hadoop,
provides a distributed computing framework that efficiently stores, manages, and
processes large-scale datasets over a large cluster of commodity machines. Many
studies introduced a parallel implementation of Lloyd’s K-Means on Hadoop in
order to improve the algorithm’s scalability. This research examines methods
based on triangle inequality to achieve further improvements on the efficiency of
the parallel Lloyd’s K-Means on Hadoop.
Variants of K-Means that use triangle inequality usually require extra information,
such as distance bounds and cluster assignments, from the previous iteration
to work efficiently. This is a challenging task to achieve on Hadoop for two reasons:
1) Hadoop does not directly support iterative algorithms; and 2) Hadoop does not
allow information to be exchanged between two consecutive iterations. Hence, two
techniques are proposed to give Hadoop the ability to pass information from an
iteration to the next. The first technique uses a data structure referred to as an
Extended Vector (EV), that appends the extra information to the original data
vector. The second technique stores the extra information on files where each file
is referred to as a Bounds File (BF).
To evaluate the two proposed techniques, two K-Means variants are implemented
on Hadoop using the two techniques. Each variant is tested against variable
number of clusters, dimensions, data points, and mappers. Furthermore, the
performance of various implementations of K-Means on Hadoop and Spark is investigated.
The results show a significant improvement on the efficiency of the
new implementations compared to the Lloyd’s K-Means on Hadoop with real and
artificial datasets