Efficient clustering techniques for big data
Clustering is an essential data mining technique that divides observations into
groups where each group contains similar observations. K-Means is one of the
most popular and widely used clustering algorithms that has been used for over
fifty years. The majority of the running time in the original K-Means algorithm
(known as Lloyd’s algorithm) is spent on computing distances from each data
point to all cluster centres to find the closest centre to each data point. Due to
the current exponential growth of data, it has become necessary to improve
K-Means even further to cope with large-scale datasets, known as Big Data.
Hence, the main aim of this thesis is to improve the efficiency and scalability
of Lloyd's K-Means.
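The cost profile described above can be seen in a minimal Lloyd's K-Means sketch (plain Python, illustrative only; the naive seeding from the first k points is ours, not the thesis's method):

```python
def lloyd_kmeans(points, k, iters=20):
    """Minimal Lloyd's K-Means (illustrative sketch, not the thesis code).

    The assignment step computes the distance from every point to every
    centre -- the O(n * k * d) cost that dominates the running time.
    """
    centres = points[:k]  # naive seeding, for illustration only
    for _ in range(iters):
        # Assignment step: the expensive full distance computation.
        clusters = [[] for _ in range(k)]
        for p in points:
            d2 = [sum((a - b) ** 2 for a, b in zip(p, c)) for c in centres]
            clusters[d2.index(min(d2))].append(p)
        # Update step: move each centre to the mean of its cluster.
        for j, cl in enumerate(clusters):
            if cl:
                centres[j] = tuple(sum(xs) / len(cl) for xs in zip(*cl))
    return centres

pts = [(0.0, 0.0), (0.1, 0.2), (5.0, 5.0), (5.2, 4.9)]
centres = lloyd_kmeans(pts, 2)
```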
One of the most efficient techniques for accelerating K-Means is the triangle
inequality. Implementing such techniques on a reliable distributed model
creates a powerful combination. This combination can lead to an efficient and
highly scalable parallel version of K-Means that offers a practical solution to the
problem of clustering Big Data.
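As a concrete illustration, one standard bound from this family (a simplified, single-bound form of Elkan-style pruning, not necessarily the thesis's exact variant): if d(x, c) ≤ ½·d(c, c′) for the currently closest centre c, the triangle inequality guarantees c′ cannot be closer, so d(x, c′) need not be computed:

```python
import math

def dist(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def assign_with_pruning(points, centres):
    """Assignment step accelerated by the triangle inequality.

    If d(x, c) <= 0.5 * d(c, c'), then d(x, c') >= d(c, c') - d(x, c)
    >= d(x, c), so c' cannot beat c and its distance is never computed.
    """
    # Pre-compute half the pairwise distances between centres.
    half = [[0.5 * dist(c, c2) for c2 in centres] for c in centres]
    labels, skipped = [], 0
    for p in points:
        best = 0
        best_d = dist(p, centres[0])
        for j in range(1, len(centres)):
            if best_d <= half[best][j]:   # triangle inequality: skip c_j
                skipped += 1
                continue
            d = dist(p, centres[j])
            if d < best_d:
                best, best_d = j, d
        labels.append(best)
    return labels, skipped

centres = [(0.0, 0.0), (10.0, 10.0)]
points = [(0.1, 0.1), (0.2, 0.0), (9.9, 10.1)]
labels, skipped = assign_with_pruning(points, centres)
```

The pruning never changes the assignment; it only avoids distance computations that provably cannot change the answer.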
MapReduce, and its popular open-source implementation known as Hadoop,
provides a distributed computing framework that efficiently stores, manages, and
processes large-scale datasets over a large cluster of commodity machines. Many
studies have introduced parallel implementations of Lloyd's K-Means on Hadoop in
order to improve the algorithm's scalability. This research examines methods
based on the triangle inequality to further improve the efficiency of
parallel Lloyd's K-Means on Hadoop.
Variants of K-Means that use the triangle inequality usually require extra information,
such as distance bounds and cluster assignments, from the previous iteration
to work efficiently. This is a challenging task to achieve on Hadoop for two reasons:
1) Hadoop does not directly support iterative algorithms; and 2) Hadoop does not
allow information to be exchanged between two consecutive iterations. Hence, two
techniques are proposed to give Hadoop the ability to pass information from one
iteration to the next. The first technique uses a data structure, referred to as an
Extended Vector (EV), that appends the extra information to the original data
vector. The second technique stores the extra information in files, where each file
is referred to as a Bounds File (BF).
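The abstract does not spell out the EV layout, so the following is only a plausible sketch of the idea, assuming a Hadoop-style text record: the carry-over state rides along with each data vector between iterations (all field names here are hypothetical):

```python
class ExtendedVector:
    """Hypothetical Extended Vector (EV): the original data vector plus the
    state a triangle-inequality K-Means variant needs between iterations
    (cluster assignment and distance bounds)."""

    def __init__(self, features, assignment=-1, upper=float("inf"), lower=0.0):
        self.features = list(features)   # the original data vector
        self.assignment = assignment     # index of the assigned centre
        self.upper = upper               # upper bound on d(x, assigned centre)
        self.lower = lower               # lower bound on d(x, other centres)

    def to_line(self):
        # Serialise to one text line, as a Hadoop mapper could emit it.
        parts = self.features + [self.assignment, self.upper, self.lower]
        return ",".join(map(str, parts))

    @classmethod
    def from_line(cls, line, dim):
        vals = line.split(",")
        return cls([float(v) for v in vals[:dim]],
                   int(vals[dim]), float(vals[dim + 1]), float(vals[dim + 2]))

ev = ExtendedVector([1.5, 2.0], assignment=3, upper=0.7, lower=2.1)
ev2 = ExtendedVector.from_line(ev.to_line(), dim=2)
```

The BF alternative would keep the same fields, but written to side files keyed by record ID instead of being appended to every vector.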
To evaluate the two proposed techniques, two K-Means variants are implemented
on Hadoop using the two techniques. Each variant is tested against a variable
number of clusters, dimensions, data points, and mappers. Furthermore, the
performance of various implementations of K-Means on Hadoop and Spark is investigated.
The results show a significant improvement in the efficiency of the
new implementations compared to Lloyd's K-Means on Hadoop on both real and
artificial datasets.
Parallel Algorithms for Geometric Graph Problems
We give algorithms for geometric graph problems in the modern parallel models
inspired by MapReduce. For example, for the Minimum Spanning Tree (MST) problem
over a set of points in the two-dimensional space, our algorithm computes a
(1+ε)-approximate MST. Our algorithms work in a constant number of
rounds of communication, while using total space and communication proportional
to the size of the data (linear space and near-linear time algorithms). In
contrast, for general graphs, achieving the same result for MST (or even
connectivity) remains a challenging open problem, despite drawing significant
attention in recent years.
We develop a general algorithmic framework that, besides MST, also applies to
Earth-Mover Distance (EMD) and the transportation cost problem. Our algorithmic
framework has implications beyond the MapReduce model. For example it yields a
new algorithm for computing the EMD cost in the plane in near-linear time.
We note that while Sharathkumar and Agarwal recently
developed a near-linear time algorithm for (1+ε)-approximating EMD,
our algorithm is fundamentally different, and, for example, also solves the
transportation (cost) problem, raised as an open question in their work.
Furthermore, our algorithm immediately gives a (1+ε)-approximation
algorithm with sublinear space in the streaming-with-sorting model with a
constant number of passes. As such, it is tempting to conjecture that the
parallel models may also constitute a concrete playground in the quest for
efficient algorithms for EMD (and other similar problems) in the vanilla
streaming model, a well-known open problem.
Efficient k-means++ approximation with MapReduce
k-means is undoubtedly one of the most popular clustering algorithms owing to its simplicity and efficiency. However, this algorithm is highly sensitive to the chosen initial centers, and thus a proper initialization is crucial for obtaining an ideal solution. To address this problem, k-means++ is proposed to sequentially choose the centers so as to achieve a solution that is provably close to the optimal one. However, due to its weak scalability, k-means++ becomes inefficient as the size of data increases. To improve its scalability and efficiency, this paper presents the MapReduce k-means++ method, which can drastically reduce the number of MapReduce jobs by using only one MapReduce job to obtain k centers. The k-means++ initialization algorithm is executed in the Mapper phase and the weighted k-means++ initialization algorithm is run in the Reducer phase. As this new MapReduce k-means++ method replaces the iterations among multiple machines with a single machine, it can reduce the communication and I/O costs significantly. We also prove that the proposed MapReduce k-means++ method obtains an O(α²) approximation to the optimal solution of k-means. To reduce the expensive distance computation of the proposed method, we further propose a pruning strategy that can avoid a large number of redundant distance computations. Extensive experiments on real and synthetic data are conducted, and the performance results indicate that the proposed MapReduce k-means++ method is much more efficient and can achieve a good approximation.
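The single-job structure can be sketched as a local simulation (assuming D²-weighted sampling in both phases; the seeding, split handling and helper names are ours, not the paper's):

```python
import random

def d2(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def weighted_kmeanspp(points, weights, k, rng):
    """k-means++ (D^2) seeding over weighted points."""
    centres = [rng.choices(points, weights=weights)[0]]
    while len(centres) < k:
        w = [wi * min(d2(p, c) for c in centres)
             for p, wi in zip(points, weights)]
        if sum(w) == 0:          # fewer than k distinct points left
            break
        centres.append(rng.choices(points, weights=w)[0])
    return centres

def mapreduce_kmeanspp(splits, k, seed=0):
    rng = random.Random(seed)
    # Mapper phase: run k-means++ on each split; weight each chosen
    # centre by the number of split points closest to it.
    coreset, weights = [], []
    for split in splits:
        centres = weighted_kmeanspp(split, [1.0] * len(split), k, rng)
        counts = [0] * len(centres)
        for p in split:
            counts[min(range(len(centres)),
                       key=lambda i: d2(p, centres[i]))] += 1
        coreset += centres
        weights += counts
    # Reducer phase: weighted k-means++ over the small coreset.
    return weighted_kmeanspp(coreset, weights, k, rng)

splits = [[(0.0, 0.0)] * 3 + [(7.0, 7.0)] * 2,
          [(0.0, 0.0)] * 2 + [(7.0, 7.0)] * 3]
final = mapreduce_kmeanspp(splits, k=2)
```

Only the small weighted coreset crosses the mapper/reducer boundary, which is where the claimed communication and I/O savings come from.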
Recommended from our members
Efficient clustering techniques on Hadoop and Spark
Software services based on large-scale distributed systems
demand continuous and decentralised solutions for achieving system
consistency and providing operational monitoring. Epidemic data aggregation
algorithms provide decentralised, scalable and fault-tolerant solutions
that can be used for system-wide tasks such as global state determination,
monitoring and consensus. Existing continuous epidemic algorithms either
periodically restart at fixed epochs or apply changes in the system state
instantly, producing less accurate approximations. This work introduces
an innovative mechanism without fixed epochs that monitors the system
state and restarts upon detecting system convergence or divergence.
The mechanism produces correct aggregation with an approximation
error as small as desired. The proposed solution is validated and analysed
by means of simulations under static and dynamic network conditions.
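The underlying epidemic aggregation primitive (without the restart mechanism, which is the paper's contribution) can be illustrated with a minimal push-sum simulation; a restart mechanism of the kind described would monitor the spread of the local estimates and restart once it falls below a threshold:

```python
import random

def push_sum_average(values, rounds=100, seed=1):
    """Simulated push-sum gossip: each node keeps a pair (s, w) and its
    estimate s/w converges to the global average with no coordinator.
    Every exchange conserves sum(s) and sum(w)."""
    rng = random.Random(seed)
    n = len(values)
    s = [float(v) for v in values]
    w = [1.0] * n
    for _ in range(rounds):
        for i in range(n):
            j = rng.randrange(n)        # random gossip partner
            s[i] *= 0.5                 # keep half of the mass...
            w[i] *= 0.5
            s[j] += s[i]                # ...and push the other half to j
            w[j] += w[i]
    estimates = [si / wi for si, wi in zip(s, w)]
    # An epoch-free restart mechanism would monitor
    # max(estimates) - min(estimates) and restart the aggregation
    # once this spread drops below the desired error.
    return estimates

est = push_sum_average([10, 20, 30, 40])
```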
Solving k-center Clustering (with Outliers) in MapReduce and Streaming, almost as Accurately as Sequentially.
Center-based clustering is a fundamental primitive for data analysis and becomes very challenging for large datasets. In this paper, we focus on the popular k-center variant which, given a set S of points from some metric space and a parameter k < |S|, requires identifying a subset of k centers minimizing the maximum distance of any point of S from its closest center. We present MapReduce and Streaming algorithms for k-center, also in the variant that allows a given number of outliers to be disregarded; the algorithms yield solutions whose approximation ratios are a mere additive term ε away from those achievable by the best known polynomial-time sequential algorithms, a result that substantially improves upon the state of the art. Our algorithms are rather simple and adapt to the intrinsic complexity of the dataset, captured by the doubling dimension D of the metric space. Specifically, our analysis shows that the algorithms become very space-efficient for the important case of small (constant) D. These theoretical results are complemented with a set of experiments on real-world and synthetic datasets of over a billion points, which show that our algorithms yield better quality solutions over the state of the art while featuring excellent scalability, and that they also lend themselves to sequential implementations much faster than existing ones.
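The composable two-round structure can be illustrated with the classical greedy farthest-first traversal (Gonzalez's 2-approximation for k-center) applied per partition and then on the union; this is only a sketch of the shape of such algorithms, not the paper's algorithm with its stated guarantees:

```python
def d2(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def farthest_first(points, k):
    """Gonzalez's greedy traversal: classical 2-approximation for k-center."""
    centres = [points[0]]
    while len(centres) < min(k, len(points)):
        centres.append(max(points,
                           key=lambda p: min(d2(p, c) for c in centres)))
    return centres

def two_round_kcenter(partitions, k, k_local):
    # Round 1 ("map"): each partition summarises itself by a small coreset.
    coreset = [c for part in partitions for c in farthest_first(part, k_local)]
    # Round 2 ("reduce"): solve k-center on the union of the coresets.
    return farthest_first(coreset, k)

parts = [[(0.0, 0.0), (10.0, 0.0), (0.0, 10.0), (0.1, 0.1)],
         [(10.0, 0.2), (0.2, 10.0)]]
centres = two_round_kcenter(parts, k=3, k_local=3)
```

Taking k_local larger than k is what lets the coreset absorb the error introduced by partitioning.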
Efficient Processing of k Nearest Neighbor Joins using MapReduce
k nearest neighbor join (kNN join), designed to find k nearest neighbors from
a dataset S for every object in another dataset R, is a primitive operation
widely adopted by many data mining applications. As a combination of the k
nearest neighbor query and the join operation, kNN join is an expensive
operation. Given the increasing volume of data, it is difficult to perform a
kNN join on a centralized machine efficiently. In this paper, we investigate
how to perform kNN join using MapReduce which is a well-accepted framework for
data-intensive applications over clusters of computers. In brief, the mappers
cluster objects into groups; the reducers perform the kNN join on each group of
objects separately. We design an effective mapping mechanism that exploits
pruning rules for distance filtering, and hence reduces both the shuffling and
computational costs. To reduce the shuffling cost, we propose two approximate
algorithms to minimize the number of replicas. Extensive experiments on our
in-house cluster demonstrate that our proposed methods are efficient, robust
and scalable.
Comment: VLDB201
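The map/reduce split described above can be sketched with a simple pivot-based grouping (a local simulation; the pivot scheme and names are ours, and without replication of boundary S-objects the result is only approximate, as the paper's approximate variants acknowledge):

```python
def d2(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def nearest_pivot(p, pivots):
    return min(range(len(pivots)), key=lambda i: d2(p, pivots[i]))

def grouped_knn_join(R, S, k, pivots):
    """Sketch of the kNN join structure: both datasets are partitioned by
    nearest pivot (map phase), then kNN is solved within each group
    (reduce phase)."""
    groups_R, groups_S = {}, {}
    for r in R:
        groups_R.setdefault(nearest_pivot(r, pivots), []).append(r)
    for s in S:
        groups_S.setdefault(nearest_pivot(s, pivots), []).append(s)
    result = {}
    for g, rs in groups_R.items():          # one "reducer" per group
        candidates = groups_S.get(g, [])
        for r in rs:
            result[r] = sorted(candidates, key=lambda s: d2(r, s))[:k]
    return result

pivots = [(0.0, 0.0), (10.0, 10.0)]
R = [(1.0, 1.0)]
S = [(0.0, 0.0), (2.0, 2.0), (9.0, 9.0)]
nn = grouped_knn_join(R, S, k=2, pivots=pivots)
```

The shuffling cost corresponds to how many S-objects each group receives, which is what the paper's pruning rules and replica-minimisation target.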
Accurate MapReduce Algorithms for k-Median and k-Means in General Metric Spaces
Center-based clustering is a fundamental primitive for data analysis and becomes very challenging for large datasets. In this paper, we focus on the popular k-median and k-means variants which, given a set P of points from a metric space and a parameter k < |P|, require identifying a set S of k centers minimizing, respectively, the sum of the distances and of the squared distances of all points in P from their closest centers. Our specific focus is on general metric spaces, for which it is reasonable to require that the centers belong to the input set (i.e., S ⊆ P). We present coreset-based 3-round distributed approximation algorithms for the above problems using the MapReduce computational model. The algorithms are rather simple and obliviously adapt to the intrinsic complexity of the dataset, captured by the doubling dimension D of the metric space. Remarkably, the algorithms attain approximation ratios that can be made arbitrarily close to those achievable by the best known polynomial-time sequential approximations, and they are very space efficient for small D, requiring local memory sizes substantially sublinear in the input size. To the best of our knowledge, no previous distributed approaches were able to attain similar quality-performance guarantees in general metric spaces.
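The coreset idea for the k-median case (with centers restricted to the input set) can be sketched as follows; the per-partition summary and the brute-force final round are simplifications of ours, viable here only because the coreset is tiny, and not the paper's algorithm:

```python
import itertools
import math

def farthest_first(points, k):
    # Greedy farthest-first traversal, used here only to pick summary points.
    centres = [points[0]]
    while len(centres) < min(k, len(points)):
        centres.append(max(points,
                           key=lambda p: min(math.dist(p, c) for c in centres)))
    return centres

def coreset_kmedian(partitions, k, k_local):
    """Sketch of a coreset-based distributed k-median.
    Round 1: each partition is summarised by k_local weighted points
    (weight = number of partition points closest to each summary point).
    Round 2: weighted k-median solved on the union of the summaries."""
    coreset = []
    for part in partitions:
        centres = farthest_first(part, k_local)
        counts = [0] * len(centres)
        for p in part:
            counts[min(range(len(centres)),
                       key=lambda i: math.dist(p, centres[i]))] += 1
        coreset += list(zip(centres, counts))
    best, best_cost = None, float("inf")
    for cand in itertools.combinations([p for p, _ in coreset], k):
        cost = sum(wt * min(math.dist(p, c) for c in cand)
                   for p, wt in coreset)
        if cost < best_cost:
            best, best_cost = set(cand), cost
    return best

parts = [[(0.0, 0.0)] * 3 + [(8.0, 8.0)] * 2,
         [(0.0, 0.0)] * 2 + [(8.0, 8.0)] * 3]
medians = coreset_kmedian(parts, k=2, k_local=2)
```

The weights let the final round account for every original point without ever seeing it, which is why local memory can stay substantially sublinear in the input size.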