396 research outputs found
Efficient Processing of k Nearest Neighbor Joins using MapReduce
k nearest neighbor join (kNN join), designed to find k nearest neighbors from
a dataset S for every object in another dataset R, is a primitive operation
widely adopted by many data mining applications. As a combination of the k
nearest neighbor query and the join operation, kNN join is an expensive
operation. Given the increasing volume of data, it is difficult to perform a
kNN join on a centralized machine efficiently. In this paper, we investigate
how to perform kNN join using MapReduce which is a well-accepted framework for
data-intensive applications over clusters of computers. In brief, the mappers
cluster objects into groups; the reducers perform the kNN join on each group of
objects separately. We design an effective mapping mechanism that exploits
pruning rules for distance filtering, and hence reduces both the shuffling and
computational costs. To reduce the shuffling cost, we propose two approximate
algorithms to minimize the number of replicas. Extensive experiments on our
in-house cluster demonstrate that our proposed methods are efficient, robust
and scalable.Comment: VLDB201
Enhancing SpatialHadoop with Closest Pair Queries
Given two datasets P and Q, the K Closest Pair Query (KCPQ) finds the K closest pairs of objects from P ×Q. It is an operation widely adopted by many spatial and GIS applications. As a combination of the K Nearest Neighbor (KNN) and the spatial join queries, KCPQ is an expensive operation. Given the increasing volume of spatial data, it is difficult to perform a KCPQ on a centralized machine efficiently. For this reason, this paper addresses the problem of computing the KCPQ on big spatial datasets in SpatialHadoop, an extension of Hadoop that supports spatial operations efficiently, and proposes a novel algorithm in SpatialHadoop to perform efficient parallel KCPQ on large-scale spatial datasets. We have evaluated the performance of the algorithm in several situations with big synthetic and real-world datasets. The experiments have demonstrated the efficiency and scalability of our proposal
kNN-IS: an iterative spark-based design of the k-nearest neighbors classifier for big data
The k-Nearest Neighbors classifier is a simple yet effective widely renowned method in data mining. The actual application of this model in the big data domain is not feasible due to time and memory restrictions. Several distributed alternatives based on MapReduce have been proposed to enable this method to handle large-scale data. However, their performance can be further improved with new designs that fit with newly arising technologies.
In this work we provide a new solution to perform an exact k-nearest neighbor classification based on Spark. We take advantage of its in-memory operations to classify big amounts of unseen cases against a big training dataset. The map phase computes the k-nearest neighbors in different training data splits. Afterwards, multiple reducers process the definitive neighbors from the list obtained in the map phase. The key point of this proposal lies on the management of the test set, keeping it in memory when possible. Otherwise, it is split into a minimum number of pieces, applying a MapReduce per chunk, using the caching skills of Spark to reuse the previously partitioned training set. In our experiments we study the differences between Hadoop and Spark implementations with datasets up to 11 million instances, showing the scaling-up capabilities of the proposed approach. As a result of this work an open-source Spark package is available
- …