1 research outputs found
A simple, efficient, parallelizable algorithm for approximated nearest neighbors
The use of the join operator in metric spaces leads to what is
known as a similarity join, where objects of two datasets are paired if they
are somehow similar. We propose an heuristic that solves the 1-NN selfsimilarity join, that is, a similarity join of a dataset with itself, that brings
together each element with its nearest neighbor within the same dataset.
Solving the problem using a simple brute-force algorithm requires O(n
2
)
distance calculations, since it requires to compare every element against all
others. We propose a simple divide-and-conquer algorithm that gives an
approximated solution for the self-similarity join that computes only O(n
3
2 )
distances. We show how the algorithm can be easily modified in order to
improve the precision up to 31% (i.e., the percentage of correctly found
1-NNs) and such that 79% of the results are within the 10-NN, with no
significant extra distance computations. We present how the algorithm can
be executed in parallel and prove that using Θ(
√
n) processors, the total
execution takes linear time. We end discussing ways in which the algorithm
can be improved in the future