Scalable and Robust Set Similarity Join
Set similarity join is a fundamental and well-studied database operator. It
is usually studied in the exact setting where the goal is to compute all pairs
of sets that exceed a given similarity threshold (measured e.g. as Jaccard
similarity). But set similarity join is often used in settings where 100%
recall may not be important --- indeed, where the exact set similarity join is
itself only an approximation of the desired result set.
We present a new randomized algorithm for set similarity join that can
achieve any desired recall up to 100%, and show theoretically and empirically
that it significantly improves on existing methods. The present
state-of-the-art exact methods are based on prefix-filtering, the performance
of which depends on the data set having many rare tokens. Our method is robust
against the absence of such structure in the data. At 90% recall our algorithm
is often more than an order of magnitude faster than state-of-the-art exact
methods, depending on how well a data set lends itself to prefix filtering. Our
experiments on benchmark data sets also show that the method is several times
faster than comparable approximate methods. Our algorithm makes use of recent
theoretical advances in high-dimensional sketching and indexing that we believe
to be of wider relevance to the data engineering community.
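For reference, the exact problem the abstract describes can be stated as a brute-force baseline: report every pair of sets whose Jaccard similarity meets a given threshold. This is a minimal illustrative sketch of the problem definition only, not the paper's randomized algorithm or its prefix-filtering competitors.

```python
# Brute-force exact set similarity join with a Jaccard threshold.
# Illustrates the problem definition; quadratic in the number of sets.
from itertools import combinations

def jaccard(a, b):
    """Jaccard similarity |a ∩ b| / |a ∪ b| of two sets."""
    return len(a & b) / len(a | b)

def similarity_join(sets, threshold):
    """Return all index pairs (i, j), i < j, with Jaccard >= threshold."""
    return [(i, j)
            for (i, a), (j, b) in combinations(enumerate(sets), 2)
            if jaccard(a, b) >= threshold]

sets = [{1, 2, 3}, {2, 3, 4}, {7, 8}]
print(similarity_join(sets, 0.5))  # [(0, 1)]
```

An approximate method with, say, 90% recall may return only a subset of these pairs in exchange for the speedups the abstract reports.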
Efficient Processing of k Nearest Neighbor Joins using MapReduce
The k nearest neighbor join (kNN join), designed to find the k nearest neighbors from
a dataset S for every object in another dataset R, is a primitive operation
widely adopted by many data mining applications. As a combination of the k
nearest neighbor query and the join operation, kNN join is an expensive
operation. Given the increasing volume of data, it is difficult to perform a
kNN join on a centralized machine efficiently. In this paper, we investigate
how to perform kNN join using MapReduce which is a well-accepted framework for
data-intensive applications over clusters of computers. In brief, the mappers
cluster objects into groups; the reducers perform the kNN join on each group of
objects separately. We design an effective mapping mechanism that exploits
pruning rules for distance filtering, and hence reduces both the shuffling and
computational costs. To reduce the shuffling cost, we propose two approximate
algorithms to minimize the number of replicas. Extensive experiments on our
in-house cluster demonstrate that our proposed methods are efficient, robust
and scalable. Comment: VLDB201
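The map/reduce division of labor the abstract describes can be sketched on a single machine: mappers assign each object to a group, and reducers compute the kNN join within each group. The pivot-based grouping and all names below are illustrative assumptions, not the paper's exact mapping mechanism.

```python
# Illustrative single-machine sketch of a grouped kNN join:
# "mappers" assign 1-D points to their nearest pivot's group,
# "reducers" compute, per group, the k nearest S-points for each R-point.
import heapq
from collections import defaultdict

def mapper(point, pivots):
    # assign each point to the group of its nearest pivot
    return min(range(len(pivots)), key=lambda i: abs(point - pivots[i]))

def reducer(r_points, s_points, k):
    # per-group kNN join: for each r, its k nearest points of S
    return {r: heapq.nsmallest(k, s_points, key=lambda s: abs(s - r))
            for r in r_points}

def knn_join(R, S, pivots, k):
    groups_r, groups_s = defaultdict(list), defaultdict(list)
    for r in R:
        groups_r[mapper(r, pivots)].append(r)
    for s in S:
        groups_s[mapper(s, pivots)].append(s)
    result = {}
    for g in groups_r:
        result.update(reducer(groups_r[g], groups_s.get(g, []), k))
    return result

print(knn_join([1, 10], [0, 2, 9, 11], pivots=[0, 10], k=2))
# {1: [0, 2], 10: [9, 11]}
```

Note that this naive sketch misses true neighbors that fall in a different group; replicating boundary objects across groups, and pruning replicas by distance bounds, is exactly the shuffling-versus-correctness trade-off the paper's algorithms address.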
Streaming Similarity Self-Join
We introduce and study the problem of computing the similarity self-join in a
streaming context (SSSJ), where the input is an unbounded stream of items
arriving continuously. The goal is to find all pairs of items in the stream
whose similarity is greater than a given threshold. The simplest formulation of
the problem requires unbounded memory and is thus intractable. To make the
problem feasible, we introduce the notion of time-dependent similarity: the
similarity of two items decreases with the difference in their arrival time. By
leveraging the properties of this time-dependent similarity function, we design
two algorithmic frameworks to solve the SSSJ problem. The first one, MiniBatch
(MB), uses existing index-based filtering techniques for the static version of
the problem, and combines them in a pipeline. The second framework, Streaming
(STR), adds time filtering to the existing indexes, and integrates new
time-based bounds deeply in the working of the algorithms. We also introduce a
new indexing technique (L2), which is based on an existing state-of-the-art
indexing technique (L2AP), but is optimized for the streaming case. Extensive
experiments show that the STR algorithm, when instantiated with the L2 index,
is the most scalable option across a wide array of datasets and parameters.
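The key idea that bounds memory is the time-dependent similarity: the effective similarity of two items shrinks as their arrival times drift apart, so sufficiently old items can never match new ones and may be evicted. A minimal sketch of that idea, assuming cosine similarity and an exponential decay (both the decay form and the parameter names are assumptions, not the paper's definition):

```python
# Time-dependent similarity: a base similarity damped by the
# difference in arrival times. Exponential decay is an assumption
# chosen for illustration.
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def time_dependent_sim(u, t_u, v, t_v, decay=0.1):
    """Similarity that strictly decreases with |t_u - t_v|."""
    return cosine(u, v) * math.exp(-decay * abs(t_u - t_v))
```

Since cosine similarity is at most 1, once exp(-decay * dt) falls below the join threshold, no item older than dt can still produce a match; an index can therefore discard such items, which is what keeps memory bounded in the streaming setting.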