7 research outputs found
Leveraging set relations in exact set similarity join
© 2017 VLDB. Exact set similarity join, which finds all the similar set pairs from two collections of sets, is a fundamental problem with a wide range of applications. The existing solutions for set similarity join follow a filtering-verification framework, which generates a list of candidate pairs through scanning indexes in the filtering phase, and reports those similar pairs in the verification phase. Though much research has been conducted on this problem, set relations, which we find out is quite effective on improving the algorithm effciency through computational cost sharing, have never been studied. Therefore, in this paper, instead of considering each set individually, we explore the set relations in different levels to reduce the overall computational costs. First, it has been shown that most of the computational time is spent on the filtering phase, which can be quadratic to the number of sets in the worst case for the existing solutions. Thus we explore index-level set relations to reduce the filtering cost to be linear to the size of the input while keeping the same filtering power. We achieve this by grouping related sets into blocks in the index and skipping useless index probes in joins. Second, we explore answer-level set relations to further improve the algorithm based on the intuition that if two sets are similar, their answers may have a large overlap. We derive an algorithm which incrementally generates the answer of one set from an already computed answer of another similar set rather than compute the answer from scratch to reduce the computational cost. Finally, we conduct extensive performance studies using 21 real datasets with various data properties from a wide range of domains. The experimental results demonstrate that our algorithm outperforms all the existing algorithms across all datasets and can achieve more than an order of magnitude speedup against the stateof-the-art algorithms
GB-KMV: An Augmented KMV Sketch for Approximate Containment Similarity Search
In this paper, we study the problem of approximate containment similarity
search. Given two records Q and X, the containment similarity between Q and X
with respect to Q is |Q intersect X|/ |Q|. Given a query record Q and a set of
records S, the containment similarity search finds a set of records from S
whose containment similarity regarding Q are not less than the given threshold.
This problem has many important applications in commercial and scientific
fields such as record matching and domain search. Existing solution relies on
the asymmetric LSH method by transforming the containment similarity to
well-studied Jaccard similarity. In this paper, we use a different framework by
transforming the containment similarity to set intersection. We propose a novel
augmented KMV sketch technique, namely GB-KMV, which is data-dependent and can
achieve a good trade-off between the sketch size and the accuracy. We provide a
set of theoretical analysis to underpin the proposed augmented KMV sketch
technique, and show that it outperforms the state-of-the-art technique LSH-E in
terms of estimation accuracy under practical assumption. Our comprehensive
experiments on real-life datasets verify that GB-KMV is superior to LSH-E in
terms of the space-accuracy trade-off, time-accuracy trade-off, and the sketch
construction time. For instance, with similar estimation accuracy (F-1 score),
GB-KMV is over 100 times faster than LSH-E on some real-life dataset
Leveraging Set Relations in Exact Set Similarity Join
ABSTRACT Exact set similarity join, which finds all the similar set pairs from two collections of sets, is a fundamental problem with a wide range of applications. The existing solutions for set similarity join follow a filtering-verification framework, which generates a list of candidate pairs through scanning indexes in the filtering phase, and reports those similar pairs in the verification phase. Though much research has been conducted on this problem, set relations, which we find out is quite effective on improving the algorithm efficiency through computational cost sharing, have never been studied. Therefore, in this paper, instead of considering each set individually, we explore the set relations in different levels to reduce the overall computational costs. First, it has been shown that most of the computational time is spent on the filtering phase, which can be quadratic to the number of sets in the worst case for the existing solutions. Thus we explore index-level set relations to reduce the filtering cost to be linear to the size of the input while keeping the same filtering power. We achieve this by grouping related sets into blocks in the index and skipping useless index probes in joins. Second, we explore answer-level set relations to further improve the algorithm based on the intuition that if two sets are similar, their answers may have a large overlap. We derive an algorithm which incrementally generates the answer of one set from an already computed answer of another similar set rather than compute the answer from scratch to reduce the computational cost. Finally, we conduct extensive performance studies using 21 real datasets with various data properties from a wide range of domains. The experimental results demonstrate that our algorithm outperforms all the existing algorithms across all datasets and can achieve more than an order of magnitude speedup against the stateof-the-art algorithms