Selectivity estimation on set containment search
© Springer Nature Switzerland AG 2019. In this paper, we study the problem of selectivity estimation on set containment search. Given a query record Q and a record dataset S, we aim to accurately and efficiently estimate the selectivity of the set containment search of Q over S. The problem has many important applications in commercial fields and scientific studies. To the best of our knowledge, this is the first work to study this important problem. We first extend existing distinct-value estimation techniques to solve this problem and develop IL-GKMV, an approach based on inverted lists and the G-KMV sketch. Our analysis shows that the performance of IL-GKMV degrades as the vocabulary size increases. Motivated by the limitations of existing techniques and the inherent challenges of the problem, we develop effective and efficient sampling approaches and propose OT-Sampling, a sampling approach based on an ordered trie structure. OT-Sampling partitions records based on element frequency and occurrence patterns and is significantly more accurate than both simple random sampling and IL-GKMV. To further enhance performance, we present DC-Sampling, a divide-and-conquer sampling approach that uses an inclusion/exclusion prefix to exploit pruning opportunities. We theoretically analyse the proposed techniques with respect to various accuracy estimators. Our comprehensive experiments on 6 real datasets verify the effectiveness and efficiency of the proposed techniques.
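As a point of reference, the simple random sampling baseline mentioned in the abstract can be sketched roughly as follows. This is an illustrative sketch only, not the paper's IL-GKMV, OT-Sampling, or DC-Sampling. The function names, the `matches` parameter, and the default containment direction (a record matches when it is a subset of the query) are assumptions made here; the abstract does not state the direction.

```python
import random


def record_in_query(query, record):
    # Assumed containment direction: the record is contained in the query.
    # Pass a different predicate to estimate_selectivity for the other direction.
    return record <= query


def estimate_selectivity(query, records, sample_size, matches=record_in_query, seed=0):
    """Estimate how many records satisfy the containment predicate
    by scaling up the hit count observed in a uniform random sample."""
    if not records:
        return 0.0
    rng = random.Random(seed)  # fixed seed so repeated runs give the same sample
    sample = rng.sample(records, min(sample_size, len(records)))
    hits = sum(1 for record in sample if matches(query, record))
    # Scale the sample hit count to the size of the full dataset.
    return hits * len(records) / len(sample)


# Hypothetical toy dataset, for illustration only.
S = [frozenset({1, 2}), frozenset({2, 3}), frozenset({1}), frozenset({4})]
Q = frozenset({1, 2, 3})

full = estimate_selectivity(Q, S, sample_size=len(S))  # sample covers all of S
partial = estimate_selectivity(Q, S, sample_size=2)    # estimate from half of S
```

When the sample covers the whole dataset the estimate equals the exact count; with smaller samples the estimate can vary widely, which is the kind of limitation that motivates more structured sampling.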
GB-KMV: An Augmented KMV Sketch for Approximate Containment Similarity Search
In this paper, we study the problem of approximate containment similarity
search. Given two records Q and X, the containment similarity between Q and X
with respect to Q is |Q ∩ X| / |Q|. Given a query record Q and a set of
records S, containment similarity search finds the records in S
whose containment similarity with respect to Q is not less than a given threshold.
This problem has many important applications in commercial and scientific
fields such as record matching and domain search. The existing solution relies on
the asymmetric LSH method, which transforms containment similarity into the
well-studied Jaccard similarity. In this paper, we use a different framework by
transforming the containment similarity to set intersection. We propose a novel
augmented KMV sketch technique, namely GB-KMV, which is data-dependent and can
achieve a good trade-off between the sketch size and the accuracy. We provide
theoretical analysis to underpin the proposed augmented KMV sketch
technique, and show that it outperforms the state-of-the-art technique LSH-E in
terms of estimation accuracy under practical assumptions. Our comprehensive
experiments on real-life datasets verify that GB-KMV is superior to LSH-E in
terms of the space-accuracy trade-off, time-accuracy trade-off, and the sketch
construction time. For instance, with similar estimation accuracy (F1 score),
GB-KMV is over 100 times faster than LSH-E on some real-life datasets.
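
The containment similarity defined in the abstract, and the threshold search built on it, can be illustrated with a small exact computation. This is a minimal sketch of the definition only, not the GB-KMV sketch or LSH-E, both of which approximate this result. The function names and the toy records are assumptions introduced here.

```python
def containment_similarity(query, record):
    """Return |Q ∩ X| / |Q| for a query set Q and a record set X."""
    if not query:
        raise ValueError("query record must be non-empty")
    return len(query & record) / len(query)


def containment_search(query, records, threshold):
    """Return the records whose containment similarity with respect to
    the query is not less than the threshold (exact linear scan)."""
    return [r for r in records if containment_similarity(query, r) >= threshold]


# Hypothetical toy records, for illustration only.
Q = {"a", "b", "c", "d"}
S = [{"a", "b", "x"}, {"a"}, {"a", "b", "c", "d", "e"}]

sim = containment_similarity(Q, S[0])            # 2 shared elements out of 4
result = containment_search(Q, S, threshold=0.5)
```

The measure is asymmetric: it is normalised by |Q| only, so a large record that covers the whole query scores 1 regardless of how many extra elements it holds.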