
    Efficiently Computing Exact Set Similarity Joins

    Set similarity join, which finds all similar pairs of sets across two collections of sets, is a fundamental problem with a wide range of applications, including data cleaning, information extraction, and machine learning. In this thesis, we study three important problems regarding set similarity join.

    First, we develop algorithms for the static exact set similarity join problem. Existing solutions for exact set similarity join follow a filtering-verification framework, which generates a list of candidate pairs by scanning indexes in the filtering phase and reports the pairs that are actually similar in the verification phase. Although much research has been conducted on this problem, set relations have not been well studied as a way to improve algorithm efficiency through computational cost sharing. We therefore explore set relations at different levels to reduce the overall computational cost. It has been shown that most of the computational time is spent in the filtering phase, so we explore index-level set relations to reduce the filtering cost while keeping the same filtering power. We achieve this by grouping related sets into blocks in the index and skipping useless index probes during the join. We also explore answer-level set relations to further improve the algorithm, based on the intuition that if two sets are similar, their answers may have a large overlap. We conduct extensive performance studies using 21 real datasets with various data properties from a wide range of domains. The experimental results demonstrate that our algorithm outperforms all the existing algorithms across all datasets.

    Second, we study the dynamic set similarity join problem, since data are usually updated dynamically in real applications. We found no previous study of this problem. We design efficient algorithms that incrementally update the join result when any element in the sets is updated, and we analyze the computational cost of the proposed algorithms. Comprehensive experiments on real-life datasets validate the effectiveness and efficiency of our algorithms.

    Finally, we investigate parallel set similarity join on multicore CPUs to support set similarity join on large datasets. We found no existing research on solving the set similarity join problem under Jaccard or Cosine similarity on a single machine with multicore CPUs. We propose record-level and element-level parallel algorithms and conduct experiments on large real-life datasets to show the effectiveness and scalability of our algorithms.
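To make the filtering-verification framework mentioned in the abstract concrete, here is a minimal Python sketch of a generic exact self-join under Jaccard similarity. It is not the thesis's algorithm: it uses a plain inverted index with the standard prefix filter, and it does not include the block-based index or answer-level sharing described above. The names `jaccard` and `similarity_self_join` are hypothetical, chosen only for this illustration.

```python
import math
from collections import defaultdict


def jaccard(a, b):
    """Exact Jaccard similarity of two sets."""
    union = len(a | b)
    return len(a & b) / union if union else 1.0


def similarity_self_join(records, threshold):
    """Return sorted (i, j, sim) for all pairs i < j with Jaccard >= threshold.

    Illustrative sketch only (hypothetical helper, not the thesis's method).
    Empty records never share a token, so they never become candidates.
    """
    if not 0 < threshold <= 1:
        raise ValueError("threshold must be in (0, 1]")
    sets = [set(r) for r in records]

    # Global token order: rare tokens first, ties broken by repr.
    freq = defaultdict(int)
    for s in sets:
        for tok in s:
            freq[tok] += 1
    ordered = [sorted(s, key=lambda tok: (freq[tok], repr(tok))) for s in sets]

    index = defaultdict(list)  # token -> ids of earlier records with it in their prefix
    results = []
    for i, tokens in enumerate(ordered):
        n = len(tokens)
        # Prefix filter: two sets with Jaccard >= threshold must share a token
        # within their first n - ceil(threshold * n) + 1 tokens. The small
        # epsilon only ever lengthens the prefix, so no true pair is lost.
        prefix_len = n - math.ceil(threshold * n - 1e-9) + 1 if n else 0
        prefix = tokens[:prefix_len]

        # Filtering phase: probe the inverted index to collect candidates.
        candidates = set()
        for tok in prefix:
            candidates.update(index[tok])

        # Verification phase: compute the exact similarity of each candidate.
        for j in sorted(candidates):
            sim = jaccard(sets[i], sets[j])
            if sim >= threshold:
                results.append((j, i, sim))

        # Index this record's prefix so later records can find it.
        for tok in prefix:
            index[tok].append(i)
    return sorted(results)
```

For example, calling `similarity_self_join([{"a", "b", "c", "d"}, {"a", "b", "c", "e"}, {"x", "y"}, {"a", "b", "c", "d"}], 0.5)` probes the index only for each record's prefix tokens and then verifies the surviving candidates exactly, which is the cost split between the two phases that the abstract refers to.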