
    Engineering a Simplified 0-Bit Consistent Weighted Sampling

    The Min-Hashing approach to sketching has become an important tool in data analysis, information retrieval, and classification. To apply it to real-valued datasets, the ICWS algorithm has become the seminal approach: it is widely used and provides state-of-the-art performance for this problem space. However, ICWS suffers from a computational burden that grows with the sketch size K. We develop a new Simplified version of the ICWS algorithm that enables us to obtain over 20x speedups compared to the standard algorithm. The approach is validated empirically on multiple datasets and scenarios, showing that our new Simplified CWS obtains the same quality of results while being an order of magnitude faster.
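
    The paper's Simplified variant is not spelled out in the abstract, but the baseline it accelerates is well documented. For context, below is a minimal, unoptimized Python rendition of standard ICWS (Ioffe, 2010); the function name, dict-based input, and per-sample seeding scheme are illustrative assumptions, not the paper's code. The 0-bit variant (Li, 2015) simply keeps only the element of each sample and drops t.

```python
import numpy as np

def icws_sketch(weights, K, seed=0):
    """Minimal ICWS (Ioffe, 2010): draw K consistent weighted samples
    from a dict mapping element -> positive weight. 0-bit CWS keeps
    only the element from each (element, t) pair."""
    sketch = []
    for k in range(K):
        best_a, best = np.inf, None
        for elem, w in weights.items():
            # Consistency: the randomness for (k, elem) is fixed, so the
            # same element draws the same variates in every set. Integer
            # element ids keep Python's tuple hash deterministic.
            rng = np.random.default_rng(hash((seed, k, elem)) % 2**32)
            r = rng.gamma(2.0, 1.0)       # r ~ Gamma(2, 1)
            c = rng.gamma(2.0, 1.0)       # c ~ Gamma(2, 1)
            beta = rng.uniform(0.0, 1.0)  # beta ~ Uniform(0, 1)
            t = np.floor(np.log(w) / r + beta)
            y = np.exp(r * (t - beta))
            a = c / (y * np.exp(r))
            if a < best_a:
                best_a, best = a, (elem, t)
        sketch.append(best)
    return sketch

# Matching samples estimate the generalized (weighted) Jaccard similarity.
A = {0: 2.5, 1: 1.0, 2: 0.5}
B = {0: 2.0, 1: 1.5, 3: 1.0}
K = 256
matches = sum(a == b for a, b in zip(icws_sketch(A, K), icws_sketch(B, K)))
print(matches / K)  # approx. sum(min(w)) / sum(max(w)) = 3.0 / 5.5
```

    The inner loop makes the cost of this baseline plain: the work per sketch entry scales with both K and the number of nonzero weights, which is exactly the burden that speedup efforts target.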

    Consistent Weighted Sampling Made Fast, Small, and Easy

    Document sketching using Jaccard similarity has been an effective technique for reducing near-duplicates in Web page and image search results, and has also proven useful in file system synchronization, compression, and learning applications. Min-wise sampling yields an unbiased estimator for Jaccard similarity, and taking a few hundred independent consistent samples leads to compact sketches that provide good estimates of pairwise similarity. Subsequent works extended this technique to weighted sets and showed how to produce samples with only a constant number of hash evaluations per element, independent of its weight. Another improvement, by Li et al., shows how to speed up sketch computation by producing many (near-)independent samples in one shot; unfortunately, this improvement works only for the unweighted case. In this paper we give a simple, fast, and accurate procedure which reduces weighted sets to unweighted sets with small impact on the Jaccard similarity. This leads to compact sketches consisting of many (near-)independent weighted samples which can be computed with just a small constant number of hash function evaluations per weighted element. The size of the produced unweighted set is furthermore a tunable parameter, which enables us to run the unweighted scheme of Li et al. in the regime where it is most efficient. Even when the sets involved are unweighted, our approach gives a simple solution to the densification problem that other works have attempted to address. Unlike previously known schemes, ours does not yield an unbiased estimator; however, we prove that the bias introduced by our reduction is negligible and that the standard deviation is comparable to the unweighted case. We also evaluate our scheme empirically and show that it gives significant gains in computational efficiency, without any measurable loss in accuracy.
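
    The abstract does not give the reduction itself, but the core idea of trading a weighted set for an unweighted one can be pictured with a naive discretization, sketched below under stated assumptions: the `scale` knob and the consistent per-copy coins are mine, and the paper's scheme achieves the same effect with only a constant number of hash evaluations per element, which this blow-every-element-up toy does not.

```python
import hashlib
import math

def _coin(elem, j):
    """Consistent uniform coin in [0, 1) for copy j of an element:
    the same (elem, j) gets the same coin in every set."""
    d = hashlib.blake2b(f"{elem}:{j}".encode(), digest_size=8).digest()
    return int.from_bytes(d, "big") / 2**64

def to_unweighted(weights, scale=64):
    """Naive weighted -> unweighted reduction: an element with weight w
    becomes roughly scale*w unweighted copies, with the boundary copy
    kept by a consistent coin so the rounding agrees across sets.
    Plain Jaccard on the outputs approximates weighted Jaccard on
    the inputs."""
    out = set()
    for elem, w in weights.items():
        for j in range(math.ceil(scale * w)):
            # Copy j survives iff the coin falls below the remaining mass;
            # the threshold is monotone in w, so a heavier copy set
            # contains the lighter one for the same element.
            if _coin(elem, j) < min(1.0, scale * w - j):
                out.add((elem, j))
    return out

A = {"x": 2.5, "y": 1.0}
B = {"x": 2.0, "y": 1.5}
ra, rb = to_unweighted(A), to_unweighted(B)
print(len(ra & rb) / len(ra | rb))  # approx. 3.0 / 4.0 for these weights
```

    Any unweighted scheme, such as the one of Li et al., can then be run on the reduced sets; a larger `scale` shrinks the discretization error at the cost of a bigger reduced set, which mirrors the tunable-size parameter the abstract describes.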

    A Memory-Efficient Sketch Method for Estimating High Similarities in Streaming Sets

    Estimating set similarity and detecting highly similar sets are fundamental problems in areas such as databases, machine learning, and information retrieval. MinHash is a well-known technique for approximating the Jaccard similarity of sets and has been successfully applied to problems such as similarity search and large-scale learning. Its two compressed versions, b-bit MinHash and Odd Sketch, can significantly reduce the memory usage of the original MinHash method, especially for estimating high similarities (i.e., similarities around 1). Although MinHash applies to static sets as well as streaming sets, whose elements arrive in a streaming fashion and whose cardinality is unknown or even infinite, b-bit MinHash and Odd Sketch unfortunately cannot handle streaming data. To solve this problem, we design a memory-efficient sketch method, MaxLogHash, to accurately estimate Jaccard similarities in streaming sets. Compared to MinHash, our method uses smaller registers (each of fewer than 7 bits) to build a compact sketch for each set. We also provide a simple yet accurate estimator for inferring the Jaccard similarity from MaxLogHash sketches. In addition, we derive formulas for bounding the estimation error and for determining the smallest memory usage (i.e., the number of registers in a MaxLogHash sketch) needed for a desired accuracy. Experiments on a variety of datasets show that MaxLogHash is about 5 times more memory-efficient than MinHash at the same accuracy and computational cost for estimating high similarities.
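
    To make the small-register claim concrete, here is a rough Python illustration of the max-log idea as I read it: each register stores the stream's running maximum of floor(-log2 h_j(e)), a value that almost never exceeds 64 and therefore fits in under 7 bits. This shows only the sketch-building step; the paper's full MaxLogHash also keeps a per-register indicator bit and derives a principled estimator, whereas the register-agreement rate used below is just a crude proxy.

```python
import hashlib
import math

def _h(j, elem):
    """Deterministic hash of (register j, element) to a float in (0, 1]."""
    d = hashlib.blake2b(f"{j}:{elem}".encode(), digest_size=8).digest()
    return (int.from_bytes(d, "big") + 1) / 2**64

def maxlog_sketch(stream, k=128):
    """One pass over a stream: register j keeps the maximum of
    floor(-log2 h_j(e)) over all elements seen. These geometric-like
    maxima rarely exceed 64, so each register needs only ~6-7 bits,
    versus 32+ bits for a raw MinHash value."""
    regs = [0] * k
    for elem in stream:
        for j in range(k):
            regs[j] = max(regs[j], int(-math.log2(_h(j, elem))))
    return regs

# Crude proxy for similarity: the fraction of agreeing registers.
# (The paper's estimator corrects this using its per-register indicator.)
sa = maxlog_sketch(["a", "b", "c", "d"])
sb = maxlog_sketch(["a", "b", "c", "e"])
print(sum(x == y for x, y in zip(sa, sb)) / len(sa))
```

    Because each update only folds a new element into a running maximum, the sketch never needs the set's cardinality up front, which is what makes it suitable for streams where b-bit MinHash and Odd Sketch are not.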