225 research outputs found

    A Memory-Efficient Sketch Method for Estimating High Similarities in Streaming Sets

    Estimating set similarity and detecting highly similar sets are fundamental problems in areas such as databases, machine learning, and information retrieval. MinHash is a well-known technique for approximating the Jaccard similarity of sets and has been used successfully in many applications such as similarity search and large-scale learning. Its two compressed versions, b-bit MinHash and Odd Sketch, can significantly reduce the memory usage of the original MinHash method, especially for estimating high similarities (i.e., similarities around 1). Although MinHash can be applied to static sets as well as streaming sets, whose elements arrive in a streaming fashion and whose cardinality is unknown or even infinite, b-bit MinHash and Odd Sketch unfortunately fail to handle streaming data. To solve this problem, we design a memory-efficient sketch method, MaxLogHash, to accurately estimate Jaccard similarities in streaming sets. Compared to MinHash, our method uses smaller registers (each register consists of fewer than 7 bits) to build a compact sketch for each set. We also provide a simple yet accurate estimator for inferring Jaccard similarity from MaxLogHash sketches. In addition, we derive formulas for bounding the estimation error and determine the smallest necessary memory usage (i.e., the number of registers used for a MaxLogHash sketch) for the desired accuracy. We conduct experiments on a variety of datasets, and the results show that MaxLogHash is about 5 times more memory-efficient than MinHash at the same accuracy and computational cost for estimating high similarities.
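
    For context on what MaxLogHash compresses, here is a minimal plain-MinHash Jaccard estimator for streaming sets; this is the baseline the paper improves on, not the MaxLogHash construction itself. The register count k and the salted tuple hashing are illustrative choices.

```python
import random

class MinHashSketch:
    """Plain MinHash baseline: k registers, each storing the minimum hash
    value seen so far under one of k independently salted hash functions."""

    def __init__(self, k=128, seed=1):
        rng = random.Random(seed)
        # Random salts stand in for independent hash functions (illustrative).
        self._salts = [rng.getrandbits(64) for _ in range(k)]
        self.mins = [float("inf")] * k

    def update(self, element):
        """Process one streaming element; the set cardinality need not be known."""
        for i, salt in enumerate(self._salts):
            h = hash((salt, element))
            if h < self.mins[i]:
                self.mins[i] = h

    def jaccard(self, other):
        """Estimate Jaccard similarity as the fraction of matching registers."""
        matches = sum(a == b for a, b in zip(self.mins, other.mins))
        return matches / len(self.mins)


# Two overlapping streams: true Jaccard = 900 / 1100 ≈ 0.82.
a, b = MinHashSketch(), MinHashSketch()
for x in range(0, 1000):
    a.update(x)
for x in range(100, 1100):
    b.update(x)
print(a.jaccard(b))  # close to 0.82, up to sampling error of the 128 registers
```

    Each register here is a full machine-word minimum; the point of MaxLogHash, per the abstract, is that for streaming sets registers of fewer than 7 bits suffice for accurate estimates of high similarities.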

    Author recognition using Locality Sensitive Hashing & Alergia (Stochastic Finite Automata)

    In today’s world, data grows very fast, which makes it difficult to answer questions such as: 1) Was this content written entirely by this author? 2) Did the author take a few sentences or pages from another author? 3) Is there any way to identify the actual author? Many plagiarism tools on the market identify duplicate content, but they do not understand the writing patterns involved, so there is still a need to find the original author. Locality-sensitive hashing is one standard technique for applying hashing to recognize an author's writing pattern.
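
    The abstract gives no implementation details, so the following is only a hedged illustration of the hashing side of such a pipeline: two writing samples are reduced to character n-gram shingle sets and compared by Jaccard similarity, the quantity a MinHash/LSH index would approximate at scale. The shingle length and the sample texts are made up for the example.

```python
def shingles(text, n=4):
    """Character n-gram shingles of a text sample (n=4 is an arbitrary choice)."""
    text = " ".join(text.lower().split())  # normalize case and whitespace
    return {text[i:i + n] for i in range(len(text) - n + 1)}

def jaccard(a, b):
    """Jaccard similarity of two shingle sets."""
    return len(a & b) / len(a | b) if (a or b) else 0.0

# Hypothetical writing samples: one of known authorship, one disputed.
known_sample = "The fog settled over the harbour while the bells kept ringing."
disputed_text = "A fog settled over the harbour as the bells kept on ringing."
print(jaccard(shingles(known_sample), shingles(disputed_text)))
```

    In practice one would bucket MinHash signatures of these shingle sets with an LSH index rather than compare all pairs directly; the Alergia / stochastic-finite-automaton side of the title is not sketched here.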

    Hardness of Bichromatic Closest Pair with Jaccard Similarity

    Consider collections $\mathcal{A}$ and $\mathcal{B}$ of red and blue sets, respectively. Bichromatic Closest Pair is the problem of finding a pair from $\mathcal{A}\times \mathcal{B}$ that has similarity higher than a given threshold according to some similarity measure. Our focus here is the classic Jaccard similarity $|\textbf{a}\cap \textbf{b}|/|\textbf{a}\cup \textbf{b}|$ for $(\textbf{a},\textbf{b})\in \mathcal{A}\times \mathcal{B}$. We consider the approximate version of the problem where we are given thresholds $j_1>j_2$ and wish to return a pair from $\mathcal{A}\times \mathcal{B}$ that has Jaccard similarity higher than $j_2$ if there exists a pair in $\mathcal{A}\times \mathcal{B}$ with Jaccard similarity at least $j_1$. The classic locality-sensitive hashing (LSH) algorithm of Indyk and Motwani (STOC '98), instantiated with the MinHash LSH function of Broder et al., solves this problem in $\tilde O(n^{2-\delta})$ time if $j_1\ge j_2^{1-\delta}$. In particular, for $\delta=\Omega(1)$, the approximation ratio $j_1/j_2=1/j_2^{\delta}$ increases polynomially in $1/j_2$. In this paper we give a corresponding hardness result. Assuming the Orthogonal Vectors Conjecture (OVC), we show that there cannot be a general solution that solves the Bichromatic Closest Pair problem in $O(n^{2-\Omega(1)})$ time for $j_1/j_2=1/j_2^{o(1)}$. Specifically, assuming OVC, we prove that for any $\delta>0$ there exists an $\varepsilon>0$ such that Bichromatic Closest Pair with Jaccard similarity requires time $\Omega(n^{2-\delta})$ for any choice of thresholds $j_2<j_1<1-\delta$ that satisfy $j_1\le j_2^{1-\varepsilon}$.
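
    For the upper bound quoted above, the arithmetic is the standard Indyk–Motwani LSH analysis (not a contribution of this paper): a single MinHash function collides on two sets with probability equal to their Jaccard similarity, so with collision-probability thresholds $p_1 = j_1$ and $p_2 = j_2$ the LSH exponent is
    $$\rho = \frac{\log(1/j_1)}{\log(1/j_2)},$$
    giving total time $\tilde O(n^{1+\rho})$. If $j_1 \ge j_2^{1-\delta}$, then $\log(1/j_1) \le (1-\delta)\log(1/j_2)$, hence $\rho \le 1-\delta$ and the running time is $\tilde O(n^{2-\delta})$, which is the bound the hardness result complements.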

    Analysis of SparseHash: an efficient embedding of set-similarity via sparse projections

    Embeddings provide compact representations of signals in order to perform efficient inference in a wide variety of tasks. In particular, random projections are common tools to construct Euclidean distance-preserving embeddings, while hashing techniques are extensively used to embed set-similarity metrics, such as the Jaccard coefficient. In this letter, we theoretically prove that a class of random projections based on sparse matrices, called SparseHash, can preserve the Jaccard coefficient between the supports of sparse signals, which can be used to estimate set similarities. Beyond the analysis, we provide an efficient implementation and test its performance in several numerical experiments on both synthetic and real datasets.
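
    To make the preserved quantity concrete, the snippet below simply computes the Jaccard coefficient between the supports of two sparse signals directly; it is not the SparseHash embedding itself, and the signal sizes, sparsity levels, and zero threshold are illustrative.

```python
import numpy as np

def support_jaccard(x, y, tol=1e-12):
    """Jaccard coefficient between the supports of two (sparse) signals.
    tol is an illustrative threshold for treating an entry as zero."""
    sx = np.abs(x) > tol
    sy = np.abs(y) > tol
    inter = np.logical_and(sx, sy).sum()
    union = np.logical_or(sx, sy).sum()
    return inter / union if union else 0.0

# Two sparse signals with partially overlapping supports (toy data).
rng = np.random.default_rng(0)
x = np.zeros(1000)
y = np.zeros(1000)
idx = rng.choice(1000, size=60, replace=False)
x[idx[:40]] = rng.standard_normal(40)   # support of x: first 40 chosen indices
y[idx[20:]] = rng.standard_normal(40)   # support of y: last 40 (20 shared)
print(support_jaccard(x, y))            # 20 / 60 ≈ 0.33
```

    A SparseHash-style embedding would aim to estimate this value from short sketches of x and y rather than from the full signals.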

    Scalable and Robust Set Similarity Join

    Set similarity join is a fundamental and well-studied database operator. It is usually studied in the exact setting, where the goal is to compute all pairs of sets that exceed a given similarity threshold (measured, e.g., as Jaccard similarity). But set similarity join is often used in settings where 100% recall may not be important; indeed, the exact set similarity join is itself only an approximation of the desired result set. We present a new randomized algorithm for set similarity join that can achieve any desired recall up to 100%, and show theoretically and empirically that it significantly improves on existing methods. The current state-of-the-art exact methods are based on prefix filtering, whose performance depends on the data set having many rare tokens; our method is robust to the absence of such structure in the data. At 90% recall our algorithm is often more than an order of magnitude faster than state-of-the-art exact methods, depending on how well a data set lends itself to prefix filtering. Our experiments on benchmark data sets also show that the method is several times faster than comparable approximate methods. Our algorithm makes use of recent theoretical advances in high-dimensional sketching and indexing that we believe to be of wider relevance to the data engineering community.
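
    The paper's algorithm is not reproduced here; as a generic point of reference only, the sketch below shows the classic MinHash-LSH recipe for an approximate set similarity join: banding for candidate generation, then exact Jaccard verification. The band and row counts are illustrative knobs that trade recall against candidate volume.

```python
import random
from collections import defaultdict

def minhash_signature(s, salts):
    """MinHash signature of a non-empty set s: one minimum per salted hash."""
    return [min(hash((salt, x)) for x in s) for salt in salts]

def approximate_similarity_join(sets, threshold=0.5, bands=16, rows=4, seed=7):
    """Candidate generation via LSH banding, then exact Jaccard verification.
    Generic MinHash-LSH join for illustration only (not the paper's method)."""
    rng = random.Random(seed)
    salts = [rng.getrandbits(64) for _ in range(bands * rows)]
    signatures = {key: minhash_signature(s, salts) for key, s in sets.items()}

    candidates = set()
    for b in range(bands):
        buckets = defaultdict(list)
        for key, sig in signatures.items():
            buckets[tuple(sig[b * rows:(b + 1) * rows])].append(key)
        for keys in buckets.values():
            for i in range(len(keys)):
                for j in range(i + 1, len(keys)):
                    candidates.add((keys[i], keys[j]))

    # Verification: keep only pairs whose exact Jaccard meets the threshold.
    return [(a, b) for a, b in candidates
            if len(sets[a] & sets[b]) / len(sets[a] | sets[b]) >= threshold]

# Toy usage: doc1 and doc2 are highly similar, doc3 is unrelated.
sets = {
    "doc1": set(range(0, 100)),
    "doc2": set(range(10, 110)),
    "doc3": set(range(500, 600)),
}
print(approximate_similarity_join(sets, threshold=0.5))  # expect [('doc1', 'doc2')]
```

    More bands with fewer rows per band raise recall at the cost of more candidates to verify; the paper's contribution is an algorithm that hits a target recall robustly without relying on prefix-filtering-friendly data.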
    • …