225 research outputs found

    A Memory-Efficient Sketch Method for Estimating High Similarities in Streaming Sets

    Estimating set similarity and detecting highly similar sets are fundamental problems in areas such as databases, machine learning, and information retrieval. MinHash is a well-known technique for approximating the Jaccard similarity of sets and has been used successfully in many applications such as similarity search and large-scale learning. Its two compressed versions, b-bit MinHash and Odd Sketch, can significantly reduce the memory usage of the original MinHash method, especially for estimating high similarities (i.e., similarities around 1). Although MinHash can be applied to static sets as well as streaming sets, whose elements arrive in a streaming fashion and whose cardinality is unknown or even infinite, b-bit MinHash and Odd Sketch unfortunately fail to handle streaming data. To solve this problem, we design a memory-efficient sketch method, MaxLogHash, to accurately estimate Jaccard similarities in streaming sets. Compared to MinHash, our method uses smaller registers (each register consists of fewer than 7 bits) to build a compact sketch for each set. We also provide a simple yet accurate estimator for inferring Jaccard similarity from MaxLogHash sketches. In addition, we derive formulas for bounding the estimation error and determine the smallest necessary memory usage (i.e., the number of registers used for a MaxLogHash sketch) for the desired accuracy. We conduct experiments on a variety of datasets, and the results show that MaxLogHash is about 5 times more memory-efficient than MinHash at the same accuracy and computational cost for estimating high similarities.
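
    For context on what MaxLogHash compresses, here is a minimal plain-MinHash Jaccard estimator for streaming sets; this is the baseline the paper improves on, not the MaxLogHash construction itself. The register count k and the salted tuple hashing are illustrative choices.

```python
import random

class MinHashSketch:
    """Plain MinHash baseline: k registers, each storing the minimum hash
    value seen so far under one of k independently salted hash functions."""

    def __init__(self, k=128, seed=1):
        rng = random.Random(seed)
        # Random salts stand in for independent hash functions (illustrative).
        self._salts = [rng.getrandbits(64) for _ in range(k)]
        self.mins = [float("inf")] * k

    def update(self, element):
        """Process one streaming element; the set cardinality need not be known."""
        for i, salt in enumerate(self._salts):
            h = hash((salt, element))
            if h < self.mins[i]:
                self.mins[i] = h

    def jaccard(self, other):
        """Estimate Jaccard similarity as the fraction of matching registers."""
        matches = sum(a == b for a, b in zip(self.mins, other.mins))
        return matches / len(self.mins)


# Two overlapping streams: true Jaccard = 900 / 1100 ≈ 0.82.
a, b = MinHashSketch(), MinHashSketch()
for x in range(0, 1000):
    a.update(x)
for x in range(100, 1100):
    b.update(x)
print(a.jaccard(b))  # close to 0.82, up to sampling error of the 128 registers
```

    Each register here is a full machine-word minimum; the point of MaxLogHash, per the abstract, is that for streaming sets registers of fewer than 7 bits suffice for accurate estimates of high similarities.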

    Author recognition using Locality Sensitive Hashing & Alergia (Stochastic Finite Automata)

    In today’s world, data grows very fast, which makes it difficult to answer questions such as: 1) Was this content written entirely by this author? 2) Did the author take a few sentences or pages from another author? 3) Is there any way to identify the actual author? Many plagiarism tools on the market identify duplicate content, but they do not understand the writing patterns involved, so there is still a need to find the original author. Locality-sensitive hashing is one standard technique for applying hashing to recognize an author's writing pattern.
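
    The abstract gives no implementation details, so the following is only a hedged illustration of the hashing side of such a pipeline: two writing samples are reduced to character n-gram shingle sets and compared by Jaccard similarity, the quantity a MinHash/LSH index would approximate at scale. The shingle length and the sample texts are made up for the example.

```python
def shingles(text, n=4):
    """Character n-gram shingles of a text sample (n=4 is an arbitrary choice)."""
    text = " ".join(text.lower().split())  # normalize case and whitespace
    return {text[i:i + n] for i in range(len(text) - n + 1)}

def jaccard(a, b):
    """Jaccard similarity of two shingle sets."""
    return len(a & b) / len(a | b) if (a or b) else 0.0

# Hypothetical writing samples: one of known authorship, one disputed.
known_sample = "The fog settled over the harbour while the bells kept ringing."
disputed_text = "A fog settled over the harbour as the bells kept on ringing."
print(jaccard(shingles(known_sample), shingles(disputed_text)))
```

    In practice one would bucket MinHash signatures of these shingle sets with an LSH index rather than compare all pairs directly; the Alergia / stochastic-finite-automaton side of the title is not sketched here.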

    Hardness of Bichromatic Closest Pair with Jaccard Similarity

    Consider collections $\mathcal{A}$ and $\mathcal{B}$ of red and blue sets, respectively. Bichromatic Closest Pair is the problem of finding a pair from $\mathcal{A}\times \mathcal{B}$ that has similarity higher than a given threshold according to some similarity measure. Our focus here is the classic Jaccard similarity $|\textbf{a}\cap \textbf{b}|/|\textbf{a}\cup \textbf{b}|$ for $(\textbf{a},\textbf{b})\in \mathcal{A}\times \mathcal{B}$. We consider the approximate version of the problem where we are given thresholds $j_1>j_2$ and wish to return a pair from $\mathcal{A}\times \mathcal{B}$ that has Jaccard similarity higher than $j_2$ if there exists a pair in $\mathcal{A}\times \mathcal{B}$ with Jaccard similarity at least $j_1$. The classic locality-sensitive hashing (LSH) algorithm of Indyk and Motwani (STOC '98), instantiated with the MinHash LSH function of Broder et al., solves this problem in $\tilde O(n^{2-\delta})$ time if $j_1\ge j_2^{1-\delta}$. In particular, for $\delta=\Omega(1)$, the approximation ratio $j_1/j_2=1/j_2^{\delta}$ increases polynomially in $1/j_2$. In this paper we give a corresponding hardness result. Assuming the Orthogonal Vectors Conjecture (OVC), we show that there cannot be a general solution that solves the Bichromatic Closest Pair problem in $O(n^{2-\Omega(1)})$ time for $j_1/j_2=1/j_2^{o(1)}$. Specifically, assuming OVC, we prove that for any $\delta>0$ there exists an $\varepsilon>0$ such that Bichromatic Closest Pair with Jaccard similarity requires time $\Omega(n^{2-\delta})$ for any choice of thresholds $j_2<j_1<1-\delta$ that satisfy $j_1\le j_2^{1-\varepsilon}$.
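
    For the upper bound quoted above, the arithmetic is the standard Indyk–Motwani LSH analysis (not a contribution of this paper): a single MinHash function collides on two sets with probability equal to their Jaccard similarity, so with collision-probability thresholds $p_1 = j_1$ and $p_2 = j_2$ the LSH exponent is
    $$\rho = \frac{\log(1/j_1)}{\log(1/j_2)},$$
    giving total time $\tilde O(n^{1+\rho})$. If $j_1 \ge j_2^{1-\delta}$, then $\log(1/j_1) \le (1-\delta)\log(1/j_2)$, hence $\rho \le 1-\delta$ and the running time is $\tilde O(n^{2-\delta})$, which is the bound the hardness result complements.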

    Analysis of SparseHash: an efficient embedding of set-similarity via sparse projections

    Embeddings provide compact representations of signals in order to perform efficient inference in a wide variety of tasks. In particular, random projections are common tools to construct Euclidean distance-preserving embeddings, while hashing techniques are extensively used to embed set-similarity metrics, such as the Jaccard coefficient. In this letter, we theoretically prove that a class of random projections based on sparse matrices, called SparseHash, can preserve the Jaccard coefficient between the supports of sparse signals, which can be used to estimate set similarities. Beyond the analysis, we provide an efficient implementation and test its performance in several numerical experiments on both synthetic and real datasets.
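
    To make the preserved quantity concrete, the snippet below simply computes the Jaccard coefficient between the supports of two sparse signals directly; it is not the SparseHash embedding itself, and the signal sizes, sparsity levels, and zero threshold are illustrative.

```python
import numpy as np

def support_jaccard(x, y, tol=1e-12):
    """Jaccard coefficient between the supports of two (sparse) signals.
    tol is an illustrative threshold for treating an entry as zero."""
    sx = np.abs(x) > tol
    sy = np.abs(y) > tol
    inter = np.logical_and(sx, sy).sum()
    union = np.logical_or(sx, sy).sum()
    return inter / union if union else 0.0

# Two sparse signals with partially overlapping supports (toy data).
rng = np.random.default_rng(0)
x = np.zeros(1000)
y = np.zeros(1000)
idx = rng.choice(1000, size=60, replace=False)
x[idx[:40]] = rng.standard_normal(40)   # support of x: first 40 chosen indices
y[idx[20:]] = rng.standard_normal(40)   # support of y: last 40 (20 shared)
print(support_jaccard(x, y))            # 20 / 60 ≈ 0.33
```

    A SparseHash-style embedding would aim to estimate this value from short sketches of x and y rather than from the full signals.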

    Scalable and Robust Set Similarity Join

    Set similarity join is a fundamental and well-studied database operator. It is usually studied in the exact setting, where the goal is to compute all pairs of sets that exceed a given similarity threshold (measured, e.g., as Jaccard similarity). But set similarity join is often used in settings where 100% recall may not be important; indeed, the exact set similarity join is itself only an approximation of the desired result set. We present a new randomized algorithm for set similarity join that can achieve any desired recall up to 100%, and show theoretically and empirically that it significantly improves on existing methods. The current state-of-the-art exact methods are based on prefix filtering, whose performance depends on the data set having many rare tokens; our method is robust to the absence of such structure in the data. At 90% recall our algorithm is often more than an order of magnitude faster than state-of-the-art exact methods, depending on how well a data set lends itself to prefix filtering. Our experiments on benchmark data sets also show that the method is several times faster than comparable approximate methods. Our algorithm makes use of recent theoretical advances in high-dimensional sketching and indexing that we believe to be of wider relevance to the data engineering community.
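
    The paper's algorithm is not reproduced here; as a generic point of reference only, the sketch below shows the classic MinHash-LSH recipe for an approximate set similarity join: banding for candidate generation, then exact Jaccard verification. The band and row counts are illustrative knobs that trade recall against candidate volume.

```python
import random
from collections import defaultdict

def minhash_signature(s, salts):
    """MinHash signature of a non-empty set s: one minimum per salted hash."""
    return [min(hash((salt, x)) for x in s) for salt in salts]

def approximate_similarity_join(sets, threshold=0.5, bands=16, rows=4, seed=7):
    """Candidate generation via LSH banding, then exact Jaccard verification.
    Generic MinHash-LSH join for illustration only (not the paper's method)."""
    rng = random.Random(seed)
    salts = [rng.getrandbits(64) for _ in range(bands * rows)]
    signatures = {key: minhash_signature(s, salts) for key, s in sets.items()}

    candidates = set()
    for b in range(bands):
        buckets = defaultdict(list)
        for key, sig in signatures.items():
            buckets[tuple(sig[b * rows:(b + 1) * rows])].append(key)
        for keys in buckets.values():
            for i in range(len(keys)):
                for j in range(i + 1, len(keys)):
                    candidates.add((keys[i], keys[j]))

    # Verification: keep only pairs whose exact Jaccard meets the threshold.
    return [(a, b) for a, b in candidates
            if len(sets[a] & sets[b]) / len(sets[a] | sets[b]) >= threshold]

# Toy usage: doc1 and doc2 are highly similar, doc3 is unrelated.
sets = {
    "doc1": set(range(0, 100)),
    "doc2": set(range(10, 110)),
    "doc3": set(range(500, 600)),
}
print(approximate_similarity_join(sets, threshold=0.5))  # expect [('doc1', 'doc2')]
```

    More bands with fewer rows per band raise recall at the cost of more candidates to verify; the paper's contribution is an algorithm that hits a target recall robustly without relying on prefix-filtering-friendly data.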
    • …