A Memory-Efficient Sketch Method for Estimating High Similarities in Streaming Sets
Estimating set similarity and detecting highly similar sets are fundamental
problems in areas such as databases, machine learning, and information
retrieval. MinHash is a well-known technique for approximating Jaccard
similarity of sets and has been successfully used for many applications such as
similarity search and large scale learning. Its two compressed versions, b-bit
MinHash and Odd Sketch, can significantly reduce the memory usage of the
original MinHash method, especially for estimating high similarities (i.e.,
similarities around 1). MinHash can be applied to static sets as well as to
streaming sets, whose elements arrive in a streaming fashion and whose
cardinality is unknown or even infinite. Unfortunately, b-bit MinHash and Odd
Sketch fail to deal with streaming data. To solve this problem, we design a
memory efficient sketch method, MaxLogHash, to accurately estimate Jaccard
similarities in streaming sets. Compared to MinHash, our method uses smaller
registers (each register consists of fewer than 7 bits) to build a compact
sketch for each set. We also provide a simple yet accurate estimator for
inferring Jaccard similarity from MaxLogHash sketches. In addition, we derive
formulas for bounding the estimation error and determine the smallest necessary
memory usage (i.e., the number of registers used for a MaxLogHash sketch) for
the desired accuracy. We conduct experiments on a variety of datasets, and
experimental results show that our method MaxLogHash is about 5 times more
memory efficient than MinHash with the same accuracy and computational cost for
estimating high similarities.
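As a point of reference for the abstract above, here is a minimal sketch of plain MinHash on streaming sets. It is not MaxLogHash, whose register layout and estimator are not given in the abstract. The class name `MinHashSketch`, the helper `_h`, and the choice of salted BLAKE2b as the hash family are assumptions made for this illustration.

```python
import hashlib


def _h(i, item):
    """Hypothetical i-th hash function: BLAKE2b salted with the register index."""
    salt = i.to_bytes(8, "little")
    digest = hashlib.blake2b(repr(item).encode(), digest_size=8, salt=salt).digest()
    return int.from_bytes(digest, "little")


class MinHashSketch:
    """Plain MinHash: register i keeps the minimum of hash function i over the stream."""

    def __init__(self, k):
        self.registers = [None] * k  # None marks a register that has seen no element

    def update(self, item):
        # Process one stream element; duplicates leave the sketch unchanged.
        for i, current in enumerate(self.registers):
            value = _h(i, item)
            if current is None or value < current:
                self.registers[i] = value


def estimate_jaccard(s, t):
    """Fraction of registers on which two sketches of equal size agree."""
    matches = sum(1 for a, b in zip(s.registers, t.registers) if a == b)
    return matches / len(s.registers)
```

Each register here holds a 64-bit value, which is the memory cost that compressed sketches such as the one described above aim to reduce. The standard error of the estimate is roughly sqrt(J(1 - J) / k) for k registers and true similarity J.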
Author recognition using Locality Sensitive Hashing & Alergia (Stochastic Finite Automata)
In today’s world, data grows very fast. It is difficult to answer questions such as: 1) Was the content written entirely by this author? 2) Did the author take a few sentences or pages from another author? 3) Is there a way to identify the actual author? Many plagiarism-detection tools on the market identify duplicate content, but they do not capture the writing pattern involved. There is always a need to find the original author. Locality sensitive hashing is one standard technique for applying hashing to recognize an author's writing pattern.
Hardness of Bichromatic Closest Pair with Jaccard Similarity
Consider collections $\mathcal{A}$ and $\mathcal{B}$ of red and blue sets,
respectively. Bichromatic Closest Pair is the problem of finding a pair from
$\mathcal{A} \times \mathcal{B}$ that has similarity higher than a given
threshold according to some similarity measure. Our focus here is the classic
Jaccard similarity $|\mathbf{a} \cap \mathbf{b}| / |\mathbf{a} \cup \mathbf{b}|$
for $(\mathbf{a}, \mathbf{b}) \in \mathcal{A} \times \mathcal{B}$.
We consider the approximate version of the problem where we are given
thresholds $j_1 > j_2$ and wish to return a pair from $\mathcal{A} \times \mathcal{B}$ that has Jaccard similarity higher than $j_2$ if there exists a
pair in $\mathcal{A} \times \mathcal{B}$ with Jaccard similarity at least $j_1$.
The classic locality sensitive hashing (LSH) algorithm of Indyk and Motwani
(STOC '98), instantiated with the MinHash LSH function of Broder et al., solves
this problem in $\tilde{O}(n^{2-\delta})$ time if $j_1 \ge j_2^{1-\delta}$. In
particular, for $\delta = \Omega(1)$, the approximation ratio
$j_1 / j_2 = 1 / j_2^{\delta}$ increases polynomially in $1 / j_2$.
In this paper we give a corresponding hardness result. Assuming the
Orthogonal Vectors Conjecture (OVC), we show that there cannot be a general
solution that solves the Bichromatic Closest Pair problem in
$O(n^{2-\Omega(1)})$ time for $j_1 / j_2 = 1 / j_2^{o(1)}$. Specifically, assuming
OVC, we prove that for any $\delta > 0$ there exists an $\varepsilon > 0$ such that
Bichromatic Closest Pair with Jaccard similarity requires time
$\Omega(n^{2-\delta})$ for any choice of thresholds $j_2 < j_1 < 1 - \delta$ that
satisfy $j_1 \le j_2^{1-\varepsilon}$.
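To make the problem statement concrete, the following sketch shows the quadratic brute-force baseline for Bichromatic Closest Pair under Jaccard similarity, which is the running time that the hardness result above relates to. It is an illustration only and not an algorithm from the paper. The function names and the parameter name `low` (the lower of the two thresholds) are assumptions introduced here.

```python
def jaccard(a, b):
    """Jaccard similarity |a & b| / |a | b|; defined as 0.0 for two empty sets."""
    union = len(a | b)
    return len(a & b) / union if union else 0.0


def bichromatic_closest_pair(red, blue, low):
    """Scan all red-blue pairs and return the most similar one if it beats `low`.

    Because the best pair overall is returned, a pair with similarity above
    `low` is found whenever any pair reaches the higher threshold of the
    approximate problem. The scan costs one comparison per red-blue pair.
    """
    best_pair, best_sim = None, low
    for a in red:
        for b in blue:
            sim = jaccard(a, b)
            if sim > best_sim:  # strictly higher than the current bar
                best_pair, best_sim = (a, b), sim
    return best_pair
```

For example, with `red = [{1, 2, 3}, {7, 8}]` and `blue = [{2, 3, 4}, {9}]`, the only overlapping pair is `({1, 2, 3}, {2, 3, 4})` with similarity 2/4, so it is returned for `low=0.4` and nothing is returned for `low=0.5`.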
Analysis of SparseHash: an efficient embedding of set-similarity via sparse projections
Embeddings provide compact representations of signals in order to perform
efficient inference in a wide variety of tasks. In particular, random
projections are common tools to construct Euclidean distance-preserving
embeddings, while hashing techniques are extensively used to embed
set-similarity metrics, such as the Jaccard coefficient. In this letter, we
theoretically prove that a class of random projections based on sparse
matrices, called SparseHash, can preserve the Jaccard coefficient between the
supports of sparse signals, which can be used to estimate set similarities.
Moreover, besides the analysis, we provide an efficient implementation and we
test the performance in several numerical experiments, both on synthetic and
real datasets. Comment: 25 pages, 6 figures
Scalable and Robust Set Similarity Join
Set similarity join is a fundamental and well-studied database operator. It
is usually studied in the exact setting where the goal is to compute all pairs
of sets that exceed a given similarity threshold (measured e.g. as Jaccard
similarity). But set similarity join is often used in settings where 100%
recall may not be important --- indeed, where the exact set similarity join is
itself only an approximation of the desired result set.
We present a new randomized algorithm for set similarity join that can
achieve any desired recall up to 100%, and show theoretically and empirically
that it significantly improves on existing methods. The present
state-of-the-art exact methods are based on prefix-filtering, the performance
of which depends on the data set having many rare tokens. Our method is robust
against the absence of such structure in the data. At 90% recall our algorithm
is often more than an order of magnitude faster than state-of-the-art exact
methods, depending on how well a data set lends itself to prefix filtering. Our
experiments on benchmark data sets also show that the method is several times
faster than comparable approximate methods. Our algorithm makes use of recent
theoretical advances in high-dimensional sketching and indexing that we believe
to be of wider relevance to the data engineering community.
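The prefix-filtering idea that the abstract attributes to existing exact methods can be sketched as follows. This is a textbook-style illustration of the exact baseline, not the randomized algorithm the paper proposes. The function name `prefix_filter_join` and the tie-breaking rule are assumptions, and the threshold is assumed to lie in (0, 1].

```python
import math
from collections import Counter, defaultdict


def jaccard(a, b):
    """Jaccard similarity of two sets; 0.0 when both are empty."""
    union = len(a | b)
    return len(a & b) / union if union else 0.0


def prefix_filter_join(sets, threshold):
    """Exact self-join: all index pairs (i, j), i < j, with Jaccard >= threshold."""
    freq = Counter(tok for s in sets for tok in s)
    index = defaultdict(list)  # token -> ids of earlier sets with it in their prefix
    results = set()
    for i, s in enumerate(sets):
        if not s:
            continue
        # Global token order: rarest first, ties broken by repr for determinism.
        toks = sorted(s, key=lambda tok: (freq[tok], repr(tok)))
        # Two sets meeting the threshold must share a token within these prefixes.
        # The small epsilon guards against float round-up shortening the prefix.
        prefix_len = len(toks) - math.ceil(threshold * len(toks) - 1e-9) + 1
        candidates = set()
        for tok in toks[:prefix_len]:
            candidates.update(index[tok])
            index[tok].append(i)
        # Verify each candidate exactly; the filter only prunes, it never decides.
        for j in candidates:
            if jaccard(sets[j], s) >= threshold:
                results.add((j, i))
    return results
```

The filter prunes well only when prefixes contain rare tokens, since rare tokens produce short index lists. When every token is common, nearly all pairs become candidates, which is the dependence on data structure that the abstract points out.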