Improved Densification of One Permutation Hashing
The existing work on densification of one permutation hashing reduces the
query processing cost of the (K, L)-parameterized Locality Sensitive Hashing
(LSH) algorithm with minwise hashing, from O(dKL) to merely O(d + KL),
where d is the number of nonzeros of the data vector, K is the number of
hashes in each hash table, and L is the number of hash tables. While that is
a substantial improvement, our analysis reveals that the existing densification
scheme is sub-optimal. In particular, there is not enough randomness in that
procedure, which affects its accuracy on very sparse datasets.
In this paper, we provide a new densification procedure which is provably
better than the existing scheme. This improvement is more significant for very
sparse datasets which are common over the web. The improved technique has the
same cost of O(d + KL) for query processing, thereby making it strictly
preferable over the existing procedure. Experimental evaluations on public
datasets, in the task of hashing-based near neighbor search, support our
theoretical findings.
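As a concrete illustration of the procedure the abstract discusses, here is a minimal Python sketch of one permutation hashing with a simple densification step that fills each empty bin from a bin-specific random direction. This is a toy stand-in, not the paper's exact construction, and it assumes elements are integer-encoded features:

```python
import random

def oph_sketch(elements, k, seed=0, universe=2**31 - 1):
    """One permutation hashing: hash every element once, split the hash
    range into k bins, and keep the minimum hash value seen in each bin.
    Empty bins are then filled by "densification": each empty bin borrows
    from the nearest non-empty bin in a bin-specific random direction
    (illustrative only; the paper's randomized scheme differs in detail)."""
    rng = random.Random(seed)
    a, b = rng.randrange(1, universe), rng.randrange(universe)
    width = (universe + k - 1) // k        # hash-range slice per bin
    bins = [None] * k
    for x in elements:                     # a single pass over the data
        h = (a * x + b) % universe
        i, v = h // width, h % width
        if bins[i] is None or v < bins[i]:
            bins[i] = v
    for i in range(k):                     # densify empty bins
        if bins[i] is None:
            step = random.Random((seed, i)).choice([1, -1])
            j = i
            while bins[j] is None:
                j = (j + step) % k
            bins[i] = bins[j]
    return bins

def estimate_jaccard(sk1, sk2):
    """The fraction of agreeing bins estimates the Jaccard similarity."""
    return sum(u == v for u, v in zip(sk1, sk2)) / len(sk1)
```

With the same seed, both sets go through the same permutation and densification, so agreeing bins indicate shared elements.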
Practical Hash Functions for Similarity Estimation and Dimensionality Reduction
Hashing is a basic tool for dimensionality reduction employed in several
aspects of machine learning. However, the performance analysis is often carried
out under the abstract assumption that a truly random unit cost hash function
is used, without concern for which concrete hash function is employed. The
concrete hash function may work fine on sufficiently random input. The question
is if it can be trusted in the real world when faced with more structured
input.
In this paper we focus on two prominent applications of hashing, namely
similarity estimation with the one permutation hashing (OPH) scheme of Li et
al. [NIPS'12] and feature hashing (FH) of Weinberger et al. [ICML'09], both of
which have found numerous applications, e.g. in approximate near-neighbour
search with LSH and large-scale classification with SVM.
We consider mixed tabulation hashing of Dahlgaard et al.[FOCS'15] which was
proved to perform like a truly random hash function in many applications,
including OPH. Here we first show improved concentration bounds for FH with
truly random hashing and then argue that mixed tabulation performs similarly for
sparse input. Our main contribution, however, is an experimental comparison of
different hashing schemes when used inside FH, OPH, and LSH.
We find that mixed tabulation hashing is almost as fast as the
multiply-mod-prime scheme ax+b mod p. Multiply-mod-prime is guaranteed to work
well on sufficiently random data, but we demonstrate that in the above
applications, it can lead to bias and poor concentration on both real-world and
synthetic data. We also compare with the popular MurmurHash3, which has no
proven guarantees. Mixed tabulation and MurmurHash3 both perform similarly to
truly random hashing in our experiments. However, mixed tabulation is 40%
faster than MurmurHash3, and it has the proven guarantee of good performance on
all possible input. Comment: A preliminary version of this paper will appear at NIPS 2017.
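For concreteness, the multiply-mod-prime scheme the abstract compares against can be written in a few lines. The Mersenne prime and the seed handling below are illustrative choices, not prescribed by the paper:

```python
import random

# Multiply-mod-prime: the classic 2-independent scheme
# h(x) = ((a*x + b) mod p) mod m, with p prime and a, b drawn once at random.
P = (1 << 61) - 1  # a Mersenne prime larger than typical integer keys

def make_multiply_mod_prime(m, seed=0):
    """Return a hash function from integers to {0, ..., m-1}."""
    rng = random.Random(seed)
    a = rng.randrange(1, P)  # a != 0 so the map x -> a*x + b is injective mod P
    b = rng.randrange(P)
    def h(x):
        return ((a * x + b) % P) % m
    return h
```

Two-independence of this scheme suffices for unbiased expectations, but, as the abstract reports, not for good concentration on structured inputs.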
Fast Similarity Sketching
We consider the Similarity Sketching problem: Given a universe [u], we want a
random function S mapping subsets A ⊆ [u] into vectors S(A) of size t, such
that similarity is preserved. More precisely: Given sets A, B ⊆ [u], define
X_i = [S(A)[i] = S(B)[i]] and X = X_1 + ... + X_t. We want E[X] = t · J(A, B),
where J(A, B) = |A ∩ B| / |A ∪ B| is the Jaccard similarity, and furthermore
to have strong concentration guarantees (i.e. Chernoff-style bounds) for X.
This is a fundamental problem
which has found numerous applications in data mining, large-scale
classification, computer vision, similarity search, etc. via the classic
MinHash algorithm. The vectors S(A) are also called sketches.
The seminal MinHash algorithm uses t random hash functions h_1, ..., h_t,
and stores (min h_1(A), ..., min h_t(A)) as the sketch of A. The main
drawback of MinHash is, however, its O(t · |A|) running time, and finding a
sketch with similar
properties and faster running time has been the subject of several papers.
Addressing this, Li et al. [NIPS'12] introduced one permutation hashing (OPH),
which creates a sketch of size t in O(t + |A|) time, but with the drawback
that possibly some of the entries are "empty" when |A| is small compared to t
(by a coupon-collector argument, all t bins are hit only when |A| = Ω(t log t)). One could
argue that sketching is not necessary in this case, however the desire in most
applications is to have one sketching procedure that works for sets of all
sizes. Therefore, filling out these empty entries is the subject of several
follow-up papers initiated by Shrivastava and Li [ICML'14]. However, these
"densification" schemes fail to provide good concentration bounds exactly in
the case , where they are needed. (continued...
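The classic t-function MinHash baseline described above can be sketched in a few lines of Python. The linear hashes below stand in for the random hash functions and are only approximately min-wise, so this is an illustration of the O(t · |A|) baseline, not of the paper's faster construction:

```python
import random

def minhash_sketch(s, t, seed=0):
    """Classic MinHash: t independent hash functions; the sketch stores the
    minimum hash of the set under each one. Note the O(t * |A|) cost: every
    element is hashed t times, which is exactly the bottleneck that fast
    similarity sketching addresses."""
    p = (1 << 61) - 1
    rng = random.Random(seed)
    coeffs = [(rng.randrange(1, p), rng.randrange(p)) for _ in range(t)]
    return [min((a * x + b) % p for x in s) for a, b in coeffs]

def jaccard_estimate(sk1, sk2):
    """Each coordinate agrees with probability J(A, B), so the fraction of
    agreeing coordinates is an estimator of the Jaccard similarity."""
    return sum(u == v for u, v in zip(sk1, sk2)) / len(sk1)
```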
Differentially Private One Permutation Hashing and Bin-wise Consistent Weighted Sampling
Minwise hashing (MinHash) is a standard algorithm widely used in the
industry, for large-scale search and learning applications with the binary
(0/1) Jaccard similarity. One common use of MinHash is for processing massive
n-gram text representations so that practitioners do not have to materialize
the original data (which would be prohibitive). Another popular use of MinHash
is for building hash tables to enable sub-linear time approximate near neighbor
(ANN) search. MinHash has also been used as a tool for building large-scale
machine learning systems. The standard implementation of MinHash requires
applying K random permutations. In comparison, the method of one permutation
hashing (OPH) is an efficient alternative to MinHash, which splits the data
vectors into K bins and generates hash values within each bin. OPH is
substantially more efficient and also more convenient to use.
In this paper, we combine the differential privacy (DP) with OPH (as well as
MinHash), to propose the DP-OPH framework with three variants: DP-OPH-fix,
DP-OPH-re and DP-OPH-rand, depending on which densification strategy is adopted
to deal with empty bins in OPH. A detailed roadmap to the algorithm design is
presented along with the privacy analysis. An analytical comparison of our
proposed DP-OPH methods with the DP minwise hashing (DP-MH) is provided to
justify the advantage of DP-OPH. Experiments on similarity search confirm the
merits of DP-OPH, and guide the choice of the proper variant in different
practical scenarios. Our technique is also extended to bin-wise consistent
weighted sampling (BCWS) to develop a new DP algorithm called DP-BCWS for
non-binary data. Experiments on classification tasks demonstrate that DP-BCWS
is able to achieve excellent utility at around ε = 5 ~ 10, where ε
is the standard parameter in the language of (ε, δ)-DP.
A Memory-Efficient Sketch Method for Estimating High Similarities in Streaming Sets
Estimating set similarity and detecting highly similar sets are fundamental
problems in areas such as databases, machine learning, and information
retrieval. MinHash is a well-known technique for approximating Jaccard
similarity of sets and has been successfully used for many applications such as
similarity search and large scale learning. Its two compressed versions, b-bit
MinHash and Odd Sketch, can significantly reduce the memory usage of the
original MinHash method, especially for estimating high similarities (i.e.,
similarities around 1). Although MinHash can be applied to static sets as well
as streaming sets, whose elements arrive in a streaming fashion and whose
cardinality is unknown or even infinite, b-bit MinHash and Odd Sketch
unfortunately fail to deal with streaming data. To solve this problem, we design a
memory efficient sketch method, MaxLogHash, to accurately estimate Jaccard
similarities in streaming sets. Compared to MinHash, our method uses smaller
sized registers (each register consists of fewer than 7 bits) to build a compact
sketch for each set. We also provide a simple yet accurate estimator for
inferring Jaccard similarity from MaxLogHash sketches. In addition, we derive
formulas for bounding the estimation error and determine the smallest necessary
memory usage (i.e., the number of registers used for a MaxLogHash sketch) for
the desired accuracy. We conduct experiments on a variety of datasets, and
experimental results show that our method MaxLogHash is about 5 times more
memory efficient than MinHash, with the same accuracy and computational cost,
for estimating high similarities.
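The b-bit MinHash compression mentioned above keeps only the low-order bits of each MinHash register and corrects for accidental low-bit collisions. The sketch below shows that standard idea (the 2^-b correction is the usual large-universe approximation); MaxLogHash's own register construction is not reproduced here:

```python
def bbit_compress(sketch, b):
    """b-bit MinHash: keep only the lowest b bits of each MinHash register,
    shrinking memory from full-width integers to b bits per register."""
    mask = (1 << b) - 1
    return [v & mask for v in sketch]

def bbit_jaccard_estimate(sk1, sk2, b):
    """Registers match with probability ~ C + (1 - C) * J, where C = 2**-b is
    the chance that two *different* minima collide in their low b bits
    (large-universe approximation). Invert that relation to recover J."""
    c = 2.0 ** -b
    p_hat = sum(u == v for u, v in zip(sk1, sk2)) / len(sk1)
    return max(0.0, (p_hat - c) / (1 - c))
```

The smaller b is, the larger the collision correction, which is why this compression works best for high similarities, as the abstract notes.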
Consistent Weighted Sampling Made Fast, Small, and Easy
Document sketching using Jaccard similarity has been an effective technique
for reducing near-duplicates in Web page and image search results, and
has also proven useful in file system synchronization, compression and learning
applications.
Min-wise sampling can be used to derive an unbiased estimator for Jaccard
similarity and taking a few hundred independent consistent samples leads to
compact sketches which provide good estimates of pairwise-similarity.
Subsequent works extended this technique to weighted sets and show how to
produce samples with only a constant number of hash evaluations for any
element, independent of its weight. Another improvement by Li et al. shows how
to speedup sketch computations by computing many (near-)independent samples in
one shot. Unfortunately this latter improvement works only for the unweighted
case.
In this paper we give a simple, fast and accurate procedure which reduces
weighted sets to unweighted sets with small impact on the Jaccard similarity.
This leads to compact sketches consisting of many (near-)independent weighted
samples which can be computed with just a small constant number of hash
function evaluations per weighted element. The size of the produced unweighted
set is furthermore a tunable parameter which enables us to run the unweighted
scheme of Li et al. in the regime where it is most efficient. Even when the
sets involved are unweighted, our approach gives a simple solution to the
densification problem that other works attempted to address.
Unlike previously known schemes, ours does not result in an unbiased
estimator. However, we prove that the bias introduced by our reduction is
negligible and that the standard deviation is comparable to the unweighted
case. We also empirically evaluate our scheme and show that it gives
significant gains in computational efficiency, without any measurable loss in
accuracy.
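To illustrate the kind of reduction the paper describes, here is a deliberately naive quantization from weighted to unweighted sets: an element of weight w becomes roughly w/delta distinct copies, so the plain Jaccard of the expanded sets approximates the weighted (min/max) Jaccard. The paper's actual reduction achieves this with a much smaller, tunable output set; delta here is an illustrative resolution parameter:

```python
def quantize(weighted, delta):
    """Naive weighted-to-unweighted reduction: element e with weight w
    becomes round(w / delta) distinct copies (e, 0), (e, 1), ...
    Finer delta -> better approximation but a larger unweighted set."""
    out = set()
    for e, w in weighted.items():
        for level in range(int(round(w / delta))):
            out.add((e, level))
    return out

def weighted_jaccard(u, v):
    """Generalized (min/max) Jaccard similarity of weighted sets."""
    keys = set(u) | set(v)
    num = sum(min(u.get(k, 0.0), v.get(k, 0.0)) for k in keys)
    den = sum(max(u.get(k, 0.0), v.get(k, 0.0)) for k in keys)
    return num / den

def unweighted_jaccard(a, b):
    """Plain Jaccard similarity of unweighted sets."""
    return len(a & b) / len(a | b)
```

When all weights are exact multiples of delta, the two similarities coincide; otherwise rounding introduces the small, provably negligible bias the abstract refers to.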