8 research outputs found
Fast Similarity Sketching
We consider the Similarity Sketching problem: Given a universe $[u]=\{0,\dots,u-1\}$ we want a random function $S$ mapping subsets $A\subseteq[u]$ into vectors $S(A)$ of size $t$, such that similarity is preserved. More
precisely: Given sets $A,B\subseteq[u]$, define $X_i=[S(A)[i]=S(B)[i]]$ and
$X=\sum_{i\in[t]}X_i$. We want to have $E[X]=t\cdot J(A,B)$, where
$J(A,B)=|A\cap B|/|A\cup B|$ is the Jaccard similarity of $A$ and $B$, and furthermore to have strong concentration
guarantees (i.e. Chernoff-style bounds) for $X$. This is a fundamental problem
which has found numerous applications in data mining, large-scale
classification, computer vision, similarity search, etc. via the classic
MinHash algorithm. The vectors $S(A)$ are also called sketches.
The seminal MinHash algorithm uses $t$ random hash functions
$h_1,\dots,h_t$, and stores $(\min_{a\in A}h_1(a),\dots,\min_{a\in A}h_t(a))$ as the sketch of $A$. The main drawback of MinHash is,
however, its $O(t\cdot|A|)$ running time, and finding a sketch with similar
properties and faster running time has been the subject of several papers.
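For readers who want the classic scheme in code, here is a minimal MinHash sketch in Python; salting Python's built-in hash per coordinate is only a stand-in for the $t$ independent random hash functions $h_1,\dots,h_t$ assumed in the analysis.

    import random

    def minhash_sketch(A, t, seed=0):
        """Return a t-entry MinHash sketch of the set A.

        Entry i stores min_{a in A} h_i(a), where h_i is modelled by hashing
        the pair (salt_i, a). Running time is O(t * |A|), the drawback noted above.
        """
        rng = random.Random(seed)
        salts = [rng.getrandbits(64) for _ in range(t)]
        return [min(hash((salts[i], a)) for a in A) for i in range(t)]

    def estimate_jaccard(sketch_a, sketch_b):
        """Fraction of agreeing coordinates; its expectation is J(A, B)."""
        agree = sum(x == y for x, y in zip(sketch_a, sketch_b))
        return agree / len(sketch_a)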
Addressing this, Li et al. [NIPS'12] introduced one permutation hashing (OPH),
which creates a sketch of size $t$ in $O(t+|A|)$ time, but with the drawback
that possibly some of the $t$ entries are "empty" when $|A|=o(t\log t)$. One could
argue that sketching is not necessary in this case, however the desire in most
applications is to have one sketching procedure that works for sets of all
sizes. Therefore, filling out these empty entries is the subject of several
follow-up papers initiated by Shrivastava and Li [ICML'14]. However, these
"densification" schemes fail to provide good concentration bounds exactly in
the case $|A|=o(t\log t)$, where they are needed. (continued...)
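To make the contrast with MinHash concrete, the following is a minimal illustration of the one permutation hashing idea: a single hash value both assigns each element to one of $t$ buckets and supplies the value minimised within that bucket, so the whole set is processed in one pass. This is our reading of the scheme's structure, not the authors' code; buckets that receive no element are left as None, which is exactly the "empty entry" issue discussed above.

    import random

    def oph_sketch(A, t, seed=0):
        """One-pass sketch of size t: bucket each element by hash and keep the
        minimum value per bucket. Runs in O(t + |A|) time, but buckets that no
        element of A lands in stay empty (None), which is likely when |A| is
        small compared to t."""
        rng = random.Random(seed)
        salt = rng.getrandbits(64)
        sketch = [None] * t
        for a in A:
            h = hash((salt, a)) & ((1 << 64) - 1)  # non-negative 64-bit value
            b = h % t    # bucket the element is assigned to
            v = h // t   # value minimised within that bucket
            if sketch[b] is None or v < sketch[b]:
                sketch[b] = v
        return sketch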
Practical Hash Functions for Similarity Estimation and Dimensionality Reduction
Hashing is a basic tool for dimensionality reduction employed in several
aspects of machine learning. However, the performance analysis is often carried
out under the abstract assumption that a truly random unit cost hash function
is used, without concern for which concrete hash function is employed. The
concrete hash function may work fine on sufficiently random input. The question
is if it can be trusted in the real world when faced with more structured
input.
In this paper we focus on two prominent applications of hashing, namely
similarity estimation with the one permutation hashing (OPH) scheme of Li et
al. [NIPS'12] and feature hashing (FH) of Weinberger et al. [ICML'09], both of
which have found numerous applications, e.g., in approximate near-neighbour
search with LSH and large-scale classification with SVM.
We consider mixed tabulation hashing of Dahlgaard et al. [FOCS'15], which was
proved to perform like a truly random hash function in many applications,
including OPH. Here we first show improved concentration bounds for FH with
truly random hashing and then argue that mixed tabulation performs similarly for
sparse input. Our main contribution, however, is an experimental comparison of
different hashing schemes when used inside FH, OPH, and LSH.
We find that mixed tabulation hashing is almost as fast as the
multiply-mod-prime scheme ax+b mod p. Multiply-mod-prime is guaranteed to work
well on sufficiently random data, but we demonstrate that in the above
applications, it can lead to bias and poor concentration on both real-world and
synthetic data. We also compare with the popular MurmurHash3, which has no
proven guarantees. Mixed tabulation and MurmurHash3 both perform similarly to
truly random hashing in our experiments. However, mixed tabulation is 40%
faster than MurmurHash3, and it has the proven guarantee of good performance on
all possible input.
Comment: Preliminary version of this paper will appear at NIPS 2017.
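For reference, the multiply-mod-prime scheme compared against above is the textbook 2-independent hash function; a minimal version looks as follows (the particular Mersenne prime is an arbitrary illustrative choice).

    import random

    # 2-independent multiply-mod-prime hashing: h(x) = ((a*x + b) mod p) mod m.
    # The Mersenne prime p = 2^61 - 1 is an arbitrary illustrative choice.
    P = (1 << 61) - 1

    def make_multiply_mod_prime(m, seed=0):
        rng = random.Random(seed)
        a = rng.randrange(1, P)  # a != 0
        b = rng.randrange(0, P)
        return lambda x: ((a * x + b) % P) % m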
Fast and Powerful Hashing using Tabulation
Randomized algorithms are often enjoyed for their simplicity, but the hash
functions employed to yield the desired probabilistic guarantees are often too
complicated to be practical. Here we survey recent results on how simple
hashing schemes based on tabulation provide unexpectedly strong guarantees.
Simple tabulation hashing dates back to Zobrist [1970]. Keys are viewed as
consisting of $c$ characters and we have precomputed character tables
$h_1,\dots,h_c$ mapping characters to random hash values. A key $x=(x_1,\dots,x_c)$
is hashed to $h_1(x_1)\oplus h_2(x_2)\oplus\cdots\oplus h_c(x_c)$. This scheme is
very fast with character tables in cache. While simple tabulation is not even
4-independent, it does provide many of the guarantees that are normally
obtained via higher independence, e.g., linear probing and Cuckoo hashing.
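Concretely, a simple tabulation hash for, say, 32-bit keys split into $c=4$ 8-bit characters can be sketched as below; the key width and table sizes are illustrative choices, not prescriptions from the survey.

    import random

    def make_simple_tabulation(c=4, char_bits=8, out_bits=32, seed=0):
        """Simple tabulation: XOR of c precomputed random table lookups."""
        rng = random.Random(seed)
        tables = [[rng.getrandbits(out_bits) for _ in range(1 << char_bits)]
                  for _ in range(c)]
        mask = (1 << char_bits) - 1

        def h(x):
            out = 0
            for i in range(c):
                out ^= tables[i][(x >> (i * char_bits)) & mask]  # h_i(x_i)
            return out

        return h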
Next we consider twisted tabulation where one input character is "twisted" in
a simple way. The resulting hash function has powerful distributional
properties: Chernoff-Hoeffding type tail bounds and a very small bias for
min-wise hashing. This also yields an extremely fast pseudo-random number
generator that is provably good for many classic randomized algorithms and
data-structures.
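The twist can be sketched as a small modification of the simple tabulation code above: the tables for all but one character additionally emit a "twister" character that is XORed into the remaining character before its final lookup. Which character is twisted and the exact table layout are implementation details assumed here for illustration.

    import random

    def make_twisted_tabulation(c=4, char_bits=8, out_bits=32, seed=0):
        """Twisted tabulation (sketch): characters 1..c-1 also emit a twister
        that is XORed into character 0 before its final table lookup."""
        rng = random.Random(seed)
        size = 1 << char_bits
        mask = size - 1
        # For characters 1..c-1 each table entry is (twister character, hash value).
        tails = [[(rng.getrandbits(char_bits), rng.getrandbits(out_bits))
                  for _ in range(size)] for _ in range(c - 1)]
        head = [rng.getrandbits(out_bits) for _ in range(size)]

        def h(x):
            twist, out = 0, 0
            for i in range(1, c):
                t, v = tails[i - 1][(x >> (i * char_bits)) & mask]
                twist ^= t
                out ^= v
            return out ^ head[(x & mask) ^ twist]  # twisted lookup of character 0

        return h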
Finally, we consider double tabulation where we compose two simple tabulation
functions, applying one to the output of the other, and show that this yields
very high independence in the classic framework of Carter and Wegman [1977]. In
fact, w.h.p., for a given set of size proportional to that of the space
consumed, double tabulation gives fully-random hashing. We also mention some
more elaborate tabulation schemes getting near-optimal independence for given
time and space.
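A rough sketch of the composition, reusing make_simple_tabulation from above; the intermediate string length $d$ is an arbitrary illustrative parameter (the analysis uses an intermediate string somewhat longer than the input key).

    def make_double_tabulation(c=4, d=6, char_bits=8, seed=0):
        """Double tabulation (sketch): h2 o h1, where h1 maps c-character keys
        to d-character strings and h2 is simple tabulation on those strings."""
        h1 = make_simple_tabulation(c=c, char_bits=char_bits,
                                    out_bits=d * char_bits, seed=seed)
        h2 = make_simple_tabulation(c=d, char_bits=char_bits,
                                    out_bits=32, seed=seed + 1)
        return lambda x: h2(h1(x))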
While these tabulation schemes are all easy to implement and use, their
analysis is not.
Fast hashing with Strong Concentration Bounds
Previous work on tabulation hashing by Patrascu and Thorup from STOC'11 on
simple tabulation and from SODA'13 on twisted tabulation offered Chernoff-style
concentration bounds on hash based sums, e.g., the number of balls/keys hashing
to a given bin, but under some quite severe restrictions on the expected values
of these sums. The basic idea in tabulation hashing is to view a key as
consisting of $c$ characters, e.g., a 64-bit key as $c=8$ characters of
8 bits. The character domain $\Sigma$ should be small enough that character
tables of size $|\Sigma|$ fit in fast cache. The schemes then use $O(1)$ tables
of this size, so the space of tabulation hashing is $O(|\Sigma|)$. However, the
concentration bounds by Patrascu and Thorup only apply if the expected sums are
$\ll|\Sigma|$.
To see the problem, consider the very simple case where we use tabulation
hashing to throw $n$ balls into $m$ bins and want to analyse the number of
balls in a given bin. With their concentration bounds, we are fine if $m=n$,
for then the expected value is $1$. However, if $m=2$, as when tossing $n$
unbiased coins, the expected value $n/2$ is $\gg|\Sigma|$ for large data sets,
e.g., data sets that do not fit in fast cache.
To handle expectations that go beyond the limits of our small space, we need
a much more advanced analysis of simple tabulation, plus a new tabulation
technique that we call \emph{tabulation-permutation} hashing which is at most
twice as slow as simple tabulation. No other hashing scheme of comparable speed
offers similar Chernoff-style concentration bounds.
Comment: 54 pages, 3 figures. An extended abstract appeared at the 52nd Annual ACM Symposium on Theory of Computing (STOC 2020).
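As we read it, tabulation-permutation hashing applies simple tabulation and then sends each character of the resulting hash value through its own uniformly random permutation of the character domain. The sketch below (reusing make_simple_tabulation from the survey above) is an illustration of that reading, not the authors' reference implementation.

    import random

    def make_tabulation_permutation(c=4, d=4, char_bits=8, seed=0):
        """Tabulation-permutation (sketch): simple tabulation into d output
        characters, then each output character is passed through its own
        uniformly random permutation of the character domain."""
        rng = random.Random(seed)
        size = 1 << char_bits
        mask = size - 1
        h = make_simple_tabulation(c=c, char_bits=char_bits,
                                   out_bits=d * char_bits, seed=seed)
        perms = []
        for _ in range(d):
            p = list(range(size))
            rng.shuffle(p)
            perms.append(p)

        def g(x):
            y = h(x)
            out = 0
            for j in range(d):
                out |= perms[j][(y >> (j * char_bits)) & mask] << (j * char_bits)
            return out

        return g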
Power of $d$ Choices with Simple Tabulation
Suppose that we are to place $m$ balls into $n$ bins sequentially using the
$d$-choice paradigm: For each ball we are given a choice of $d$ bins, according
to $d$ hash functions $h_1,\dots,h_d$, and we place the ball in the least loaded
of these bins, breaking ties arbitrarily. Our interest is in the number of balls
in the fullest bin after all $m$ balls have been placed.
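A minimal simulation of the $d$-choice process just described, with Python's random module standing in for the hash functions $h_1,\dots,h_d$:

    import random

    def d_choice_max_load(m, n, d, seed=0):
        """Place m balls into n bins; each ball goes to the least loaded of d
        independently chosen bins, ties broken arbitrarily. Returns the maximum
        load. The random choices stand in for the hash functions h_1, ..., h_d."""
        rng = random.Random(seed)
        load = [0] * n
        for _ in range(m):
            choices = [rng.randrange(n) for _ in range(d)]
            best = min(choices, key=lambda b: load[b])
            load[best] += 1
        return max(load)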
Azar et al. [STOC'94] proved that when $m=O(n)$ and when the hash functions
are fully random, the maximum load is at most $\frac{\lg\lg n}{\lg d}+O(1)$
whp (i.e. with probability $1-O(n^{-\gamma})$ for any choice of $\gamma$).
In this paper we suppose that the $h_1,\dots,h_d$ are simple tabulation hash
functions. Generalising a result by Dahlgaard et al. [SODA'16] we show that for
an arbitrary constant $\gamma\geq 1$ the maximum load is $O(\lg\lg n)$ whp, and
that the expected maximum load is at most $\frac{\lg\lg n}{\lg d}+O(1)$. We
further show that by using a simple tie-breaking algorithm introduced by
V\"ocking [J.ACM'03] the expected maximum load drops to $\frac{\lg\lg n}{d\lg\varphi_d}+O(1)$, where $\varphi_d$ is the rate of growth of the $d$-ary
Fibonacci numbers. Both of these expected bounds match those of the fully
random setting.
The analysis by Dahlgaard et al. relies on a proof by P\u{a}tra\c{s}cu and
Thorup [J.ACM'11] concerning the use of simple tabulation for cuckoo hashing.
We need here a generalisation to $d$ hash functions, but the original proof
is an 8-page tour de force of ad-hoc arguments that do not appear to
generalise. Our main technical contribution is a shorter, simpler and more
accessible proof of the result by P\u{a}tra\c{s}cu and Thorup, where the
relevant parts generalise nicely to the analysis of $d$ choices.
Comment: Accepted at ICALP 2018.
Load Balancing with Dynamic Set of Balls and Bins
In dynamic load balancing, we wish to distribute balls into bins in an
environment where both balls and bins can be added and removed. We want to
minimize the maximum load of any bin but we also want to minimize the number of
balls and bins affected when adding or removing a ball or a bin. We want a
hashing-style solution where we given the ID of a ball can find its bin
efficiently.
We are given a balancing parameter $c=1+\epsilon$, where $\epsilon\in(0,1)$.
With $m$ and $n$ the current numbers of balls and bins, we want no bin with
load above $C=\lceil c\,m/n\rceil$, referred to as the capacity of the bins.
We present a scheme where we can locate a ball checking $O(1/\epsilon)$ bins in expectation. When inserting or deleting a ball, we expect
to move $O(1/\epsilon)$ balls, and when inserting or deleting a bin, we expect
to move $O(C/\epsilon)$ balls. Previous bounds were off by a factor
$1/\epsilon$.
These bounds are best possible when $C=O(1)$, but for larger $C$, we can do
much better: Let $f=\epsilon C$ if $C\leq\log(1/\epsilon)$,
$f=\epsilon\sqrt{C}\cdot\sqrt{\log(1/(\epsilon\sqrt{C}))}$ if $\log(1/\epsilon)\leq C\leq\frac{1}{2\epsilon^2}$, and $f=1$ if $C\geq\frac{1}{2\epsilon^2}$. We show that we expect to move $O(1/f)$ balls when
inserting or deleting a ball, and $O(C/f)$ balls when inserting or deleting a
bin.
For the bounds with larger $C$, we first have to resolve a much simpler
probabilistic problem. Place $m$ balls in $n$ bins of capacity $C$, one ball at
a time. Each ball picks a uniformly random non-full bin. We show that in
expectation and with high probability, the fraction of non-full bins is
$\Theta(f)$. Then the expected number of bins that a new ball would have to
visit to find one that is not full is $\Theta(1/f)$. As it turns out, we obtain
the same complexity in our more complicated scheme where both balls and bins
can be added and removed.
Comment: Accepted at STOC'21.
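The simpler probabilistic problem invites a direct simulation; the sketch below follows the process described above with roughly $nC/(1+\epsilon)$ balls and reports the final fraction of non-full bins (all parameters are arbitrary illustrative choices).

    import random

    def fraction_non_full(n_bins, capacity, eps, seed=0):
        """Throw roughly n_bins*capacity/(1+eps) balls, each into a uniformly
        random non-full bin, and return the final fraction of non-full bins."""
        rng = random.Random(seed)
        m_balls = int(n_bins * capacity / (1 + eps))
        load = [0] * n_bins
        non_full = list(range(n_bins))        # bins that still have room
        for _ in range(m_balls):
            i = rng.randrange(len(non_full))  # uniformly random non-full bin
            b = non_full[i]
            load[b] += 1
            if load[b] == capacity:           # bin is now full: drop it
                non_full[i] = non_full[-1]
                non_full.pop()
        return len(non_full) / n_bins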