Simple Tabulation, Fast Expanders, Double Tabulation, and High Independence
Simple tabulation dates back to Zobrist in 1970. Keys are viewed as c
characters from some alphabet A. We initialize c tables h_0, ..., h_{c-1}
mapping characters to random hash values. A key x=(x_0, ..., x_{c-1}) is hashed
to h_0[x_0] xor...xor h_{c-1}[x_{c-1}]. The scheme is extremely fast when the
character hash tables h_i are in cache. Simple tabulation hashing is not
4-independent, but we show that if we apply it twice, then we get high
independence. First we hash to intermediate keys that are 6 times longer than
the original keys, and then we hash the intermediate keys to the final hash
values.
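To make the scheme concrete, here is a minimal C sketch of simple tabulation for 32-bit keys split into c = 4 characters of 8 bits; the parameters, names, and the crude rand()-based table initialization are our illustrative choices, not from the paper.

```c
#include <stdint.h>
#include <stdlib.h>

#define C     4      /* characters per key */
#define ALPHA 256    /* alphabet size |A| = 2^8 */

static uint64_t T[C][ALPHA];   /* c character tables of random hash values */

/* Fill tables with random values (illustrative only; a real implementation
   would use a stronger source of randomness than rand()). */
static void tab_init(void) {
    for (int i = 0; i < C; i++)
        for (int j = 0; j < ALPHA; j++)
            T[i][j] = ((uint64_t)rand() << 32) ^ (uint64_t)rand();
}

/* h(x) = h_0[x_0] xor ... xor h_{c-1}[x_{c-1}] */
static uint64_t tab_hash(uint32_t x) {
    uint64_t h = 0;
    for (int i = 0; i < C; i++)
        h ^= T[i][(x >> (8 * i)) & 0xFF];
    return h;
}
```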
The intermediate keys have d=6c characters from A. We can view the hash
function as a degree d bipartite graph with keys on one side, each with edges
to d output characters. We show that this graph has nice expansion properties,
and from that we get that with another level of simple tabulation on the
intermediate keys, the composition is a highly independent hash function. The
independence we get is |A|^{Omega(1/c)}.
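The composition can be sketched in the same style (a hedged illustration with our own types and sizes, assuming all tables are filled with random bits as above): the first level expands each key into an intermediate key of d = 6c characters, and the second level is plain simple tabulation over those characters.

```c
#include <stdint.h>

#define C2    4          /* input characters */
#define D     (6 * C2)   /* intermediate characters, d = 6c */
#define SIGMA 256        /* alphabet size |A| */

static uint8_t  T1[C2][SIGMA][D]; /* level 1: char -> d output characters */
static uint64_t T2[D][SIGMA];     /* level 2: simple tabulation tables */

/* Double tabulation: derive the intermediate key y by XORing the
   first-level character strings, then simple-tabulate y. */
static uint64_t double_tab(uint32_t x) {
    uint8_t y[D] = {0};
    for (int i = 0; i < C2; i++)
        for (int j = 0; j < D; j++)
            y[j] ^= T1[i][(x >> (8 * i)) & 0xFF][j];
    uint64_t h = 0;
    for (int j = 0; j < D; j++)
        h ^= T2[j][y[j]];
    return h;
}
```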
Our space is O(c|A|) and the hash function is evaluated in O(c) time. Siegel
[FOCS'89, SICOMP'04] proved that with this space, if the hash function is
evaluated in o(c) time, then the independence can only be o(c), so our
evaluation time is best possible for Omega(c) independence---our independence
is much higher if c=|A|^{o(1)}.
Siegel used O(c)^c evaluation time to get the same independence with similar
space. Siegel's main focus was c=O(1), but we are exponentially faster when
c=omega(1).
Applying our scheme recursively, we can increase our independence to
|A|^{Omega(1)} with o(c^{log c}) evaluation time. Compared with Siegel's scheme
this is both faster and achieves higher independence.
Our scheme is easy to implement, and it does provide realistic
implementations of 100-independent hashing for, say, 32- and 64-bit keys.
Approximately Minwise Independence with Twisted Tabulation
A random hash function h is eps-minwise if for any set S, |S| = n, and
element x in S, Pr[h(x) = min h(S)] = (1 ± eps)/n.
Minwise hash functions with low bias eps have widespread applications
within similarity estimation.
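The connection to similarity estimation is the classic MinHash estimator: for an (approximately) minwise h, Pr[min h(A) = min h(B)] is (approximately) the Jaccard similarity |A ∩ B| / |A ∪ B|. A minimal C sketch, where the array of k independently seeded hash functions and all names are our illustrative assumptions:

```c
#include <stdint.h>
#include <stddef.h>

/* Estimate Jaccard similarity by comparing minimum hash values under
   k independently seeded (approximately) minwise hash functions. */
static double jaccard_estimate(uint64_t (*hash[])(uint32_t), size_t k,
                               const uint32_t *A, size_t na,
                               const uint32_t *B, size_t nb) {
    size_t agree = 0;
    for (size_t r = 0; r < k; r++) {
        uint64_t ma = UINT64_MAX, mb = UINT64_MAX;
        for (size_t i = 0; i < na; i++) {
            uint64_t v = hash[r](A[i]);
            if (v < ma) ma = v;
        }
        for (size_t i = 0; i < nb; i++) {
            uint64_t v = hash[r](B[i]);
            if (v < mb) mb = v;
        }
        agree += (ma == mb);   /* matches with prob. ~ Jaccard similarity */
    }
    return (double)agree / (double)k;
}
```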
Hashing from a universe [u], the twisted tabulation hashing of
P\v{a}tra\c{s}cu and Thorup [SODA'13] makes c = O(1) lookups in tables of size
u^{1/c}. Twisted tabulation was invented to get good concentration for
hashing-based sampling. Here we show that twisted tabulation yields Õ(1/u^{1/c})-minwise hashing.
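Structurally, twisted tabulation is a small change to simple tabulation. The sketch below follows our reading of the SODA'13 scheme, with illustrative parameters and table layout: the tail characters are looked up to produce both a hash value and a "twist" character, and the twist is XORed into the head character before its final lookup.

```c
#include <stdint.h>

#define TC   4     /* characters per key */
#define TSIG 256   /* alphabet size */

static uint64_t Tval[TC][TSIG];  /* tail tables: hash-value part */
static uint8_t  Ttw[TC][TSIG];   /* tail tables: twist part */
static uint64_t Thead[TSIG];     /* head table */

static uint64_t twisted_tab(uint32_t x) {
    uint64_t h = 0;
    uint8_t  t = 0;
    for (int i = 1; i < TC; i++) {            /* hash the tail characters */
        uint8_t ch = (x >> (8 * i)) & 0xFF;
        h ^= Tval[i][ch];
        t ^= Ttw[i][ch];
    }
    uint8_t head = x & 0xFF;
    return h ^ Thead[head ^ t];               /* "twist" the head lookup */
}
```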
In the classic independence paradigm of Wegman and Carter [FOCS'79], eps-minwise hashing requires Omega(log(1/eps))-independence [Indyk
SODA'99]. P\v{a}tra\c{s}cu and Thorup [STOC'11] had shown that simple
tabulation, using the same space and lookups, yields Õ(1/n^{1/c})-minwise
independence, which is good for large sets, but useless for small sets. Our
analysis uses some of the same methods, but is much cleaner, bypassing a
complicated induction argument.
Fast and Powerful Hashing using Tabulation
Randomized algorithms are often enjoyed for their simplicity, but the hash
functions employed to yield the desired probabilistic guarantees are often too
complicated to be practical. Here we survey recent results on how simple
hashing schemes based on tabulation provide unexpectedly strong guarantees.
Simple tabulation hashing dates back to Zobrist [1970]. Keys are viewed as
consisting of c characters and we have precomputed character tables h_1, ..., h_c
mapping characters to random hash values. A key x = (x_1, ..., x_c)
is hashed to h_1[x_1] xor h_2[x_2] xor ... xor h_c[x_c]. This scheme is
very fast with character tables in cache. While simple tabulation is not even
4-independent, it does provide many of the guarantees that are normally
obtained via higher independence, e.g., linear probing and Cuckoo hashing.
Next we consider twisted tabulation where one input character is "twisted" in
a simple way. The resulting hash function has powerful distributional
properties: Chernoff-Hoeffding type tail bounds and a very small bias for
min-wise hashing. This also yields an extremely fast pseudo-random number
generator that is provably good for many classic randomized algorithms and
data-structures.
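The generator idea is simply to hash a counter, so each pseudo-random number costs a few cache-resident lookups; a hedged sketch reusing the twisted_tab function sketched earlier (the survey's actual generator is more refined):

```c
/* Pseudo-random numbers by hashing 0, 1, 2, ... with twisted tabulation. */
static uint32_t prng_counter = 0;

static uint64_t prng_next(void) {
    return twisted_tab(prng_counter++);
}
```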
Finally, we consider double tabulation where we compose two simple tabulation
functions, applying one to the output of the other, and show that this yields
very high independence in the classic framework of Carter and Wegman [1977]. In
fact, w.h.p., for a given set of size proportional to that of the space
consumed, double tabulation gives fully-random hashing. We also mention some
more elaborate tabulation schemes getting near-optimal independence for given
time and space.
While these tabulation schemes are all easy to implement and use, their
analysis is not.
Fast Hashing with Strong Concentration Bounds
Previous work on tabulation hashing by Patrascu and Thorup from STOC'11 on
simple tabulation and from SODA'13 on twisted tabulation offered Chernoff-style
concentration bounds on hash-based sums, e.g., the number of balls/keys hashing
to a given bin, but under some quite severe restrictions on the expected values
of these sums. The basic idea in tabulation hashing is to view a key as
consisting of c characters, e.g., a 64-bit key as c = 8 characters of
8 bits. The character domain Sigma should be small enough that character
tables of size |Sigma| fit in fast cache. The schemes then use c tables
of this size, so the space of tabulation hashing is O(c|Sigma|). However, the
concentration bounds by Patrascu and Thorup only apply if the expected sums are
<< |Sigma|.
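Concretely (our worked numbers, not the paper's): with 8-bit characters we have |Sigma| = 256, and a 64-bit key gives c = 8 tables of 256 64-bit entries, i.e., 2 KB per table and 16 KB in total, which fits comfortably in the L1 cache of most modern CPUs; the restriction then requires the expected sums to stay well below 256.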
To see the problem, consider the very simple case where we use tabulation
hashing to throw n balls into m bins and want to analyse the number of
balls in a given bin. With their concentration bounds, we are fine if m = n,
for then the expected value is 1. However, if m = 2, as when tossing n
unbiased coins, the expected value n/2 is >> |Sigma| for large data sets,
e.g., data sets that do not fit in fast cache.
To handle expectations that go beyond the limits of our small space, we need
a much more advanced analysis of simple tabulation, plus a new tabulation
technique that we call \emph{tabulation-permutation} hashing which is at most
twice as slow as simple tabulation. No other hashing scheme of comparable speed
offers similar Chernoff-style concentration bounds.Comment: 54 pages, 3 figures. An extended abstract appeared at the 52nd Annual
ACM Symposium on Theory of Computing (STOC20
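Structurally, tabulation-permutation hashes with simple tabulation and then sends each character of the output through its own random permutation. A hedged C sketch of that shape, reusing tab_hash from the first section; the layout (permutations stored pre-shifted into their output positions) and all parameters are our illustrative choices:

```c
#include <stdint.h>

#define OUTC 8   /* 64-bit hash value viewed as 8 output characters */

/* P[j][a] holds pi_j(a), for pi_j a random permutation of [256],
   pre-shifted left by 8*j bits so characters can be OR'ed into place. */
static uint64_t P[OUTC][256];

static uint64_t tab_perm(uint32_t x) {
    uint64_t y = tab_hash(x);   /* simple tabulation, as sketched earlier */
    uint64_t h = 0;
    for (int j = 0; j < OUTC; j++)
        h |= P[j][(y >> (8 * j)) & 0xFF];
    return h;
}
```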
Power of d Choices with Simple Tabulation
Suppose that we are to place m balls into n bins sequentially using the
d-choice paradigm: For each ball we are given a choice of d bins, according
to d hash functions h_1, ..., h_d, and we place the ball in the least loaded
of these bins, breaking ties arbitrarily. Our interest is in the number of balls
in the fullest bin after all m balls have been placed.
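One insertion step of this process can be sketched as follows (our illustrative C; mapping hash values to bins by modulo is an assumption, not from the paper):

```c
#include <stdint.h>
#include <stddef.h>

/* Place one ball: probe the d bins chosen by the d hash functions and
   increment the load of the least loaded one (first probe wins ties). */
static void place_ball(uint32_t ball, size_t *load, size_t nbins,
                       uint64_t (*h[])(uint32_t), int d) {
    size_t best = h[0](ball) % nbins;
    for (int i = 1; i < d; i++) {
        size_t b = h[i](ball) % nbins;
        if (load[b] < load[best]) best = b;
    }
    load[best]++;
}
```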
Azar et al. [STOC'94] proved that when m = O(n) and when the hash functions
are fully random, the maximum load is at most lg lg n / lg d + O(1)
whp (i.e. with probability 1 - O(n^{-gamma}) for any choice of gamma).
In this paper we suppose that the h_1, ..., h_d are simple tabulation hash
functions. Generalising a result by Dahlgaard et al. [SODA'16] we show that for
an arbitrary constant gamma the maximum load is O(lg lg n) whp, and
that the expected maximum load is at most lg lg n / lg d + O(1). We
further show that by using a simple tie-breaking algorithm introduced by
V\"ocking [J.ACM'03] the expected maximum load drops to
lg lg n / (d lg phi_d) + O(1), where phi_d is the rate of growth of the d-ary
Fibonacci numbers. Both of these expected bounds match those of the fully
random setting.
The analysis by Dahlgaard et al. relies on a proof by P\u{a}tra\c{s}cu and
Thorup [J.ACM'11] concerning the use of simple tabulation for cuckoo hashing.
We need here a generalisation to d hash functions, but the original proof
is an 8-page tour de force of ad-hoc arguments that do not appear to
generalise. Our main technical contribution is a shorter, simpler and more
accessible proof of the result by P\u{a}tra\c{s}cu and Thorup, where the
relevant parts generalise nicely to the analysis of d choices.
Quicksort, Largest Bucket, and Min-Wise Hashing with Limited Independence
Randomized algorithms and data structures are often analyzed under the
assumption of access to a perfect source of randomness. The most fundamental
metric used to measure how "random" a hash function or a random number
generator is, is its independence: a sequence of random variables is said to be
k-independent if every variable is uniform and every size-k subset is
independent. In this paper we consider three classic algorithms under limited
independence. We provide new bounds for randomized quicksort, min-wise hashing
and largest bucket size under limited independence. Our results can be
summarized as follows.
- Randomized quicksort. When pivot elements are computed using a
5-independent hash function, Karloff and Raghavan, J.ACM'93 showed O(n log n)
expected worst-case running time for a special version of quicksort.
We improve upon this, showing that the same running time is achieved with only
4-independence.
- Min-wise hashing. For a set A, consider the probability of a particular
element being mapped to the smallest hash value. It is known that
O(log n)-independence implies the optimal probability O(1/n). Broder et al.,
STOC'98 showed that 2-independence implies it is O(1/sqrt(|A|)). We show
a matching lower bound as well as new tight bounds for 3- and 4-independent
hash functions.
- Largest bucket. We consider the case where n balls are distributed to n
buckets using a 2-independent hash function and analyze the largest bucket
size. Alon et al., STOC'97 showed that there exists a 2-independent hash
function implying a bucket of size Omega(sqrt(n)). We generalize the
bound, providing a k-independent family of functions that imply size Omega(n^{1/k}).
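For contrast with the tabulation schemes above, the textbook way to realize k-independence is a degree-(k-1) polynomial with random coefficients over a prime field (a standard construction, not specific to this paper). A hedged sketch with an illustrative Mersenne prime; it assumes a compiler supporting the unsigned __int128 extension and coefficients drawn uniformly from [0, PRIME):

```c
#include <stdint.h>

#define PRIME ((uint64_t)0x1FFFFFFFFFFFFFFF)  /* Mersenne prime 2^61 - 1 */
#define KIND  4                               /* independence k */

static uint64_t coef[KIND];   /* random coefficients in [0, PRIME) */

/* Horner evaluation of a random degree-(k-1) polynomial over Z_PRIME:
   k-independent on keys x < PRIME. */
static uint64_t poly_hash(uint64_t x) {
    unsigned __int128 h = 0;
    for (int i = KIND - 1; i >= 0; i--)
        h = (h * x + coef[i]) % PRIME;
    return (uint64_t)h;
}
```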