264 research outputs found
Fast and Powerful Hashing using Tabulation
Randomized algorithms are often enjoyed for their simplicity, but the hash
functions employed to yield the desired probabilistic guarantees are often too
complicated to be practical. Here we survey recent results on how simple
hashing schemes based on tabulation provide unexpectedly strong guarantees.
Simple tabulation hashing dates back to Zobrist [1970]. Keys are viewed as
consisting of characters and we have precomputed character tables
mapping characters to random hash values. A key
is hashed to . This schemes is
very fast with character tables in cache. While simple tabulation is not even
4-independent, it does provide many of the guarantees that are normally
obtained via higher independence, e.g., linear probing and Cuckoo hashing.
Next we consider twisted tabulation where one input character is "twisted" in
a simple way. The resulting hash function has powerful distributional
properties: Chernoff-Hoeffding type tail bounds and a very small bias for
min-wise hashing. This also yields an extremely fast pseudo-random number
generator that is provably good for many classic randomized algorithms and
data-structures.
Finally, we consider double tabulation where we compose two simple tabulation
functions, applying one to the output of the other, and show that this yields
very high independence in the classic framework of Carter and Wegman [1977]. In
fact, w.h.p., for a given set of size proportional to that of the space
consumed, double tabulation gives fully-random hashing. We also mention some
more elaborate tabulation schemes getting near-optimal independence for given
time and space.
While these tabulation schemes are all easy to implement and use, their
analysis is not
Fast hashing with Strong Concentration Bounds
Previous work on tabulation hashing by Patrascu and Thorup from STOC'11 on
simple tabulation and from SODA'13 on twisted tabulation offered Chernoff-style
concentration bounds on hash based sums, e.g., the number of balls/keys hashing
to a given bin, but under some quite severe restrictions on the expected values
of these sums. The basic idea in tabulation hashing is to view a key as
consisting of characters, e.g., a 64-bit key as characters of
8-bits. The character domain should be small enough that character
tables of size fit in fast cache. The schemes then use tables
of this size, so the space of tabulation hashing is . However, the
concentration bounds by Patrascu and Thorup only apply if the expected sums are
.
To see the problem, consider the very simple case where we use tabulation
hashing to throw balls into bins and want to analyse the number of
balls in a given bin. With their concentration bounds, we are fine if ,
for then the expected value is . However, if , as when tossing
unbiased coins, the expected value is for large data sets,
e.g., data sets that do not fit in fast cache.
To handle expectations that go beyond the limits of our small space, we need
a much more advanced analysis of simple tabulation, plus a new tabulation
technique that we call \emph{tabulation-permutation} hashing which is at most
twice as slow as simple tabulation. No other hashing scheme of comparable speed
offers similar Chernoff-style concentration bounds.Comment: 54 pages, 3 figures. An extended abstract appeared at the 52nd Annual
ACM Symposium on Theory of Computing (STOC20
The universality of iterated hashing over variable-length strings
Iterated hash functions process strings recursively, one character at a time.
At each iteration, they compute a new hash value from the preceding hash value
and the next character. We prove that iterated hashing can be pairwise
independent, but never 3-wise independent. We show that it can be almost
universal over strings much longer than the number of hash values; we bound the
maximal string length given the collision probability
- …