Previous work on tabulation hashing by Patrascu and Thorup from STOC'11 on
simple tabulation and from SODA'13 on twisted tabulation offered Chernoff-style
concentration bounds on hash based sums, e.g., the number of balls/keys hashing
to a given bin, but under some quite severe restrictions on the expected values
of these sums. The basic idea in tabulation hashing is to view a key as
consisting of c=O(1) characters, e.g., a 64-bit key as c=8 characters of
8-bits. The character domain Σ should be small enough that character
tables of size ∣Σ∣ fit in fast cache. The schemes then use O(1) tables
of this size, so the space of tabulation hashing is O(∣Σ∣). However, the
concentration bounds by Patrascu and Thorup only apply if the expected sums are
≪∣Σ∣.
To see the problem, consider the very simple case where we use tabulation
hashing to throw n balls into m bins and want to analyse the number of
balls in a given bin. With their concentration bounds, we are fine if n=m,
for then the expected value is 1. However, if m=2, as when tossing n
unbiased coins, the expected value n/2 is ≫∣Σ∣ for large data sets,
e.g., data sets that do not fit in fast cache.
To handle expectations that go beyond the limits of our small space, we need
a much more advanced analysis of simple tabulation, plus a new tabulation
technique that we call \emph{tabulation-permutation} hashing which is at most
twice as slow as simple tabulation. No other hashing scheme of comparable speed
offers similar Chernoff-style concentration bounds.Comment: 54 pages, 3 figures. An extended abstract appeared at the 52nd Annual
ACM Symposium on Theory of Computing (STOC20