
Fast hashing with Strong Concentration Bounds

Abstract

Previous work on tabulation hashing by Patrascu and Thorup from STOC'11 on simple tabulation and from SODA'13 on twisted tabulation offered Chernoff-style concentration bounds on hash-based sums, e.g., the number of balls/keys hashing to a given bin, but under some quite severe restrictions on the expected values of these sums. The basic idea in tabulation hashing is to view a key as consisting of $c=O(1)$ characters, e.g., a 64-bit key as $c=8$ characters of 8 bits. The character domain $\Sigma$ should be small enough that character tables of size $|\Sigma|$ fit in fast cache. The schemes then use $O(1)$ tables of this size, so the space of tabulation hashing is $O(|\Sigma|)$. However, the concentration bounds by Patrascu and Thorup only apply if the expected sums are $\ll |\Sigma|$. To see the problem, consider the very simple case where we use tabulation hashing to throw $n$ balls into $m$ bins and want to analyse the number of balls in a given bin. With their concentration bounds, we are fine if $n=m$, for then the expected value is $1$. However, if $m=2$, as when tossing $n$ unbiased coins, the expected value $n/2$ is $\gg |\Sigma|$ for large data sets, e.g., data sets that do not fit in fast cache. To handle expectations that go beyond the limits of our small space, we need a much more advanced analysis of simple tabulation, plus a new tabulation technique that we call \emph{tabulation-permutation} hashing, which is at most twice as slow as simple tabulation. No other hashing scheme of comparable speed offers similar Chernoff-style concentration bounds.

Comment: 54 pages, 3 figures. An extended abstract appeared at the 52nd Annual ACM Symposium on Theory of Computing (STOC20
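To make the character-table idea concrete, the following is a minimal sketch of simple tabulation hashing in its standard form: the key is split into $c$ characters, each character indexes its own table of random values, and the looked-up values are XORed together. It covers only simple tabulation, not the tabulation-permutation scheme introduced in the paper. The 32-bit output width, the use of Python's `random` module as the source of table entries, and the names `make_tables`, `simple_tabulation`, and `bin_counts` are assumptions made for illustration and do not come from the paper.

```python
import random

C = 8                   # characters per key (c = 8, as in the 64-bit example)
CHAR_BITS = 8           # bits per character
SIGMA = 1 << CHAR_BITS  # size of the character domain, |Sigma| = 256
OUT_BITS = 32           # width of a hash value (assumption for this sketch)


def make_tables(seed):
    """Build C tables, each holding |Sigma| random OUT_BITS-bit values."""
    rng = random.Random(seed)
    return [[rng.getrandbits(OUT_BITS) for _ in range(SIGMA)]
            for _ in range(C)]


def simple_tabulation(key, tables):
    """XOR together one table entry per character of a 64-bit key."""
    h = 0
    for i in range(C):
        ch = (key >> (CHAR_BITS * i)) & (SIGMA - 1)  # i-th character
        h ^= tables[i][ch]
    return h


def bin_counts(keys, tables, m):
    """Throw each key into one of m bins and return the bin loads."""
    counts = [0] * m
    for k in keys:
        counts[simple_tabulation(k, tables) % m] += 1
    return counts
```

Calling `bin_counts(range(n), make_tables(seed), 2)` reproduces the coin-tossing setting with $m=2$, where the expected load $n/2$ of each bin far exceeds $|\Sigma| = 256$ once $n$ is large.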

    Last time updated on 10/08/2021