5 research outputs found
Fast hashing with Strong Concentration Bounds
Previous work on tabulation hashing by Patrascu and Thorup from STOC'11 on
simple tabulation and from SODA'13 on twisted tabulation offered Chernoff-style
concentration bounds on hash based sums, e.g., the number of balls/keys hashing
to a given bin, but under some quite severe restrictions on the expected values
of these sums. The basic idea in tabulation hashing is to view a key as
consisting of $c$ characters, e.g., a 64-bit key as $c=8$ characters of
8 bits. The character domain $\Sigma$ should be small enough that character
tables of size $|\Sigma|$ fit in fast cache. The schemes then use $O(1)$ tables
of this size, so the space of tabulation hashing is $O(|\Sigma|)$. However, the
concentration bounds by Patrascu and Thorup only apply if the expected sums are
$\ll |\Sigma|$.
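To make the lookup structure concrete, here is a minimal sketch of simple tabulation hashing in Python; the 64-bit output width, the seed, and the table layout are illustrative choices, not parameters fixed by the paper.

```python
import random

C = 8          # number of characters per key
CHAR_BITS = 8  # bits per character, so |Sigma| = 2^8 = 256

# One fully random table per character position, each with |Sigma| entries.
rng = random.Random(42)  # fixed seed for reproducibility (illustrative)
TABLES = [[rng.getrandbits(64) for _ in range(1 << CHAR_BITS)] for _ in range(C)]

def simple_tabulation(key: int) -> int:
    """Hash a 64-bit key: XOR together one table lookup per character."""
    h = 0
    for i in range(C):
        char = (key >> (i * CHAR_BITS)) & ((1 << CHAR_BITS) - 1)
        h ^= TABLES[i][char]
    return h

print(hex(simple_tabulation(0x0123456789ABCDEF)))
```

The whole hash is $c$ cache-resident lookups and XORs, which is what makes the scheme fast in practice.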
To see the problem, consider the very simple case where we use tabulation
hashing to throw $n$ balls into $m$ bins and want to analyse the number of
balls in a given bin. With their concentration bounds, we are fine if
$n \le |\Sigma|$, for then the expected value is $n/m \le |\Sigma|/m$. However,
if $m=2$, as when tossing $n$ unbiased coins, the expected value $n/2$ is
$\gg |\Sigma|$ for large data sets, e.g., data sets that do not fit in fast cache.
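To get a feel for the scale of this restriction, here is a small worked instance with illustrative numbers not taken from the abstract (8-bit characters, $n = 2^{30}$ balls, $m = 2$ bins):

```latex
% Illustrative numbers: 8-bit characters, so |\Sigma| = 2^8 = 256;
% throw n = 2^{30} balls into m = 2 bins.
\[
  \mathrm{E}[\text{balls in a given bin}] \;=\; \frac{n}{m}
  \;=\; \frac{2^{30}}{2} \;=\; 2^{29} \;\gg\; |\Sigma| = 2^{8},
\]
% so the earlier bounds, which require the expectation to be \ll |\Sigma|,
% do not apply, even though the character tables still fit in cache.
```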
To handle expectations that go beyond the limits of our small space, we need
a much more advanced analysis of simple tabulation, plus a new tabulation
technique that we call \emph{tabulation-permutation} hashing which is at most
twice as slow as simple tabulation. No other hashing scheme of comparable speed
offers similar Chernoff-style concentration bounds.Comment: 54 pages, 3 figures. An extended abstract appeared at the 52nd Annual
ACM Symposium on Theory of Computing (STOC20
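The abstract does not spell the scheme out, but in the paper tabulation-permutation first hashes with simple tabulation and then applies an independent random permutation of the character domain to each character of the output. A minimal Python sketch under that reading (parameters again illustrative):

```python
import random

C = 8          # characters per key and per hash value
CHAR_BITS = 8
SIGMA = 1 << CHAR_BITS

rng = random.Random(7)
TABLES = [[rng.getrandbits(C * CHAR_BITS) for _ in range(SIGMA)] for _ in range(C)]

# One independent random permutation of the character domain per output position.
PERMS = []
for _ in range(C):
    p = list(range(SIGMA))
    rng.shuffle(p)
    PERMS.append(p)

def simple_tabulation(key: int) -> int:
    h = 0
    for i in range(C):
        h ^= TABLES[i][(key >> (i * CHAR_BITS)) & (SIGMA - 1)]
    return h

def tabulation_permutation(key: int) -> int:
    """Simple tabulation, then permute each output character independently."""
    y = simple_tabulation(key)
    out = 0
    for i in range(C):
        char = (y >> (i * CHAR_BITS)) & (SIGMA - 1)
        out |= PERMS[i][char] << (i * CHAR_BITS)
    return out

print(hex(tabulation_permutation(0x0123456789ABCDEF)))
```

The permutation pass costs one extra lookup per character, which is why the scheme is at most twice as slow as simple tabulation.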
A Fair and Memory/Time-efficient Hashmap
There is a large amount of work constructing hashmaps to minimize the number
of collisions. However, to the best of our knowledge no known hashing technique
guarantees group fairness among different groups of items. We are given a set
$P$ of $n$ tuples in $\mathbb{R}^d$, for a constant dimension $d$, and a set of
$k$ groups $\mathcal{G}=\{g_1,\ldots,g_k\}$ such that every
tuple belongs to a unique group. We formally define the fair hashing problem,
introducing the notions of single fairness ($\Pr[h(p)=h(q)\mid p\in g_i,\, q\in P]$
for every $i\in[k]$), pairwise fairness ($\Pr[h(p)=h(q)\mid p,q\in g_i]$ for every
$i\in[k]$), and the well-known collision probability ($\Pr[h(p)=h(q)]$ for
$p,q\in P$). The goal is to construct a hashmap such that the collision probability,
the single fairness, and the pairwise fairness are close to $1/m$, where $m$ is the
number of buckets in the hashmap.
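As one concrete reading of these definitions, the sketch below empirically estimates the three quantities for a fixed bucket assignment; the exact conditioning in the single-fairness estimate is our interpretation of the abstract, not the paper's formal definition.

```python
from itertools import combinations

def fairness_stats(buckets, groups):
    """buckets[i]: bucket of item i; groups[i]: group of item i.
    Returns (collision probability, per-group single fairness,
    per-group pairwise fairness) as empirical collision rates."""
    n = len(buckets)
    pairs = list(combinations(range(n), 2))
    collide = lambda i, j: buckets[i] == buckets[j]

    # Collision probability: over all pairs p, q in P.
    coll = sum(collide(i, j) for i, j in pairs) / len(pairs)

    single, pairwise = {}, {}
    for g in set(groups):
        members = [i for i in range(n) if groups[i] == g]
        # Single fairness for g: pairs with at least one endpoint in g
        # (our reading of "p in g_i, q in P").
        sp = [(i, j) for i, j in pairs if groups[i] == g or groups[j] == g]
        single[g] = sum(collide(i, j) for i, j in sp) / len(sp)
        # Pairwise fairness for g: both endpoints in g.
        pp = list(combinations(members, 2))
        pairwise[g] = (sum(collide(i, j) for i, j in pp) / len(pp)) if pp else 0.0
    return coll, single, pairwise

# Toy example: 6 items, 2 groups, m = 2 buckets; for a fair hashmap all
# three statistics should be close to 1/m = 0.5.
buckets = [0, 1, 0, 1, 0, 1]
groups  = ['a', 'a', 'a', 'b', 'b', 'b']
print(fairness_stats(buckets, groups))
```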
We propose two families of algorithms to design fair hashmaps. First, we
focus on hashmaps with optimum memory consumption minimizing the unfairness. We
model the input tuples as points in $\mathbb{R}^d$ and the goal is to find the
vector $w$ such that the projection of $P$ onto $w$ creates an ordering that is
convenient to split to create a fair hashmap. For each projection we design
efficient algorithms that find near-optimum partitions of exactly (or at most)
$m$ buckets. Second, we focus on hashmaps with optimum fairness
($0$-unfairness), minimizing the memory consumption. We make the important
observation that the fair hashmap problem is reduced to the necklace splitting
problem. By carefully implementing algorithms for solving the necklace
splitting problem, we propose faster algorithms constructing hashmaps with
$0$-unfairness using a small number of boundary points.
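A minimal sketch of the first family's high-level recipe as we read it from the abstract: project the points onto $w$, sort, and cut the ordering into $m$ contiguous buckets. The real algorithms choose $w$ and the cut points to minimize unfairness; the equal-size split below is only an illustrative placeholder.

```python
import numpy as np

def project_and_split(points: np.ndarray, w: np.ndarray, m: int):
    """Order points by their projection onto w, then cut the ordering
    into m contiguous buckets. Returns a bucket index per point."""
    order = np.argsort(points @ w)          # ordering induced by the projection
    buckets = np.empty(len(points), dtype=int)
    for b, chunk in enumerate(np.array_split(order, m)):
        buckets[chunk] = b                  # contiguous ranges become buckets
    return buckets

# Toy usage: 12 points in R^2, m = 3 buckets, an arbitrary direction w.
rng = np.random.default_rng(0)
pts = rng.normal(size=(12, 2))
print(project_and_split(pts, np.array([1.0, 0.5]), 3))
```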
Locally Uniform Hashing
Hashing is a common technique used in data processing, with a strong impact
on the time and resources spent on computation. Hashing also affects the
applicability of theoretical results that often assume access to (unrealistic)
uniform/fully-random hash functions. In this paper, we are concerned with
designing hash functions that are practical and come with strong theoretical
guarantees on their performance.
To this end, we present tornado tabulation hashing, which is simple, fast,
and exhibits a certain full, local randomness property that provably makes
diverse algorithms perform almost as if (abstract) fully-random hashing was
used. For example, this includes classic linear probing, the widely used
HyperLogLog algorithm of Flajolet, Fusy, Gandouet, Meunier [AofA'07] for
counting distinct elements, and the one-permutation hashing of Li, Owen, and
Zhang [NIPS 12] for large-scale machine learning. We also provide a very
efficient solution for the classical problem of obtaining fully-random hashing
on a fixed (but unknown to the hash function) set of $n$ keys using $O(n)$
space. As a consequence, we get more efficient implementations of the splitting
trick of Dietzfelbinger and Rink [ICALP'09] and the succinct space uniform
hashing of Pagh and Pagh [SICOMP'08].
Tornado tabulation hashing is based on a simple method to systematically
break dependencies in tabulation-based hashing techniques.

Comment: FOCS 2023.
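The abstract leaves the construction at a high level; the sketch below reflects our reading of it: derived characters are appended to the key, each computed by simple tabulation over the characters before it, and the extended key is then hashed with simple tabulation. The counts, widths, and exact derivation step are assumptions here, not the paper's precise construction.

```python
import random

C, D = 4, 3            # original and derived character counts (illustrative)
CHAR_BITS = 8
SIGMA = 1 << CHAR_BITS

rng = random.Random(3)

def rand_table(out_bits):
    return [rng.getrandbits(out_bits) for _ in range(SIGMA)]

# Tables for deriving characters (one per derived position and input position),
# and tables for the final simple tabulation over the extended key.
DERIVE = [[rand_table(CHAR_BITS) for _ in range(C + j)] for j in range(D)]
FINAL = [rand_table(64) for _ in range(C + D)]

def chars_of(key, n):
    return [(key >> (i * CHAR_BITS)) & (SIGMA - 1) for i in range(n)]

def tornado_like(key: int) -> int:
    """Append D derived characters, each a simple-tabulation hash of all
    preceding characters, then simple-tabulate the extended key."""
    cs = chars_of(key, C)
    for j in range(D):
        derived = 0
        for i, ch in enumerate(cs):
            derived ^= DERIVE[j][i][ch]
        cs.append(derived)
    h = 0
    for i, ch in enumerate(cs):
        h ^= FINAL[i][ch]
    return h

print(hex(tornado_like(0x12345678)))
```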
LIPIcs, Volume 261, ICALP 2023, Complete Volume