5 research outputs found

    Fast Hashing with Strong Concentration Bounds

    Previous work on tabulation hashing by Patrascu and Thorup from STOC'11 on simple tabulation and from SODA'13 on twisted tabulation offered Chernoff-style concentration bounds on hash-based sums, e.g., the number of balls/keys hashing to a given bin, but under some quite severe restrictions on the expected values of these sums. The basic idea in tabulation hashing is to view a key as consisting of $c = O(1)$ characters, e.g., a 64-bit key as $c = 8$ characters of 8 bits. The character domain $\Sigma$ should be small enough that character tables of size $|\Sigma|$ fit in fast cache. The schemes then use $O(1)$ tables of this size, so the space of tabulation hashing is $O(|\Sigma|)$. However, the concentration bounds by Patrascu and Thorup only apply if the expected sums are $\ll |\Sigma|$. To see the problem, consider the very simple case where we use tabulation hashing to throw $n$ balls into $m$ bins and want to analyse the number of balls in a given bin. With their concentration bounds, we are fine if $n = m$, for then the expected value is $1$. However, if $m = 2$, as when tossing $n$ unbiased coins, the expected value $n/2$ is $\gg |\Sigma|$ for large data sets, e.g., data sets that do not fit in fast cache. To handle expectations that go beyond the limits of our small space, we need a much more advanced analysis of simple tabulation, plus a new tabulation technique that we call tabulation-permutation hashing, which is at most twice as slow as simple tabulation. No other hashing scheme of comparable speed offers similar Chernoff-style concentration bounds.

    Comment: 54 pages, 3 figures. An extended abstract appeared at the 52nd Annual ACM Symposium on Theory of Computing (STOC 2020).
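
    To make the scheme concrete, here is a minimal C sketch of simple tabulation hashing for 64-bit keys split into $c = 8$ characters of 8 bits, as in the abstract. The names and the rand()-based table initialization are illustrative placeholders rather than the authors' implementation; real use requires fully random table entries.

        #include <stdint.h>
        #include <stdio.h>
        #include <stdlib.h>

        #define C 8        /* characters per key, c = O(1)  */
        #define SIGMA 256  /* character domain size |Sigma| */

        /* c tables of |Sigma| random words: O(|Sigma|) space in total */
        static uint64_t T[C][SIGMA];

        /* Placeholder randomness; the analysis assumes fully random entries. */
        static void init_tables(void)
        {
            for (int i = 0; i < C; i++)
                for (int j = 0; j < SIGMA; j++)
                    T[i][j] = ((uint64_t)rand() << 32) ^ (uint64_t)rand();
        }

        /* Hash a key by XORing one table lookup per 8-bit character. */
        static uint64_t simple_tabulation(uint64_t key)
        {
            uint64_t h = 0;
            for (int i = 0; i < C; i++) {
                h ^= T[i][key & 0xFF];
                key >>= 8;
            }
            return h;
        }

        int main(void)
        {
            init_tables();
            printf("%016llx\n", (unsigned long long)simple_tabulation(42));
            return 0;
        }

    The paper's tabulation-permutation scheme additionally passes each character of the output through a random permutation of $\Sigma$, stored as one more round of character tables, which is why it costs at most twice as much as simple tabulation.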

    A Sparse Johnson-Lindenstrauss Transform Using Fast Hashing


    A Fair and Memory/Time-efficient Hashmap

    There is a large body of work on constructing hashmaps to minimize the number of collisions. However, to the best of our knowledge, no known hashing technique guarantees group fairness among different groups of items. We are given a set $P$ of $n$ tuples in $\mathbb{R}^d$, for a constant dimension $d$, and a set of groups $\mathcal{G} = \{\mathbf{g}_1, \ldots, \mathbf{g}_k\}$ such that every tuple belongs to a unique group. We formally define the fair hashing problem, introducing the notions of single fairness ($\Pr[h(p)=h(x) \mid p \in \mathbf{g}_i, x \in P]$ for every $i = 1, \ldots, k$), pairwise fairness ($\Pr[h(p)=h(q) \mid p, q \in \mathbf{g}_i]$ for every $i = 1, \ldots, k$), and the well-known collision probability ($\Pr[h(p)=h(q) \mid p, q \in P]$). The goal is to construct a hashmap such that the collision probability, the single fairness, and the pairwise fairness are close to $1/m$, where $m$ is the number of buckets in the hashmap. We propose two families of algorithms to design fair hashmaps. First, we focus on hashmaps with optimum memory consumption, minimizing the unfairness. We model the input tuples as points in $\mathbb{R}^d$, and the goal is to find a vector $w$ such that the projection of $P$ onto $w$ creates an ordering that is convenient to split in order to create a fair hashmap. For each projection, we design efficient algorithms that find near-optimum partitions of exactly (or at most) $m$ buckets. Second, we focus on hashmaps with optimum fairness ($0$-unfairness), minimizing the memory consumption. We make the important observation that the fair hashmap problem reduces to the necklace splitting problem. By carefully implementing algorithms for solving the necklace splitting problem, we propose faster algorithms constructing hashmaps with $0$-unfairness using $2(m-1)$ boundary points when $k = 2$ and $k(m-1)(4 + \log_2(3mn))$ boundary points for $k > 2$.
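
    As an illustration of these definitions (not an algorithm from the paper), the C sketch below takes a toy bucket assignment with $m = 2$ buckets and $k = 2$ groups and estimates the pairwise fairness $\Pr[h(p)=h(q) \mid p, q \in \mathbf{g}_i]$ of each group from per-group bucket counts, comparing it against the ideal $1/m$. All names and data are hypothetical.

        #include <stdio.h>

        #define N 8  /* number of tuples  */
        #define M 2  /* number of buckets */
        #define K 2  /* number of groups  */

        int main(void)
        {
            int bucket[N] = {0, 1, 0, 1, 0, 0, 1, 1}; /* h(p) per tuple  */
            int group[N]  = {0, 0, 0, 0, 1, 1, 1, 1}; /* group per tuple */

            long cnt[K][M] = {{0}}; /* cnt[i][b]: group-i tuples in bucket b */
            long size[K] = {0};     /* each group needs >= 2 tuples here     */

            for (int p = 0; p < N; p++) {
                cnt[group[p]][bucket[p]]++;
                size[group[p]]++;
            }

            for (int i = 0; i < K; i++) {
                long coll = 0; /* ordered pairs of distinct colliding tuples */
                for (int b = 0; b < M; b++)
                    coll += cnt[i][b] * (cnt[i][b] - 1);
                double pf = (double)coll / (double)(size[i] * (size[i] - 1));
                printf("group %d: pairwise fairness %.3f (ideal %.3f)\n",
                       i, pf, 1.0 / M);
            }
            return 0;
        }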

    Locally Uniform Hashing

    Hashing is a common technique used in data processing, with a strong impact on the time and resources spent on computation. Hashing also affects the applicability of theoretical results that often assume access to (unrealistic) uniform/fully-random hash functions. In this paper, we are concerned with designing hash functions that are practical and come with strong theoretical guarantees on their performance. To this end, we present tornado tabulation hashing, which is simple, fast, and exhibits a certain full, local randomness property that provably makes diverse algorithms perform almost as if (abstract) fully-random hashing was used. For example, this includes classic linear probing, the widely used HyperLogLog algorithm of Flajolet, Fusy, Gandouet, and Meunier [AofA'07] for counting distinct elements, and the one-permutation hashing of Li, Owen, and Zhang [NIPS'12] for large-scale machine learning. We also provide a very efficient solution for the classical problem of obtaining fully-random hashing on a fixed (but unknown to the hash function) set of $n$ keys using $O(n)$ space. As a consequence, we get more efficient implementations of the splitting trick of Dietzfelbinger and Rink [ICALP'09] and the succinct-space uniform hashing of Pagh and Pagh [SICOMP'08]. Tornado tabulation hashing is based on a simple method to systematically break dependencies in tabulation-based hashing techniques.

    Comment: FOCS 2023.
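
    The abstract does not spell out the construction, so the C sketch below only illustrates the dependency-breaking shape under stated assumptions: the key is extended with a few derived characters, each tabulated from all characters before it, and the extended key is then hashed with simple tabulation. The constant D and the exact derivation chain are illustrative guesses, not the precise scheme from the paper.

        #include <stdint.h>
        #include <stdio.h>
        #include <stdlib.h>

        #define C 8        /* original 8-bit characters per key        */
        #define D 3        /* derived characters (illustrative choice) */
        #define SIGMA 256

        static uint64_t A[C + D][SIGMA]; /* tables deriving new characters */
        static uint64_t B[C + D][SIGMA]; /* tables for the final hash      */

        static void fill(uint64_t t[][SIGMA], int rows)
        {
            for (int i = 0; i < rows; i++)
                for (int j = 0; j < SIGMA; j++) /* placeholder randomness */
                    t[i][j] = ((uint64_t)rand() << 32) ^ (uint64_t)rand();
        }

        static uint64_t tornado_like(uint64_t key)
        {
            uint8_t ext[C + D];

            for (int i = 0; i < C; i++) /* the original characters */
                ext[i] = (key >> (8 * i)) & 0xFF;

            /* derived characters: each tabulates everything before it */
            for (int i = C; i < C + D; i++) {
                uint64_t h = 0;
                for (int j = 0; j < i; j++)
                    h ^= A[j][ext[j]];
                ext[i] = (uint8_t)h;
            }

            uint64_t h = 0; /* simple tabulation over the extended key */
            for (int i = 0; i < C + D; i++)
                h ^= B[i][ext[i]];
            return h;
        }

        int main(void)
        {
            fill(A, C + D);
            fill(B, C + D);
            printf("%016llx\n", (unsigned long long)tornado_like(42));
            return 0;
        }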

    LIPIcs, Volume 261, ICALP 2023, Complete Volume
