5 research outputs found
Fast hashing with Strong Concentration Bounds
Previous work on tabulation hashing by Patrascu and Thorup from STOC'11 on
simple tabulation and from SODA'13 on twisted tabulation offered Chernoff-style
concentration bounds on hash based sums, e.g., the number of balls/keys hashing
to a given bin, but under some quite severe restrictions on the expected values
of these sums. The basic idea in tabulation hashing is to view a key as
consisting of $c$ characters, e.g., a 64-bit key as $c=8$ characters of
8 bits. The character domain $\Sigma$ should be small enough that character
tables of size $|\Sigma|$ fit in fast cache. The schemes then use $O(1)$ tables
of this size, so the space of tabulation hashing is $O(|\Sigma|)$. However, the
concentration bounds by Patrascu and Thorup only apply if the expected sums are
$\ll |\Sigma|$.
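To make the lookup structure concrete, here is a minimal sketch of simple tabulation hashing in Python; the 64-bit output width, the seed, and the table layout are illustrative choices, not parameters fixed by the paper.

```python
import random

C = 8          # number of characters per key
CHAR_BITS = 8  # bits per character, so |Sigma| = 2^8 = 256

# One fully random table per character position, each with |Sigma| entries.
rng = random.Random(42)  # fixed seed for reproducibility (illustrative)
TABLES = [[rng.getrandbits(64) for _ in range(1 << CHAR_BITS)] for _ in range(C)]

def simple_tabulation(key: int) -> int:
    """Hash a 64-bit key: XOR together one table lookup per character."""
    h = 0
    for i in range(C):
        char = (key >> (i * CHAR_BITS)) & ((1 << CHAR_BITS) - 1)
        h ^= TABLES[i][char]
    return h

print(hex(simple_tabulation(0x0123456789ABCDEF)))
```

The whole hash is $c$ cache-resident lookups and XORs, which is what makes the scheme fast in practice.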
To see the problem, consider the very simple case where we use tabulation
hashing to throw $n$ balls into $m$ bins and want to analyse the number of
balls in a given bin. With their concentration bounds, we are fine if
$n \le |\Sigma|$, for then the expected value is $n/m \le |\Sigma|/m$. However,
if $m=2$, as when tossing $n$ unbiased coins, the expected value $n/2$ is
$\gg |\Sigma|$ for large data sets, e.g., data sets that do not fit in fast cache.
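To get a feel for the scale of this restriction, here is a small worked instance with illustrative numbers not taken from the abstract (8-bit characters, $n = 2^{30}$ balls, $m = 2$ bins):

```latex
% Illustrative numbers: 8-bit characters, so |\Sigma| = 2^8 = 256;
% throw n = 2^{30} balls into m = 2 bins.
\[
  \mathrm{E}[\text{balls in a given bin}] \;=\; \frac{n}{m}
  \;=\; \frac{2^{30}}{2} \;=\; 2^{29} \;\gg\; |\Sigma| = 2^{8},
\]
% so the earlier bounds, which require the expectation to be \ll |\Sigma|,
% do not apply, even though the character tables still fit in cache.
```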
To handle expectations that go beyond the limits of our small space, we need
a much more advanced analysis of simple tabulation, plus a new tabulation
technique that we call \emph{tabulation-permutation} hashing which is at most
twice as slow as simple tabulation. No other hashing scheme of comparable speed
offers similar Chernoff-style concentration bounds.Comment: 54 pages, 3 figures. An extended abstract appeared at the 52nd Annual
ACM Symposium on Theory of Computing (STOC20
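The abstract does not spell the scheme out, but in the paper tabulation-permutation first hashes with simple tabulation and then applies an independent random permutation of the character domain to each character of the output. A minimal Python sketch under that reading (parameters again illustrative):

```python
import random

C = 8          # characters per key and per hash value
CHAR_BITS = 8
SIGMA = 1 << CHAR_BITS

rng = random.Random(7)
TABLES = [[rng.getrandbits(C * CHAR_BITS) for _ in range(SIGMA)] for _ in range(C)]

# One independent random permutation of the character domain per output position.
PERMS = []
for _ in range(C):
    p = list(range(SIGMA))
    rng.shuffle(p)
    PERMS.append(p)

def simple_tabulation(key: int) -> int:
    h = 0
    for i in range(C):
        h ^= TABLES[i][(key >> (i * CHAR_BITS)) & (SIGMA - 1)]
    return h

def tabulation_permutation(key: int) -> int:
    """Simple tabulation, then permute each output character independently."""
    y = simple_tabulation(key)
    out = 0
    for i in range(C):
        char = (y >> (i * CHAR_BITS)) & (SIGMA - 1)
        out |= PERMS[i][char] << (i * CHAR_BITS)
    return out

print(hex(tabulation_permutation(0x0123456789ABCDEF)))
```

The permutation pass costs one extra lookup per character, which is why the scheme is at most twice as slow as simple tabulation.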
A Fair and Memory/Time-efficient Hashmap
There is a large amount of work constructing hashmaps to minimize the number
of collisions. However, to the best of our knowledge no known hashing technique
guarantees group fairness among different groups of items. We are given a set
$P$ of $n$ tuples in $\mathbb{R}^d$, for a constant dimension $d$, and a set of
$k$ groups $\mathcal{G}=\{g_1,\ldots,g_k\}$ such that every
tuple belongs to a unique group. We formally define the fair hashing problem,
introducing the notions of single fairness ($\Pr[h(p)=h(q)\mid p\in g_i,\, q\in P]$
for every $i\in[k]$), pairwise fairness ($\Pr[h(p)=h(q)\mid p,q\in g_i]$ for every
$i\in[k]$), and the well-known collision probability ($\Pr[h(p)=h(q)]$ for
$p,q\in P$). The goal is to construct a hashmap such that the collision probability,
the single fairness, and the pairwise fairness are close to $1/m$, where $m$ is the
number of buckets in the hashmap.
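As one concrete reading of these definitions, the sketch below empirically estimates the three quantities for a fixed bucket assignment; the exact conditioning in the single-fairness estimate is our interpretation of the abstract, not the paper's formal definition.

```python
from itertools import combinations

def fairness_stats(buckets, groups):
    """buckets[i]: bucket of item i; groups[i]: group of item i.
    Returns (collision probability, per-group single fairness,
    per-group pairwise fairness) as empirical collision rates."""
    n = len(buckets)
    pairs = list(combinations(range(n), 2))
    collide = lambda i, j: buckets[i] == buckets[j]

    # Collision probability: over all pairs p, q in P.
    coll = sum(collide(i, j) for i, j in pairs) / len(pairs)

    single, pairwise = {}, {}
    for g in set(groups):
        members = [i for i in range(n) if groups[i] == g]
        # Single fairness for g: pairs with at least one endpoint in g
        # (our reading of "p in g_i, q in P").
        sp = [(i, j) for i, j in pairs if groups[i] == g or groups[j] == g]
        single[g] = sum(collide(i, j) for i, j in sp) / len(sp)
        # Pairwise fairness for g: both endpoints in g.
        pp = list(combinations(members, 2))
        pairwise[g] = (sum(collide(i, j) for i, j in pp) / len(pp)) if pp else 0.0
    return coll, single, pairwise

# Toy example: 6 items, 2 groups, m = 2 buckets; for a fair hashmap all
# three statistics should be close to 1/m = 0.5.
buckets = [0, 1, 0, 1, 0, 1]
groups  = ['a', 'a', 'a', 'b', 'b', 'b']
print(fairness_stats(buckets, groups))
```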
We propose two families of algorithms to design fair hashmaps. First, we
focus on hashmaps with optimum memory consumption minimizing the unfairness. We
model the input tuples as points in $\mathbb{R}^d$ and the goal is to find the
vector $w$ such that the projection of $P$ onto $w$ creates an ordering that is
convenient to split to create a fair hashmap. For each projection we design
efficient algorithms that find near-optimum partitions of exactly (or at most)
$m$ buckets. Second, we focus on hashmaps with optimum fairness
($0$-unfairness), minimizing the memory consumption. We make the important
observation that the fair hashmap problem is reduced to the necklace splitting
problem. By carefully implementing algorithms for solving the necklace
splitting problem, we propose faster algorithms constructing hashmaps with
$0$-unfairness using a small number of boundary points.
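A minimal sketch of the first family's high-level recipe as we read it from the abstract: project the points onto $w$, sort, and cut the ordering into $m$ contiguous buckets. The real algorithms choose $w$ and the cut points to minimize unfairness; the equal-size split below is only an illustrative placeholder.

```python
import numpy as np

def project_and_split(points: np.ndarray, w: np.ndarray, m: int):
    """Order points by their projection onto w, then cut the ordering
    into m contiguous buckets. Returns a bucket index per point."""
    order = np.argsort(points @ w)          # ordering induced by the projection
    buckets = np.empty(len(points), dtype=int)
    for b, chunk in enumerate(np.array_split(order, m)):
        buckets[chunk] = b                  # contiguous ranges become buckets
    return buckets

# Toy usage: 12 points in R^2, m = 3 buckets, an arbitrary direction w.
rng = np.random.default_rng(0)
pts = rng.normal(size=(12, 2))
print(project_and_split(pts, np.array([1.0, 0.5]), 3))
```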
Locally Uniform Hashing
Hashing is a common technique used in data processing, with a strong impact
on the time and resources spent on computation. Hashing also affects the
applicability of theoretical results that often assume access to (unrealistic)
uniform/fully-random hash functions. In this paper, we are concerned with
designing hash functions that are practical and come with strong theoretical
guarantees on their performance.
To this end, we present tornado tabulation hashing, which is simple, fast,
and exhibits a certain full, local randomness property that provably makes
diverse algorithms perform almost as if (abstract) fully-random hashing was
used. For example, this includes classic linear probing, the widely used
HyperLogLog algorithm of Flajolet, Fusy, Gandouet, Meunier [AofA'07] for
counting distinct elements, and the one-permutation hashing of Li, Owen, and
Zhang [NIPS 12] for large-scale machine learning. We also provide a very
efficient solution for the classical problem of obtaining fully-random hashing
on a fixed (but unknown to the hash function) set of $n$ keys using $O(n)$
space. As a consequence, we get more efficient implementations of the splitting
trick of Dietzfelbinger and Rink [ICALP'09] and the succinct space uniform
hashing of Pagh and Pagh [SICOMP'08].
Tornado tabulation hashing is based on a simple method to systematically
break dependencies in tabulation-based hashing techniques.

Comment: FOCS 2023.
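The abstract leaves the construction at a high level; the sketch below reflects our reading of it: derived characters are appended to the key, each computed by simple tabulation over the characters before it, and the extended key is then hashed with simple tabulation. The counts, widths, and exact derivation step are assumptions here, not the paper's precise construction.

```python
import random

C, D = 4, 3            # original and derived character counts (illustrative)
CHAR_BITS = 8
SIGMA = 1 << CHAR_BITS

rng = random.Random(3)

def rand_table(out_bits):
    return [rng.getrandbits(out_bits) for _ in range(SIGMA)]

# Tables for deriving characters (one per derived position and input position),
# and tables for the final simple tabulation over the extended key.
DERIVE = [[rand_table(CHAR_BITS) for _ in range(C + j)] for j in range(D)]
FINAL = [rand_table(64) for _ in range(C + D)]

def chars_of(key, n):
    return [(key >> (i * CHAR_BITS)) & (SIGMA - 1) for i in range(n)]

def tornado_like(key: int) -> int:
    """Append D derived characters, each a simple-tabulation hash of all
    preceding characters, then simple-tabulate the extended key."""
    cs = chars_of(key, C)
    for j in range(D):
        derived = 0
        for i, ch in enumerate(cs):
            derived ^= DERIVE[j][i][ch]
        cs.append(derived)
    h = 0
    for i, ch in enumerate(cs):
        h ^= FINAL[i][ch]
    return h

print(hex(tornado_like(0x12345678)))
```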
LIPIcs, Volume 261, ICALP 2023, Complete Volume