Simple Tabulation, Fast Expanders, Double Tabulation, and High Independence
Simple tabulation dates back to Zobrist in 1970. Keys are viewed as c
characters from some alphabet A. We initialize c tables h_0, ..., h_{c-1}
mapping characters to random hash values. A key x=(x_0, ..., x_{c-1}) is hashed
to h_0[x_0] xor...xor h_{c-1}[x_{c-1}]. The scheme is extremely fast when the
character hash tables h_i are in cache. Simple tabulation hashing is not
4-independent, but we show that if we apply it twice, then we get high
independence. First we hash to intermediate keys that are 6 times longer than
the original keys, and then we hash the intermediate keys to the final hash
values.
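To make the scheme concrete, here is a minimal C sketch of simple tabulation for 32-bit keys split into c = 4 characters of 8 bits; the parameters, names, and the crude rand()-based table initialization are our illustrative choices, not from the paper.

```c
#include <stdint.h>
#include <stdlib.h>

#define C     4      /* characters per key */
#define ALPHA 256    /* alphabet size |A| = 2^8 */

static uint64_t T[C][ALPHA];   /* c character tables of random hash values */

/* Fill tables with random values (illustrative only; a real implementation
   would use a stronger source of randomness than rand()). */
static void tab_init(void) {
    for (int i = 0; i < C; i++)
        for (int j = 0; j < ALPHA; j++)
            T[i][j] = ((uint64_t)rand() << 32) ^ (uint64_t)rand();
}

/* h(x) = h_0[x_0] xor ... xor h_{c-1}[x_{c-1}] */
static uint64_t tab_hash(uint32_t x) {
    uint64_t h = 0;
    for (int i = 0; i < C; i++)
        h ^= T[i][(x >> (8 * i)) & 0xFF];
    return h;
}
```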
The intermediate keys have d=6c characters from A. We can view the hash
function as a degree d bipartite graph with keys on one side, each with edges
to d output characters. We show that this graph has nice expansion properties,
and from that we get that with another level of simple tabulation on the
intermediate keys, the composition is a highly independent hash function. The
independence we get is |A|^{Omega(1/c)}.
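The composition can be sketched in the same style (a hedged illustration with our own types and sizes, assuming all tables are filled with random bits as above): the first level expands each key into an intermediate key of d = 6c characters, and the second level is plain simple tabulation over those characters.

```c
#include <stdint.h>

#define C2    4          /* input characters */
#define D     (6 * C2)   /* intermediate characters, d = 6c */
#define SIGMA 256        /* alphabet size |A| */

static uint8_t  T1[C2][SIGMA][D]; /* level 1: char -> d output characters */
static uint64_t T2[D][SIGMA];     /* level 2: simple tabulation tables */

/* Double tabulation: derive the intermediate key y by XORing the
   first-level character strings, then simple-tabulate y. */
static uint64_t double_tab(uint32_t x) {
    uint8_t y[D] = {0};
    for (int i = 0; i < C2; i++)
        for (int j = 0; j < D; j++)
            y[j] ^= T1[i][(x >> (8 * i)) & 0xFF][j];
    uint64_t h = 0;
    for (int j = 0; j < D; j++)
        h ^= T2[j][y[j]];
    return h;
}
```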
Our space is O(c|A|) and the hash function is evaluated in O(c) time. Siegel
[FOCS'89, SICOMP'04] proved that with this space, if the hash function is
evaluated in o(c) time, then the independence can only be o(c), so our
evaluation time is best possible for Omega(c) independence---our independence
is much higher if c=|A|^{o(1)}.
Siegel used O(c)^c evaluation time to get the same independence with similar
space. Siegel's main focus was c=O(1), but we are exponentially faster when
c=omega(1).
Applying our scheme recursively, we can increase our independence to
|A|^{Omega(1)} with o(c^{log c}) evaluation time. Compared with Siegel's scheme
this is both faster and achieves higher independence.
Our scheme is easy to implement, and it does provide realistic
implementations of 100-independent hashing for, say, 32- and 64-bit keys.
Approximately Minwise Independence with Twisted Tabulation
A random hash function h is eps-minwise if for any set S, |S| = n, and
element x in S, Pr[h(x) = min h(S)] = (1 ± eps)/n.
Minwise hash functions with low bias eps have widespread applications
within similarity estimation.
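The connection to similarity estimation is the classic MinHash estimator: for an (approximately) minwise h, Pr[min h(A) = min h(B)] is (approximately) the Jaccard similarity |A ∩ B| / |A ∪ B|. A minimal C sketch, where the array of k independently seeded hash functions and all names are our illustrative assumptions:

```c
#include <stdint.h>
#include <stddef.h>

/* Estimate Jaccard similarity by comparing minimum hash values under
   k independently seeded (approximately) minwise hash functions. */
static double jaccard_estimate(uint64_t (*hash[])(uint32_t), size_t k,
                               const uint32_t *A, size_t na,
                               const uint32_t *B, size_t nb) {
    size_t agree = 0;
    for (size_t r = 0; r < k; r++) {
        uint64_t ma = UINT64_MAX, mb = UINT64_MAX;
        for (size_t i = 0; i < na; i++) {
            uint64_t v = hash[r](A[i]);
            if (v < ma) ma = v;
        }
        for (size_t i = 0; i < nb; i++) {
            uint64_t v = hash[r](B[i]);
            if (v < mb) mb = v;
        }
        agree += (ma == mb);   /* matches with prob. ~ Jaccard similarity */
    }
    return (double)agree / (double)k;
}
```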
Hashing from a universe [u], the twisted tabulation hashing of
P\v{a}tra\c{s}cu and Thorup [SODA'13] makes c = O(1) lookups in tables of size
u^{1/c}. Twisted tabulation was invented to get good concentration for
hashing-based sampling. Here we show that twisted tabulation yields Õ(1/u^{1/c})-minwise hashing.
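Structurally, twisted tabulation is a small change to simple tabulation. The sketch below follows our reading of the SODA'13 scheme, with illustrative parameters and table layout: the tail characters are looked up to produce both a hash value and a "twist" character, and the twist is XORed into the head character before its final lookup.

```c
#include <stdint.h>

#define TC   4     /* characters per key */
#define TSIG 256   /* alphabet size */

static uint64_t Tval[TC][TSIG];  /* tail tables: hash-value part */
static uint8_t  Ttw[TC][TSIG];   /* tail tables: twist part */
static uint64_t Thead[TSIG];     /* head table */

static uint64_t twisted_tab(uint32_t x) {
    uint64_t h = 0;
    uint8_t  t = 0;
    for (int i = 1; i < TC; i++) {            /* hash the tail characters */
        uint8_t ch = (x >> (8 * i)) & 0xFF;
        h ^= Tval[i][ch];
        t ^= Ttw[i][ch];
    }
    uint8_t head = x & 0xFF;
    return h ^ Thead[head ^ t];               /* "twist" the head lookup */
}
```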
In the classic independence paradigm of Wegman and Carter [FOCS'79], eps-minwise hashing requires Omega(log(1/eps))-independence [Indyk
SODA'99]. P\v{a}tra\c{s}cu and Thorup [STOC'11] had shown that simple
tabulation, using the same space and lookups, yields Õ(1/n^{1/c})-minwise
independence, which is good for large sets, but useless for small sets. Our
analysis uses some of the same methods, but is much cleaner, bypassing a
complicated induction argument.
Fast and Powerful Hashing using Tabulation
Randomized algorithms are often enjoyed for their simplicity, but the hash
functions employed to yield the desired probabilistic guarantees are often too
complicated to be practical. Here we survey recent results on how simple
hashing schemes based on tabulation provide unexpectedly strong guarantees.
Simple tabulation hashing dates back to Zobrist [1970]. Keys are viewed as
consisting of c characters and we have precomputed character tables h_1, ..., h_c
mapping characters to random hash values. A key x = (x_1, ..., x_c)
is hashed to h_1[x_1] xor h_2[x_2] xor ... xor h_c[x_c]. This scheme is
very fast with character tables in cache. While simple tabulation is not even
4-independent, it does provide many of the guarantees that are normally
obtained via higher independence, e.g., linear probing and Cuckoo hashing.
Next we consider twisted tabulation where one input character is "twisted" in
a simple way. The resulting hash function has powerful distributional
properties: Chernoff-Hoeffding type tail bounds and a very small bias for
min-wise hashing. This also yields an extremely fast pseudo-random number
generator that is provably good for many classic randomized algorithms and
data-structures.
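The generator idea is simply to hash a counter, so each pseudo-random number costs a few cache-resident lookups; a hedged sketch reusing the twisted_tab function sketched earlier (the survey's actual generator is more refined):

```c
/* Pseudo-random numbers by hashing 0, 1, 2, ... with twisted tabulation. */
static uint32_t prng_counter = 0;

static uint64_t prng_next(void) {
    return twisted_tab(prng_counter++);
}
```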
Finally, we consider double tabulation where we compose two simple tabulation
functions, applying one to the output of the other, and show that this yields
very high independence in the classic framework of Carter and Wegman [1977]. In
fact, w.h.p., for a given set of size proportional to that of the space
consumed, double tabulation gives fully-random hashing. We also mention some
more elaborate tabulation schemes getting near-optimal independence for given
time and space.
While these tabulation schemes are all easy to implement and use, their
analysis is not.
Fast Hashing with Strong Concentration Bounds
Previous work on tabulation hashing by Patrascu and Thorup from STOC'11 on
simple tabulation and from SODA'13 on twisted tabulation offered Chernoff-style
concentration bounds on hash-based sums, e.g., the number of balls/keys hashing
to a given bin, but under some quite severe restrictions on the expected values
of these sums. The basic idea in tabulation hashing is to view a key as
consisting of c characters, e.g., a 64-bit key as c = 8 characters of
8 bits. The character domain Sigma should be small enough that character
tables of size |Sigma| fit in fast cache. The schemes then use c tables
of this size, so the space of tabulation hashing is O(c|Sigma|). However, the
concentration bounds by Patrascu and Thorup only apply if the expected sums are
<< |Sigma|.
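Concretely (our worked numbers, not the paper's): with 8-bit characters we have |Sigma| = 256, and a 64-bit key gives c = 8 tables of 256 64-bit entries, i.e., 2 KB per table and 16 KB in total, which fits comfortably in the L1 cache of most modern CPUs; the restriction then requires the expected sums to stay well below 256.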
To see the problem, consider the very simple case where we use tabulation
hashing to throw n balls into m bins and want to analyse the number of
balls in a given bin. With their concentration bounds, we are fine if m = n,
for then the expected value is 1. However, if m = 2, as when tossing n
unbiased coins, the expected value n/2 is >> |Sigma| for large data sets,
e.g., data sets that do not fit in fast cache.
To handle expectations that go beyond the limits of our small space, we need
a much more advanced analysis of simple tabulation, plus a new tabulation
technique that we call \emph{tabulation-permutation} hashing which is at most
twice as slow as simple tabulation. No other hashing scheme of comparable speed
offers similar Chernoff-style concentration bounds.Comment: 54 pages, 3 figures. An extended abstract appeared at the 52nd Annual
ACM Symposium on Theory of Computing (STOC20
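Structurally, tabulation-permutation hashes with simple tabulation and then sends each character of the output through its own random permutation. A hedged C sketch of that shape, reusing tab_hash from the first section; the layout (permutations stored pre-shifted into their output positions) and all parameters are our illustrative choices:

```c
#include <stdint.h>

#define OUTC 8   /* 64-bit hash value viewed as 8 output characters */

/* P[j][a] holds pi_j(a), for pi_j a random permutation of [256],
   pre-shifted left by 8*j bits so characters can be OR'ed into place. */
static uint64_t P[OUTC][256];

static uint64_t tab_perm(uint32_t x) {
    uint64_t y = tab_hash(x);   /* simple tabulation, as sketched earlier */
    uint64_t h = 0;
    for (int j = 0; j < OUTC; j++)
        h |= P[j][(y >> (8 * j)) & 0xFF];
    return h;
}
```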
Power of d Choices with Simple Tabulation
Suppose that we are to place m balls into n bins sequentially using the
d-choice paradigm: For each ball we are given a choice of d bins, according
to d hash functions h_1, ..., h_d, and we place the ball in the least loaded
of these bins, breaking ties arbitrarily. Our interest is in the number of balls
in the fullest bin after all m balls have been placed.
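One insertion step of this process can be sketched as follows (our illustrative C; mapping hash values to bins by modulo is an assumption, not from the paper):

```c
#include <stdint.h>
#include <stddef.h>

/* Place one ball: probe the d bins chosen by the d hash functions and
   increment the load of the least loaded one (first probe wins ties). */
static void place_ball(uint32_t ball, size_t *load, size_t nbins,
                       uint64_t (*h[])(uint32_t), int d) {
    size_t best = h[0](ball) % nbins;
    for (int i = 1; i < d; i++) {
        size_t b = h[i](ball) % nbins;
        if (load[b] < load[best]) best = b;
    }
    load[best]++;
}
```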
Azar et al. [STOC'94] proved that when m = O(n) and when the hash functions
are fully random, the maximum load is at most lg lg n / lg d + O(1)
whp (i.e. with probability 1 - O(n^{-gamma}) for any choice of gamma).
In this paper we suppose that the h_1, ..., h_d are simple tabulation hash
functions. Generalising a result by Dahlgaard et al. [SODA'16] we show that for
an arbitrary constant gamma the maximum load is O(lg lg n) whp, and
that the expected maximum load is at most lg lg n / lg d + O(1). We
further show that by using a simple tie-breaking algorithm introduced by
V\"ocking [J.ACM'03] the expected maximum load drops to
lg lg n / (d lg phi_d) + O(1), where phi_d is the rate of growth of the d-ary
Fibonacci numbers. Both of these expected bounds match those of the fully
random setting.
The analysis by Dahlgaard et al. relies on a proof by P\u{a}tra\c{s}cu and
Thorup [J.ACM'11] concerning the use of simple tabulation for cuckoo hashing.
We need here a generalisation to d hash functions, but the original proof
is an 8-page tour de force of ad-hoc arguments that do not appear to
generalise. Our main technical contribution is a shorter, simpler and more
accessible proof of the result by P\u{a}tra\c{s}cu and Thorup, where the
relevant parts generalise nicely to the analysis of d choices.
Quicksort, Largest Bucket, and Min-Wise Hashing with Limited Independence
Randomized algorithms and data structures are often analyzed under the
assumption of access to a perfect source of randomness. The most fundamental
metric used to measure how "random" a hash function or a random number
generator is, is its independence: a sequence of random variables is said to be
k-independent if every variable is uniform and every size-k subset is
independent. In this paper we consider three classic algorithms under limited
independence. We provide new bounds for randomized quicksort, min-wise hashing
and largest bucket size under limited independence. Our results can be
summarized as follows.
- Randomized quicksort. When pivot elements are computed using a
5-independent hash function, Karloff and Raghavan, J.ACM'93 showed O(n log n)
expected worst-case running time for a special version of quicksort.
We improve upon this, showing that the same running time is achieved with only
4-independence.
- Min-wise hashing. For a set A, consider the probability of a particular
element being mapped to the smallest hash value. It is known that
O(log n)-independence implies the optimal probability O(1/n). Broder et al.,
STOC'98 showed that 2-independence implies it is O(1/sqrt(|A|)). We show
a matching lower bound as well as new tight bounds for 3- and 4-independent
hash functions.
- Largest bucket. We consider the case where n balls are distributed to n
buckets using a 2-independent hash function and analyze the largest bucket
size. Alon et al., STOC'97 showed that there exists a 2-independent hash
function implying a bucket of size Omega(sqrt(n)). We generalize the
bound, providing a k-independent family of functions that imply size Omega(n^{1/k}).
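For contrast with the tabulation schemes above, the textbook way to realize k-independence is a degree-(k-1) polynomial with random coefficients over a prime field (a standard construction, not specific to this paper). A hedged sketch with an illustrative Mersenne prime; it assumes a compiler supporting the unsigned __int128 extension and coefficients drawn uniformly from [0, PRIME):

```c
#include <stdint.h>

#define PRIME ((uint64_t)0x1FFFFFFFFFFFFFFF)  /* Mersenne prime 2^61 - 1 */
#define KIND  4                               /* independence k */

static uint64_t coef[KIND];   /* random coefficients in [0, PRIME) */

/* Horner evaluation of a random degree-(k-1) polynomial over Z_PRIME:
   k-independent on keys x < PRIME. */
static uint64_t poly_hash(uint64_t x) {
    unsigned __int128 h = 0;
    for (int i = KIND - 1; i >= 0; i--)
        h = (h * x + coef[i]) % PRIME;
    return (uint64_t)h;
}
```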