126 research outputs found

The universality of iterated hashing over variable-length strings

Author: Byers
Carter
Cohen
Daniel Lemire
Knuth
Krawczyk
Krovetz
Kukich
Lemire
Liskov
Pagh
Pearson
Piret
Preneel
Ramakrishna
Rogaway
Sarkar
Shoup
Stinson
Zobrist
Publication venue: 'Elsevier BV'
Publication date: 24/11/2011

Iterated hash functions process strings recursively, one character at a time. At each iteration, they compute a new hash value from the preceding hash value and the next character. We prove that iterated hashing can be pairwise independent, but never 3-wise independent. We show that it can be almost universal over strings much longer than the number of hash values; we bound the maximal string length given the collision probability.
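The recursion described in this abstract can be sketched as follows. This is a minimal illustration only: the modulus `P`, the multiplier `a`, the `+ 1` offset, and the names `make_iterated_hash`, `step`, and `hash_string` are assumptions made for the example, not the construction analysed in the paper, and no claim is made about the independence properties of this particular recurrence.

```python
import random

P = (1 << 61) - 1  # Mersenne prime modulus (illustrative choice, not from the paper)


def make_iterated_hash(seed=None):
    """Return (step, hash_string) for one randomly chosen iterated hash."""
    rng = random.Random(seed)
    a = rng.randrange(1, P)  # random multiplier, drawn once per hash function

    def step(prev, ch):
        # New hash value from the preceding hash value and the next character.
        # The "+ 1" keeps zero bytes from leaving the value unchanged.
        return (prev * a + ch + 1) % P

    def hash_string(s):
        value = 0  # hash of the empty string
        for ch in s:  # iterating over bytes yields integers 0..255
            value = step(value, ch)
        return value

    return step, hash_string


step, h = make_iterated_hash(seed=1)
print(h(b"abc") == step(h(b"ab"), ord("c")))
```

The final line checks the defining property: hashing `b"abc"` is the same as hashing `b"ab"` and then applying one more step with the character `c`.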

arXiv.org e-Print Archive

Fast and Powerful Hashing using Tabulation

Author: Thorup Mikkel
Publication date: 01/01/2016

Randomized algorithms are often enjoyed for their simplicity, but the hash functions employed to yield the desired probabilistic guarantees are often too complicated to be practical. Here we survey recent results on how simple hashing schemes based on tabulation provide unexpectedly strong guarantees. Simple tabulation hashing dates back to Zobrist [1970]. Keys are viewed as consisting of $c$ characters and we have precomputed character tables $h_1,\ldots,h_c$ mapping characters to random hash values. A key $x=(x_1,\ldots,x_c)$ is hashed to $h_1[x_1] \oplus h_2[x_2] \oplus \cdots \oplus h_c[x_c]$. This scheme is very fast with character tables in cache. While simple tabulation is not even 4-independent, it does provide many of the guarantees that are normally obtained via higher independence, e.g., linear probing and Cuckoo hashing. Next we consider twisted tabulation where one input character is "twisted" in a simple way. The resulting hash function has powerful distributional properties: Chernoff-Hoeffding type tail bounds and a very small bias for min-wise hashing. This also yields an extremely fast pseudo-random number generator that is provably good for many classic randomized algorithms and data-structures. Finally, we consider double tabulation where we compose two simple tabulation functions, applying one to the output of the other, and show that this yields very high independence in the classic framework of Carter and Wegman [1977]. In fact, w.h.p., for a given set of size proportional to that of the space consumed, double tabulation gives fully-random hashing. We also mention some more elaborate tabulation schemes getting near-optimal independence for given time and space. While these tabulation schemes are all easy to implement and use, their analysis is not
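Simple tabulation as described in this abstract can be sketched in a few lines. The parameter values (`c=4`, 8-bit characters, 32-bit outputs), the use of Python's `random` module to fill the tables, and the name `make_simple_tabulation` are assumptions made for illustration; only the XOR-of-table-lookups structure comes from the abstract.

```python
import random


def make_simple_tabulation(c=4, char_bits=8, out_bits=32, seed=None):
    """Return (h, tables): a simple tabulation hash over c characters."""
    rng = random.Random(seed)
    # One table per character position, each mapping a character to a random value.
    tables = [
        [rng.getrandbits(out_bits) for _ in range(1 << char_bits)]
        for _ in range(c)
    ]
    mask = (1 << char_bits) - 1

    def h(x):
        result = 0
        for i in range(c):
            x_i = (x >> (i * char_bits)) & mask  # i-th character of the key
            result ^= tables[i][x_i]             # XOR in h_i[x_i]
        return result

    return h, tables


h, tables = make_simple_tabulation(seed=0)

# Four keys that differ only in their two lowest characters:
# (a0, b0), (a0, b1), (a1, b0), (a1, b1).
a0, a1, b0, b1 = 1, 2, 3, 4
keys = [a | (b << 8) for a in (a0, a1) for b in (b0, b1)]
print(h(keys[0]) ^ h(keys[1]) ^ h(keys[2]) ^ h(keys[3]))  # prints 0
```

The last line illustrates why simple tabulation is not 4-independent: for these four keys every table entry appears an even number of times, so the XOR of the four hash values is always 0, and the fourth hash value is determined by the other three.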

arXiv.org e-Print Archive

Copenhagen University Research Information System

Dagstuhl Research Online Publication Server

On the k-Independence Required by Linear Probing and Minwise Independence

Author: A. Pagh
A.Z. Broder
E. Cohen
J.P. Schmidt
M.N. Wegman
P. Indyk
Publication venue: 'Springer Science and Business Media LLC'
Publication date: 01/01/2010

Fast hashing with Strong Concentration Bounds

Author: Aamand Anders
Bernstein Sergei Natanovich
Celis L. Elisa
Dahlgaard Søren
Dumey A. I.
Meka Raghu
Mitzenmacher Michael
Pătraşcu Mihai
Publication date: 01/01/2020

Previous work on tabulation hashing by Patrascu and Thorup from STOC'11 on simple tabulation and from SODA'13 on twisted tabulation offered Chernoff-style concentration bounds on hash based sums, e.g., the number of balls/keys hashing to a given bin, but under some quite severe restrictions on the expected values of these sums. The basic idea in tabulation hashing is to view a key as consisting of $c=O(1)$ characters, e.g., a 64-bit key as $c=8$ characters of 8 bits. The character domain $\Sigma$ should be small enough that character tables of size $|\Sigma|$ fit in fast cache. The schemes then use $O(1)$ tables of this size, so the space of tabulation hashing is $O(|\Sigma|)$. However, the concentration bounds by Patrascu and Thorup only apply if the expected sums are $\ll |\Sigma|$. To see the problem, consider the very simple case where we use tabulation hashing to throw $n$ balls into $m$ bins and want to analyse the number of balls in a given bin. With their concentration bounds, we are fine if $n=m$, for then the expected value is $1$. However, if $m=2$, as when tossing $n$ unbiased coins, the expected value $n/2 \gg |\Sigma|$ for large data sets, e.g., data sets that do not fit in fast cache. To handle expectations that go beyond the limits of our small space, we need a much more advanced analysis of simple tabulation, plus a new tabulation technique that we call tabulation-permutation hashing which is at most twice as slow as simple tabulation. No other hashing scheme of comparable speed offers similar Chernoff-style concentration bounds. Comment: 54 pages, 3 figures. An extended abstract appeared at the 52nd Annual ACM Symposium on Theory of Computing (STOC20
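The arithmetic in this abstract can be made concrete with a short sketch. The helper names `split_key` and `expected_load` and the choice of one million balls are assumptions for illustration; the 64-bit key, `c=8`, 8-bit characters, and the two cases `n=m` and `m=2` come from the abstract. Tabulation-permutation hashing itself is not specified in the abstract and is not implemented here.

```python
def split_key(x, c=8, char_bits=8):
    """View an integer key as c characters of char_bits bits, lowest first."""
    mask = (1 << char_bits) - 1
    return [(x >> (i * char_bits)) & mask for i in range(c)]


def expected_load(n, m):
    """Expected number of balls in one given bin when n balls go into m bins."""
    return n / m


SIGMA_SIZE = 1 << 8  # |Sigma| = 256 for 8-bit characters

print(split_key(0x0102030405060708))  # [8, 7, 6, 5, 4, 3, 2, 1]

n = 10**6  # hypothetical data-set size
print(expected_load(n, n))               # 1.0, well below |Sigma|
print(expected_load(n, 2) > SIGMA_SIZE)  # True: n/2 far exceeds |Sigma|
```

With `n = m` the expected load of 1 is far below `|Sigma|`, the regime the earlier bounds cover; with `m = 2` the expected load `n/2` already exceeds `|Sigma|` by several orders of magnitude.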

arXiv.org e-Print Archive

Copenhagen University Research Information System

CountSketches, Feature Hashing and the Median of Three

Author: Larsen Kasper Green
Pagh Rasmus
Tětek Jakub
Publication date: 01/01/2021

In this paper, we revisit the classic CountSketch method, which is a sparse, random projection that transforms a (high-dimensional) Euclidean vector $v$ to a vector of dimension $(2t-1)s$, where $t, s > 0$ are integer parameters. It is known that even for $t=1$, a CountSketch allows estimating coordinates of $v$ with variance bounded by $\|v\|_2^2/s$. For $t > 1$, the estimator takes the median of $2t-1$ independent estimates, and the probability that the estimate is off by more than $2\|v\|_2/\sqrt{s}$ is exponentially small in $t$. This suggests choosing $t$ to be logarithmic in a desired inverse failure probability. However, implementations of CountSketch often use a small, constant $t$. Previous work only predicts a constant factor improvement in this setting. Our main contribution is a new analysis of CountSketch, showing an improvement in variance to $O(\min\{\|v\|_1^2/s^2, \|v\|_2^2/s\})$ when $t > 1$. That is, the variance decreases proportionally to $s^{-2}$, asymptotically for large enough $s$. We also study the variance in the setting where an inner product is to be estimated from two CountSketches. This finding suggests that the Feature Hashing method, which is essentially identical to CountSketch but does not make use of the median estimator, can be made more reliable at a small cost in settings where using a median estimator is possible. We confirm our theoretical findings in experiments and thereby help justify why a small constant number of estimates often suffice in practice. Our improved variance bounds are based on new general theorems about the variance and higher moments of the median of i.i.d. random variables that may be of independent interest
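The estimator described in this abstract, a sketch of dimension $(2t-1)s$ whose coordinate estimate is the median of $2t-1$ per-row estimates, might look roughly like the following. The class name `CountSketch`, the method names, and in particular the use of a string-seeded `random.Random` as a stand-in for the per-row bucket and sign hash functions are assumptions made for this example, not the paper's implementation.

```python
import random
from statistics import median


class CountSketch:
    """Sketch with 2t-1 rows of s counters each (total dimension (2t-1)*s)."""

    def __init__(self, t, s, seed=0):
        self.rows = 2 * t - 1
        self.s = s
        self.seed = seed
        self.table = [[0.0] * s for _ in range(self.rows)]

    def _bucket_and_sign(self, row, i):
        # Hypothetical stand-in for independent hash functions: a PRNG
        # seeded deterministically by (seed, row, coordinate index).
        rng = random.Random(f"{self.seed}:{row}:{i}")
        return rng.randrange(self.s), rng.choice((-1, 1))

    def update(self, i, delta):
        """Add delta to coordinate i of the sketched vector v."""
        for row in range(self.rows):
            bucket, sign = self._bucket_and_sign(row, i)
            self.table[row][bucket] += sign * delta

    def estimate(self, i):
        """Median of the 2t-1 per-row estimates of coordinate i."""
        per_row = []
        for row in range(self.rows):
            bucket, sign = self._bucket_and_sign(row, i)
            per_row.append(sign * self.table[row][bucket])
        return median(per_row)


cs = CountSketch(t=2, s=16)
cs.update(5, 3.0)
print(cs.estimate(5))  # 3.0: with a single nonzero coordinate there are no collisions
```

Setting `t=1` gives a single row and no median step, which corresponds to the Feature Hashing setting mentioned in the abstract.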

arXiv.org e-Print Archive

Copenhagen University Research Information System