    Fast Similarity Sketching

    We consider the Similarity Sketching problem: given a universe $[u]=\{0,\ldots,u-1\}$, we want a random function $S$ mapping subsets $A\subseteq [u]$ into vectors $S(A)$ of size $t$, such that similarity is preserved. More precisely: given sets $A,B\subseteq [u]$, define $X_i=[S(A)[i]=S(B)[i]]$ and $X=\sum_{i\in [t]}X_i$. We want $E[X]=t\cdot J(A,B)$, where $J(A,B)=|A\cap B|/|A\cup B|$, and furthermore we want strong concentration guarantees (i.e., Chernoff-style bounds) for $X$. This is a fundamental problem which has found numerous applications in data mining, large-scale classification, computer vision, similarity search, etc. via the classic MinHash algorithm. The vectors $S(A)$ are also called sketches. The seminal $t\times$MinHash algorithm uses $t$ random hash functions $h_1,\ldots,h_t$ and stores $\left(\min_{a\in A}h_1(a),\ldots,\min_{a\in A}h_t(a)\right)$ as the sketch of $A$. The main drawback of MinHash is, however, its $O(t\cdot |A|)$ running time, and finding a sketch with similar properties and faster running time has been the subject of several papers. Addressing this, Li et al. [NIPS'12] introduced one permutation hashing (OPH), which creates a sketch of size $t$ in $O(t+|A|)$ time, but with the drawback that some of the $t$ entries may be "empty" when $|A|=O(t)$. One could argue that sketching is not necessary in this case; however, the desire in most applications is to have one sketching procedure that works for sets of all sizes. Therefore, filling out these empty entries is the subject of several follow-up papers initiated by Shrivastava and Li [ICML'14]. However, these "densification" schemes fail to provide good concentration bounds exactly in the case $|A|=O(t)$, where they are needed. (continued...)
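
    For concreteness, here is a minimal sketch of the classic $t\times$MinHash algorithm and the Jaccard estimate described above (not the paper's faster sketching scheme). The multiply-mod-prime functions standing in for the $t$ random hash functions are an illustrative assumption.

```python
import random

# Classic t x MinHash: t hash functions h_1, ..., h_t, the sketch stores
# S(A)[i] = min_{a in A} h_i(a), and matching coordinates estimate J(A, B).
# Multiply-mod-prime functions are used as illustrative stand-ins for the
# random hash functions assumed in the analysis.

P = (1 << 61) - 1  # a Mersenne prime, comfortably larger than the universe here

def make_hash_functions(t, seed=0):
    rng = random.Random(seed)
    return [(rng.randrange(1, P), rng.randrange(P)) for _ in range(t)]

def minhash_sketch(A, funcs):
    """Returns S(A); runs in O(t * |A|) time, the drawback noted above."""
    return [min((alpha * a + beta) % P for a in A) for alpha, beta in funcs]

def estimate_jaccard(SA, SB):
    """X / t where X = sum_i [S(A)[i] == S(B)[i]]; E[X / t] = J(A, B)."""
    return sum(x == y for x, y in zip(SA, SB)) / len(SA)

funcs = make_hash_functions(t=512)
A, B = set(range(0, 80)), set(range(40, 120))
print(estimate_jaccard(minhash_sketch(A, funcs), minhash_sketch(B, funcs)))
# true Jaccard similarity: |A intersect B| / |A union B| = 40/120, about 0.33
```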

    Practical Hash Functions for Similarity Estimation and Dimensionality Reduction

    Hashing is a basic tool for dimensionality reduction employed in several aspects of machine learning. However, the performance analysis is often carried out under the abstract assumption that a truly random unit-cost hash function is used, without concern for which concrete hash function is employed. The concrete hash function may work fine on sufficiently random input. The question is whether it can be trusted in the real world when faced with more structured input. In this paper we focus on two prominent applications of hashing, namely similarity estimation with the one permutation hashing (OPH) scheme of Li et al. [NIPS'12] and feature hashing (FH) of Weinberger et al. [ICML'09], both of which have found numerous applications, e.g. in approximate near-neighbour search with LSH and large-scale classification with SVM. We consider mixed tabulation hashing of Dahlgaard et al. [FOCS'15], which was proved to perform like a truly random hash function in many applications, including OPH. Here we first show improved concentration bounds for FH with truly random hashing and then argue that mixed tabulation performs similarly for sparse input. Our main contribution, however, is an experimental comparison of different hashing schemes when used inside FH, OPH, and LSH. We find that mixed tabulation hashing is almost as fast as the multiply-mod-prime scheme $(ax+b) \bmod p$. Multiply-mod-prime is guaranteed to work well on sufficiently random data, but we demonstrate that in the above applications it can lead to bias and poor concentration on both real-world and synthetic data. We also compare with the popular MurmurHash3, which has no proven guarantees. Mixed tabulation and MurmurHash3 both perform similarly to truly random hashing in our experiments. However, mixed tabulation is 40% faster than MurmurHash3, and it has the proven guarantee of good performance on all possible input. Comment: A preliminary version of this paper will appear at NIPS 2017.
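
    For reference, the following is a minimal sketch of feature hashing (FH) in the spirit of Weinberger et al.: each feature is mapped to one of $d$ output coordinates by a bucket hash, with a random sign. The multiply-mod-prime stand-ins below are illustrative assumptions; the paper's point is precisely that the choice of the underlying hash function (e.g. mixed tabulation) matters.

```python
import random

# Feature hashing (FH): a sparse feature vector is mapped to a dense vector of
# dimension d. Each feature id f goes to coordinate h(f) mod d with a random
# sign s(f) in {-1, +1}. The multiply-mod-prime functions h and s below are
# illustrative stand-ins for whatever concrete hash function is chosen.

P = (1 << 61) - 1

def make_mmp(seed):
    rng = random.Random(seed)
    a, b = rng.randrange(1, P), rng.randrange(P)
    return lambda x: (a * x + b) % P

h = make_mmp(1)   # bucket hash
s = make_mmp(2)   # sign hash

def feature_hash(features, d):
    """features: dict mapping feature id -> value; returns a length-d vector."""
    v = [0.0] * d
    for f, value in features.items():
        sign = 1.0 if s(f) & 1 else -1.0
        v[h(f) % d] += sign * value
    return v

x = {10: 1.0, 99: 2.5, 123456: -1.0}
print(feature_hash(x, d=8))
```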

    Fast and Powerful Hashing using Tabulation

    Randomized algorithms are often enjoyed for their simplicity, but the hash functions employed to yield the desired probabilistic guarantees are often too complicated to be practical. Here we survey recent results on how simple hashing schemes based on tabulation provide unexpectedly strong guarantees. Simple tabulation hashing dates back to Zobrist [1970]. Keys are viewed as consisting of $c$ characters and we have precomputed character tables $h_1,\ldots,h_c$ mapping characters to random hash values. A key $x=(x_1,\ldots,x_c)$ is hashed to $h_1[x_1]\oplus h_2[x_2]\oplus\cdots\oplus h_c[x_c]$. This scheme is very fast with character tables in cache. While simple tabulation is not even 4-independent, it does provide many of the guarantees that are normally obtained via higher independence, e.g., for linear probing and cuckoo hashing. Next we consider twisted tabulation, where one input character is "twisted" in a simple way. The resulting hash function has powerful distributional properties: Chernoff-Hoeffding type tail bounds and a very small bias for min-wise hashing. This also yields an extremely fast pseudo-random number generator that is provably good for many classic randomized algorithms and data structures. Finally, we consider double tabulation, where we compose two simple tabulation functions, applying one to the output of the other, and show that this yields very high independence in the classic framework of Carter and Wegman [1977]. In fact, w.h.p., for a given set of size proportional to that of the space consumed, double tabulation gives fully random hashing. We also mention some more elaborate tabulation schemes getting near-optimal independence for given time and space. While these tabulation schemes are all easy to implement and use, their analysis is not.
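
    A minimal sketch of simple tabulation hashing as described above, with an assumed split of 32-bit keys into $c=4$ characters of 8 bits:

```python
import random

# Simple tabulation hashing: a key is split into c characters, each character
# indexes its own precomputed table of random values, and the looked-up values
# are XORed together. The parameters below (32-bit keys, c = 4 characters of
# 8 bits, 32-bit hash values) are one illustrative choice.

C = 4            # characters per key
CHAR_BITS = 8    # |Sigma| = 2^8 = 256 entries per character table
OUT_BITS = 32

def make_tables(seed=0):
    """The c character tables h_1, ..., h_c filled with random hash values."""
    rng = random.Random(seed)
    return [[rng.getrandbits(OUT_BITS) for _ in range(1 << CHAR_BITS)]
            for _ in range(C)]

def simple_tabulation(x, tables):
    """h(x) = h_1[x_1] xor h_2[x_2] xor ... xor h_c[x_c]."""
    mask = (1 << CHAR_BITS) - 1
    h = 0
    for i in range(C):
        h ^= tables[i][(x >> (i * CHAR_BITS)) & mask]
    return h

tables = make_tables()
print(hex(simple_tabulation(0xDEADBEEF, tables)))
```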

    Fast hashing with Strong Concentration Bounds

    Previous work on tabulation hashing by Patrascu and Thorup from STOC'11 on simple tabulation and from SODA'13 on twisted tabulation offered Chernoff-style concentration bounds on hash-based sums, e.g., the number of balls/keys hashing to a given bin, but under some quite severe restrictions on the expected values of these sums. The basic idea in tabulation hashing is to view a key as consisting of $c=O(1)$ characters, e.g., a 64-bit key as $c=8$ characters of 8 bits. The character domain $\Sigma$ should be small enough that character tables of size $|\Sigma|$ fit in fast cache. The schemes then use $O(1)$ tables of this size, so the space of tabulation hashing is $O(|\Sigma|)$. However, the concentration bounds by Patrascu and Thorup only apply if the expected sums are $\ll |\Sigma|$. To see the problem, consider the very simple case where we use tabulation hashing to throw $n$ balls into $m$ bins and want to analyse the number of balls in a given bin. With their concentration bounds, we are fine if $n=m$, for then the expected value is $1$. However, if $m=2$, as when tossing $n$ unbiased coins, the expected value $n/2$ is $\gg |\Sigma|$ for large data sets, e.g., data sets that do not fit in fast cache. To handle expectations that go beyond the limits of our small space, we need a much more advanced analysis of simple tabulation, plus a new tabulation technique that we call tabulation-permutation hashing, which is at most twice as slow as simple tabulation. No other hashing scheme of comparable speed offers similar Chernoff-style concentration bounds. Comment: 54 pages, 3 figures. An extended abstract appeared at the 52nd Annual ACM Symposium on Theory of Computing (STOC 2020).
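
    A small simulation of the balls-into-bins quantity discussed above, i.e. the number of keys hashing to a given bin out of $m$; the per-key pseudorandom hash below is only a stand-in for illustration, not the paper's tabulation-permutation scheme:

```python
import random

# The quantity analysed above: throw n balls (keys) into m bins with a hash
# function and count the balls landing in one fixed bin. Chernoff-style
# concentration means the count stays close to its expectation n/m. The
# per-key pseudorandom hash below is a stand-in for illustration.

def hash_key(x):
    # deterministic per-key "random" value; not tabulation-permutation hashing
    return random.Random(x).getrandbits(64)

def bin_load(n, m, target_bin):
    return sum(1 for x in range(n) if hash_key(x) % m == target_bin)

n, m = 100_000, 2   # m = 2 is the coin-tossing case from the text
load = bin_load(n, m, target_bin=0)
print(load, "vs expectation", n // m)
```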

    Power of $d$ Choices with Simple Tabulation

    Suppose that we are to place $m$ balls into $n$ bins sequentially using the $d$-choice paradigm: for each ball we are given a choice of $d$ bins, according to $d$ hash functions $h_1,\dots,h_d$, and we place the ball in the least loaded of these bins, breaking ties arbitrarily. Our interest is in the number of balls in the fullest bin after all $m$ balls have been placed. Azar et al. [STOC'94] proved that when $m=O(n)$ and the hash functions are fully random, the maximum load is at most $\frac{\lg\lg n}{\lg d}+O(1)$ whp (i.e. with probability $1-O(n^{-\gamma})$ for any choice of $\gamma$). In this paper we suppose that $h_1,\dots,h_d$ are simple tabulation hash functions. Generalising a result by Dahlgaard et al. [SODA'16], we show that for an arbitrary constant $d\geq 2$ the maximum load is $O(\lg\lg n)$ whp, and that the expected maximum load is at most $\frac{\lg\lg n}{\lg d}+O(1)$. We further show that by using a simple tie-breaking algorithm introduced by Vöcking [J.ACM'03], the expected maximum load drops to $\frac{\lg\lg n}{d\lg\varphi_d}+O(1)$, where $\varphi_d$ is the rate of growth of the $d$-ary Fibonacci numbers. Both of these expected bounds match those of the fully random setting. The analysis by Dahlgaard et al. relies on a proof by Pătrașcu and Thorup [J.ACM'11] concerning the use of simple tabulation for cuckoo hashing. We need here a generalisation to $d>2$ hash functions, but the original proof is an 8-page tour de force of ad-hoc arguments that do not appear to generalise. Our main technical contribution is a shorter, simpler and more accessible proof of the result by Pătrașcu and Thorup, where the relevant parts generalise nicely to the analysis of $d$ choices. Comment: Accepted at ICALP 2018.
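
    A minimal simulation of the $d$-choice process described above; the memoised pseudorandom choices stand in for the simple tabulation hash functions $h_1,\dots,h_d$ and are an illustrative assumption:

```python
import random

# The d-choice process: each ball gets d candidate bins (here from memoised
# pseudorandom choices standing in for the hash functions h_1, ..., h_d) and is
# placed in the least loaded of them, breaking ties arbitrarily.

def d_choice_max_load(m_balls, n_bins, d, seed=0):
    rng = random.Random(seed)
    funcs = [dict() for _ in range(d)]      # lazily filled "hash functions"

    def h(i, ball):
        if ball not in funcs[i]:
            funcs[i][ball] = rng.randrange(n_bins)
        return funcs[i][ball]

    load = [0] * n_bins
    for ball in range(m_balls):
        best = min((h(i, ball) for i in range(d)), key=lambda b: load[b])
        load[best] += 1
    return max(load)

# with d = 2 the maximum load should be around lg lg n + O(1)
print(d_choice_max_load(m_balls=100_000, n_bins=100_000, d=2))
```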

    Load Balancing with Dynamic Set of Balls and Bins

    In dynamic load balancing, we wish to distribute balls into bins in an environment where both balls and bins can be added and removed. We want to minimize the maximum load of any bin, but we also want to minimize the number of balls and bins affected when adding or removing a ball or a bin. We want a hashing-style solution where, given the ID of a ball, we can find its bin efficiently. We are given a balancing parameter $c=1+\epsilon$, where $\epsilon\in (0,1)$. With $n$ and $m$ the current numbers of balls and bins, we want no bin with load above $C=\lceil cn/m\rceil$, referred to as the capacity of the bins. We present a scheme where we can locate a ball checking $1+O(\log 1/\epsilon)$ bins in expectation. When inserting or deleting a ball, we expect to move $O(1/\epsilon)$ balls, and when inserting or deleting a bin, we expect to move $O(C/\epsilon)$ balls. Previous bounds were off by a factor $1/\epsilon$. These bounds are best possible when $C=O(1)$, but for larger $C$ we can do much better: let $f=\epsilon C$ if $C\leq \log 1/\epsilon$, $f=\epsilon\sqrt{C}\cdot\sqrt{\log(1/(\epsilon\sqrt{C}))}$ if $\log 1/\epsilon\leq C<\tfrac{1}{2\epsilon^2}$, and $f=1$ if $C\geq \tfrac{1}{2\epsilon^2}$. We show that we expect to move $O(1/f)$ balls when inserting or deleting a ball, and $O(C/f)$ balls when inserting or deleting a bin. For the bounds with larger $C$, we first have to resolve a much simpler probabilistic problem. Place $n$ balls in $m$ bins of capacity $C$, one ball at a time. Each ball picks a uniformly random non-full bin. We show that in expectation and with high probability, the fraction of non-full bins is $\Theta(f)$. Then the expected number of bins that a new ball would have to visit to find one that is not full is $\Theta(1/f)$. As it turns out, we obtain the same complexity in our more complicated scheme where both balls and bins can be added and removed. Comment: Accepted at STOC'21.
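
    A minimal simulation of the simpler probabilistic process described above: $n$ balls placed one at a time into $m$ bins of capacity $C$, each ball choosing a uniformly random non-full bin. The parameter values are illustrative assumptions.

```python
import random

# The simpler process from the abstract: n balls go into m bins of capacity C,
# one ball at a time, each ball picking a uniformly random bin among those not
# yet full. We report the final fraction of non-full bins (shown to be Theta(f)).

def nonfull_fraction(n, m, C, seed=0):
    rng = random.Random(seed)
    load = [0] * m
    non_full = list(range(m))            # indices of bins with load < C
    for _ in range(n):
        j = rng.randrange(len(non_full))
        b = non_full[j]
        load[b] += 1
        if load[b] == C:                 # bin b became full: swap-remove it
            non_full[j] = non_full[-1]
            non_full.pop()
    return len(non_full) / m

# illustrative parameters: epsilon = 0.1, so C = ceil(1.1 * n / m) = 11
n, m = 100_000, 10_000
C = 11
print(nonfull_fraction(n, m, C))
```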