37 research outputs found
On the insertion time of random walk cuckoo hashing
Cuckoo Hashing is a hashing scheme invented by Pagh and Rodler. It uses
d >= 2 distinct hash functions to insert items into the hash table. It has
been an open question for some time as to the expected time for Random Walk
Insertion to add items. We show that if the number of hash functions d = O(1)
is sufficiently large, then the expected insertion time is O(1) per item.
Comment: 9 pages
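A minimal Python sketch of random walk insertion, as one reading of the scheme described in this abstract. The helper names (make_hashes, random_walk_insert), the salted use of Python's built-in hash, and the max_steps cutoff are illustrative assumptions and are not taken from the paper.

```python
import random


def make_hashes(d, m, seed=0):
    """Return d hypothetical hash functions mapping integer keys to slots 0..m-1."""
    rng = random.Random(seed)
    salts = [rng.getrandbits(64) for _ in range(d)]
    # Each function mixes its own salt with the key (an assumption for the sketch).
    return [lambda key, s=s: hash((s, key)) % m for s in salts]


def random_walk_insert(table, hashes, key, rng, max_steps=10_000):
    """Insert key into table; return the number of evictions performed.

    If every candidate slot is occupied, a random occupant is evicted and
    re-inserted in the same way. The slot just used is avoided when possible.
    """
    current = key
    last_slot = None
    for step in range(max_steps):
        slots = [h(current) for h in hashes]
        for s in slots:
            if table[s] is None:
                table[s] = current
                return step
        candidates = [s for s in slots if s != last_slot] or slots
        s = rng.choice(candidates)
        table[s], current = current, table[s]
        last_slot = s
    # Giving up leaves `current` unplaced; a real table would rehash here.
    raise RuntimeError("random walk did not terminate within max_steps")
```

A caller would typically create `table = [None] * m`, build the hash functions once, and call `random_walk_insert` for each key; the returned eviction count is the quantity whose expectation the abstract discusses.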
Tight Thresholds for Cuckoo Hashing via XORSAT
We settle the question of tight thresholds for offline cuckoo hashing. The
problem can be stated as follows: we have n keys to be hashed into m buckets
each capable of holding a single key. Each key has k >= 3 (distinct) associated
buckets chosen uniformly at random and independently of the choices of other
keys. A hash table can be constructed successfully if each key can be placed
into one of its buckets. We seek thresholds alpha_k such that, as n goes to
infinity, if n/m <= alpha for some alpha < alpha_k then a hash table can be
constructed successfully with high probability, and if n/m >= alpha for some
alpha > alpha_k a hash table cannot be constructed successfully with high
probability. Here we are considering the offline version of the problem, where
all keys and hash values are given, so the problem is equivalent to previous
models of multiple-choice hashing. We find the thresholds for all values of k >
2 by showing that they are in fact the same as the previously known thresholds
for the random k-XORSAT problem. We then extend these results to the setting
where keys can have differing number of choices, and provide evidence in the
form of an algorithm for a conjecture extending this result to cuckoo hash
tables that store multiple keys in a bucket.
Comment: Revision 3 contains missing details of proofs, as appendix
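The offline question in this abstract (can every key be placed into one of its buckets?) is a bipartite matching problem. The following Python sketch checks it with a simple augmenting-path search. The function names are hypothetical, and the recursive search is only meant for small instances; it is not the technique used in the paper to derive the thresholds.

```python
import random


def choose_buckets(n, m, k, rng):
    """Give each of n keys k distinct buckets, chosen uniformly at random."""
    return [rng.sample(range(m), k) for _ in range(n)]


def can_place_all(choices, m):
    """Return True if every key can be given its own bucket among its choices."""
    owner = [None] * m  # owner[b] is the key currently stored in bucket b

    def try_place(key, visited):
        # Try each candidate bucket; if it is taken, try to move its owner.
        for b in choices[key]:
            if b in visited:
                continue
            visited.add(b)
            if owner[b] is None or try_place(owner[b], visited):
                owner[b] = key
                return True
        return False

    # If some key has no augmenting path, no complete placement exists.
    return all(try_place(key, set()) for key in range(len(choices)))
```

For example, `can_place_all(choose_buckets(100, 200, 3, random.Random(0)), 200)` tests one random instance with n/m = 0.5 and k = 3. Repeating this for growing n and varying n/m gives an empirical view of the threshold behaviour the abstract describes.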
Fast Scalable Construction of (Minimal Perfect Hash) Functions
Recent advances in random linear systems on finite fields have paved the way
for the construction of constant-time data structures representing static
functions and minimal perfect hash functions using less space with respect to
existing techniques. The main obstruction for any practical application of
these results is the cubic-time Gaussian elimination required to solve these
linear systems: although they can be made very small, the computation is still
too slow to be feasible.
In this paper we describe in detail a number of heuristics and programming
techniques to speed up the resolution of these systems by several orders of
magnitude, making the overall construction competitive with the standard and
widely used MWHC technique, which is based on hypergraph peeling. In
particular, we introduce broadword programming techniques for fast equation
manipulation and a lazy Gaussian elimination algorithm. We also describe a
number of technical improvements to the data structure which further reduce
space usage and improve lookup speed.
Our implementation of these techniques yields a minimal perfect hash function
data structure occupying 2.24 bits per element, compared to 2.68 for MWHC-based
ones, and a static function data structure which reduces the multiplicative
overhead from 1.23 to 1.03.
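To illustrate the idea of manipulating whole equations with word-level operations, here is a small Python sketch of plain Gauss-Jordan elimination over GF(2) in which each equation is packed into one integer, so that adding two equations is a single XOR. The choice of GF(2), the function name solve_gf2, and the bit layout are assumptions made for this sketch. It is the ordinary elimination, not the lazy variant or the broadword code described in the paper.

```python
def solve_gf2(rows, rhs, num_vars):
    """Solve A x = b over GF(2).

    rows[i] is an int whose bit j is the coefficient of x_j in equation i;
    rhs[i] is 0 or 1. Returns a list of 0/1 values (free variables set to 0),
    or None if the system is inconsistent.
    """
    # Store the right-hand side as an extra high bit so one XOR updates both.
    eqs = [r | (b << num_vars) for r, b in zip(rows, rhs)]
    pivots = []  # (variable index, reduced equation) pairs
    for var in range(num_vars):
        bit = 1 << var
        pivot = next((e for e in eqs if e & bit), None)
        if pivot is None:
            continue  # no equation mentions x_var any more: free variable
        eqs.remove(pivot)
        # Eliminate x_var from all other equations, including earlier pivots.
        eqs = [e ^ pivot if e & bit else e for e in eqs]
        pivots = [(v, p ^ pivot if p & bit else p) for v, p in pivots]
        pivots.append((var, pivot))
    # Leftover equations have no coefficients; a nonzero one reads 0 = 1.
    if any(eqs):
        return None
    x = [0] * num_vars
    for var, p in pivots:
        x[var] = (p >> num_vars) & 1
    return x
```

For instance, the system x0 + x1 = 1, x1 + x2 = 0, x0 + x2 = 1 is passed as `solve_gf2([0b011, 0b110, 0b101], [1, 0, 1], 3)`.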
On randomness in Hash functions
In the talk, we shall discuss quality measures for hash functions used in data structures and algorithms, and survey positive and negative results. (This talk is not about cryptographic hash functions.) For the analysis of algorithms involving hash functions, it is often convenient to assume that the hash functions used behave fully randomly; in some cases there is no analysis known that avoids this assumption. In practice, one needs to get by with weaker hash functions that can be generated by randomized algorithms. A well-studied range of applications concerns realizations of dynamic dictionaries (linear probing, chained hashing, dynamic perfect hashing, cuckoo hashing and its generalizations) or Bloom filters and their variants.
A particularly successful and useful means of classification is provided by Carter and Wegman's universal or k-wise independent classes, introduced in 1977. A natural and widely used approach to analyzing an algorithm involving hash functions is to show that it works if a sufficiently strong universal class of hash functions is used, and to substitute one of the known constructions of such classes. This invites research into the question of just how much independence in the hash functions is necessary for an algorithm to work. Some recent analyses that gave impossibility results constructed rather artificial classes that would not work; other results pointed out natural, widely used hash classes that would not work in a particular application.
Only recently was it shown that, under certain assumptions on some entropy present in the set of keys, even 2-wise independent hash classes will lead to strong randomness properties in the hash values. The negative results show that these results may not be taken as justification for using weak hash classes indiscriminately, in particular for key sets with structure. When stronger independence properties are needed for a theoretical analysis, one may resort to classic constructions.
Only in 2003 was it discovered how full randomness can be simulated using only linear space overhead (which is optimal). The "split-and-share" approach can be used to justify the full randomness assumption in some situations in which full randomness is needed for the analysis to go through, as in many applications involving multiple hash functions (e.g., generalized versions of cuckoo hashing with multiple hash functions or larger bucket sizes, load balancing, Bloom filters and variants, or minimal perfect hash function constructions).
For practice, efficiency considerations beyond constant factors are important. It is not hard to construct very efficient 2-wise independent classes. Using k-wise independent classes for constant k bigger than 3 has become feasible in practice only by new constructions involving tabulation. This fits well with the quite new result that linear probing works with 5-independent hash functions.
Recent developments suggest that the classification of hash function constructions by their degree of independence alone may not be adequate in some cases. Thus, one may want to analyze the behavior of specific hash classes in specific applications, circumventing the concept of k-wise independence. Several such results were recently achieved concerning hash functions that utilize tabulation. In particular, if the analysis of the application involves using randomness properties in graphs and hypergraphs (generalized cuckoo hashing, also in the version with a "stash", or load balancing), a hash class combining k-wise independence with tabulation has turned out to be very powerful.
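Two of the hash classes mentioned in this abstract can be sketched in a few lines of Python. The first is the classic linear class over a prime field, which is 2-wise independent before the final reduction to the table size; the second is simple tabulation. The function names, the choice of the prime 2^61 - 1, and the 32-bit key and output widths are assumptions made for illustration only.

```python
import random

P = (1 << 61) - 1  # a Mersenne prime, assumed larger than every key


def make_linear_hash(m, rng):
    """Draw h(x) = ((a*x + b) mod P) mod m with random a, b in [0, P)."""
    a = rng.randrange(P)
    b = rng.randrange(P)
    return lambda x: ((a * x + b) % P) % m


def make_tabulation_hash(rng, chars=4, bits=32):
    """Draw a simple tabulation hash for keys of `chars` bytes.

    The key is split into bytes; each byte indexes its own table of random
    values, and the looked-up values are XORed together.
    """
    tables = [[rng.getrandbits(bits) for _ in range(256)] for _ in range(chars)]

    def h(x):
        out = 0
        for i in range(chars):
            out ^= tables[i][(x >> (8 * i)) & 0xFF]
        return out

    return h
```

Simple tabulation is 3-wise but not 4-wise independent: for any tabulation hash t drawn as above, `t(0x0000) ^ t(0x0001) ^ t(0x0100) ^ t(0x0101)` is always 0, because every table entry involved appears an even number of times. This is one concrete reason why the degree of independence alone does not capture how well such a class behaves in an application.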
Load thresholds for cuckoo hashing with overlapping blocks
Dietzfelbinger and Weidling [DW07] proposed a natural variation of cuckoo
hashing where each of cn objects is assigned k = 2 intervals of size ell
in a linear (or cyclic) hash table of size n and both start points are chosen
independently and uniformly at random. Each object must be placed into a table
cell within its intervals, but each cell can only hold one object. Experiments
suggested that this scheme outperforms the variant with blocks in which
intervals are aligned at multiples of ell. In particular, the load threshold
is higher, i.e. the load c that can be achieved with high probability. For
instance, Lehman and Panigrahy [LP09] empirically observed the threshold for
ell = 2 to be around 96.5% as compared to roughly 89.7% using blocks.
They managed to pin down the asymptotics of the thresholds for large ell,
but the precise values resisted rigorous analysis.
We establish a method to determine these load thresholds for all ell >= 2, and, in fact, for general k >= 2. For instance, for k = ell = 2 we
get approximately 96.4995%. The key tool we employ is an insightful and general
theorem due to Leconte, Lelarge, and Massoulié [LLM13], which adapts methods
from statistical physics to the world of hypergraph orientability. In effect,
the orientability thresholds for our graph families are determined by belief
propagation equations for certain graph limits. As a side note we provide
experimental evidence suggesting that placements can be constructed in linear
time with loads close to the threshold using an adapted version of an algorithm
by Khosla [Kho13].
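The two variants compared in this abstract (freely overlapping intervals versus aligned blocks) can be set up for small experiments as in the Python sketch below. It assumes two intervals per object, a cyclic table, and a table size divisible by the interval size in the aligned case. The function names are hypothetical, and the placement check is a plain augmenting-path matching suitable only for small tables; it is neither the belief-propagation analysis nor the adapted algorithm of Khosla mentioned above.

```python
import random


def interval(start, size, n):
    """Cells start, start+1, ..., start+size-1 in a cyclic table of n cells."""
    return [(start + i) % n for i in range(size)]


def random_starts(num_objects, n, size, aligned, rng):
    """Pick two start points per object; aligned=True gives the block variant."""
    def pick():
        if aligned:
            return size * rng.randrange(n // size)  # multiples of size only
        return rng.randrange(n)

    return [(pick(), pick()) for _ in range(num_objects)]


def placeable(starts, n, size):
    """Return True if every object can get its own cell within its intervals."""
    cells = [sorted(set(interval(a, size, n) + interval(b, size, n)))
             for a, b in starts]
    owner = [None] * n  # owner[c] is the object stored in cell c

    def try_place(obj, visited):
        for c in cells[obj]:
            if c in visited:
                continue
            visited.add(c)
            if owner[c] is None or try_place(owner[c], visited):
                owner[c] = obj
                return True
        return False

    return all(try_place(obj, set()) for obj in range(len(starts)))
```

Calling `placeable(random_starts(num_objects, n, 2, aligned, rng), n, 2)` for both values of `aligned` and for increasing `num_objects / n` gives a rough empirical comparison of the two variants on small tables.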