10 research outputs found

    On randomness in Hash functions

    Get PDF
    In the talk, we shall discuss quality measures for hash functions used in data structures and algorithms, and survey positive and negative results. (This talk is not about cryptographic hash functions.) For the analysis of algorithms involving hash functions, it is often convenient to assume the hash functions used behave fully randomly; in some cases there is no analysis known that avoids this assumption. In practice, one needs to get by with weaker hash functions that can be generated by randomized algorithms. A well-studied range of applications concern realizations of dynamic dictionaries (linear probing, chained hashing, dynamic perfect hashing, cuckoo hashing and its generalizations) or Bloom filters and their variants. A particularly successful and useful means of classification are Carter and Wegman's universal or k-wise independent classes, introduced in 1977. A natural and widely used approach to analyzing an algorithm involving hash functions is to show that it works if a sufficiently strong universal class of hash functions is used, and to substitute one of the known constructions of such classes. This invites research into the question of just how much independence in the hash functions is necessary for an algorithm to work. Some recent analyses that gave impossibility results constructed rather artificial classes that would not work; other results pointed out natural, widely used hash classes that would not work in a particular application. Only recently it was shown that under certain assumptions on some entropy present in the set of keys even 2-wise independent hash classes will lead to strong randomness properties in the hash values. The negative results show that these results may not be taken as justification for using weak hash classes indiscriminately, in particular for key sets with structure. When stronger independence properties are needed for a theoretical analysis, one may resort to classic constructions. Only in 2003 it was found out how full randomness can be simulated using only linear space overhead (which is optimal). The "split-and-share" approach can be used to justify the full randomness assumption in some situations in which full randomness is needed for the analysis to go through, like in many applications involving multiple hash functions (e.g., generalized versions of cuckoo hashing with multiple hash functions or larger bucket sizes, load balancing, Bloom filters and variants, or minimal perfect hash function constructions). For practice, efficiency considerations beyond constant factors are important. It is not hard to construct very efficient 2-wise independent classes. Using k-wise independent classes for constant k bigger than 3 has become feasible in practice only by new constructions involving tabulation. This goes together well with the quite new result that linear probing works with 5-independent hash functions. Recent developments suggest that the classification of hash function constructions by their degree of independence alone may not be adequate in some cases. Thus, one may want to analyze the behavior of specific hash classes in specific applications, circumventing the concept of k-wise independence. Several such results were recently achieved concerning hash functions that utilize tabulation. In particular if the analysis of the application involves using randomness properties in graphs and hypergraphs (generalized cuckoo hashing, also in the version with a "stash", or load balancing), a hash class combining k-wise independence with tabulation has turned out to be very powerful

    Fast and Powerful Hashing using Tabulation

    Get PDF
    Randomized algorithms are often enjoyed for their simplicity, but the hash functions employed to yield the desired probabilistic guarantees are often too complicated to be practical. Here we survey recent results on how simple hashing schemes based on tabulation provide unexpectedly strong guarantees. Simple tabulation hashing dates back to Zobrist [1970]. Keys are viewed as consisting of cc characters and we have precomputed character tables h1,...,hch_1,...,h_c mapping characters to random hash values. A key x=(x1,...,xc)x=(x_1,...,x_c) is hashed to h1[x1]⊕h2[x2].....⊕hc[xc]h_1[x_1] \oplus h_2[x_2].....\oplus h_c[x_c]. This schemes is very fast with character tables in cache. While simple tabulation is not even 4-independent, it does provide many of the guarantees that are normally obtained via higher independence, e.g., linear probing and Cuckoo hashing. Next we consider twisted tabulation where one input character is "twisted" in a simple way. The resulting hash function has powerful distributional properties: Chernoff-Hoeffding type tail bounds and a very small bias for min-wise hashing. This also yields an extremely fast pseudo-random number generator that is provably good for many classic randomized algorithms and data-structures. Finally, we consider double tabulation where we compose two simple tabulation functions, applying one to the output of the other, and show that this yields very high independence in the classic framework of Carter and Wegman [1977]. In fact, w.h.p., for a given set of size proportional to that of the space consumed, double tabulation gives fully-random hashing. We also mention some more elaborate tabulation schemes getting near-optimal independence for given time and space. While these tabulation schemes are all easy to implement and use, their analysis is not

    Strongly universal string hashing is fast

    Get PDF
    We present fast strongly universal string hashing families: they can process data at a rate of 0.2 CPU cycle per byte. Maybe surprisingly, we find that these families---though they require a large buffer of random numbers---are often faster than popular hash functions with weaker theoretical guarantees. Moreover, conventional wisdom is that hash functions with fewer multiplications are faster. Yet we find that they may fail to be faster due to operation pipelining. We present experimental results on several processors including low-powered processors. Our tests include hash functions designed for processors with the Carry-Less Multiplication (CLMUL) instruction set. We also prove, using accessible proofs, the strong universality of our families.Comment: Software is available at http://code.google.com/p/variablelengthstringhashing/ and https://github.com/lemire/StronglyUniversalStringHashin

    A Seven-Dimensional Analysis of Hashing Methods and its Implications on Query Processing

    Get PDF
    ABSTRACT Hashing is a solved problem. It allows us to get constant time access for lookups. Hashing is also simple. It is safe to use an arbitrary method as a black box and expect good performance, and optimizations to hashing can only improve it by a negligible delta. Why are all of the previous statements plain wrong? That is what this paper is about. In this paper we thoroughly study hashing for integer keys and carefully analyze the most common hashing methods in a five-dimensional requirements space: () data-distribution, () load factor, () dataset size, () read/write-ratio, and () un/successfulratio. Each point in that design space may potentially suggest a different hashing scheme, and additionally also a different hash function. We show that a right or wrong decision in picking the right hashing scheme and hash function combination may lead to significant difference in performance. To substantiate this claim, we carefully analyze two additional dimensions: () five representative hashing schemes (which includes an improved variant of Robin Hood hashing), () four important classes of hash functions widely used today. That is, we consider 20 different combinations in total. Finally, we also provide a glimpse about the effect of table memory layout and the use of SIMD instructions. Our study clearly indicates that picking the right combination may have considerable impact on insert and lookup performance, as well as memory footprint. A major conclusion of our work is that hashing should be considered a white box before blindly using it in applications, such as query processing. Finally, we also provide a strong guideline about when to use which hashing method

    On risks of using cuckoo hashing with simple universal hash classes

    No full text
    Cuckoo hashing, introduced by Pagh and Rodler [10], is a dynamic dictionary data structure for storing a set S of n keys from a universe U, with constant lookup time and amortized expected constant insertion time. For the analysis, space (2+ε)n and Ω(log n)-wise independence of the hash functions is sufficient. In experiments mentioned in [10], several weaker hash classes worked well; however, a certain simple multiplicative hash family worked badly. In this paper, we prove that the failure probability is high when cuckoo hashing is run with the multiplicative class or with the very common class of linear hash functions over a prime field, even if space 4n is provided. The key set S is fully random, but it must be relatively dense in the universe U of all keys (like |S | ≥ |U | 11/12). The bad behavior and the fact that this effect depends on the density of S in U can also be observed in experiments. The result transfers to larger universes if the keys are chosen from a suitable smaller domain. Viewed from a different perspective, our result illustrates that care must be taken when applying a recent result of Mitzenmacher and Vadhan ([12], SODA 2008) proving good behavior of universal hash classes in combination with key sets that have some entropy. Their result is applicable to cuckoo hashing. A technical hypothesis in [12], namely the assumption that either the “collision probability ” or the “maximum probability ” is small, translates into the condition that |S | is relatively small in comparison to |U|. Our result shows that the result from [12] on 2-universal classes ceases to hold if |S|/|U | is not small enough, even for very common 2-universal hash classes and fully random key sets.

    On Risks of Using Cuckoo Hashing with Simple Universal Hash Classes

    No full text

    15th Scandinavian Symposium and Workshops on Algorithm Theory: SWAT 2016, June 22-24, 2016, Reykjavik, Iceland

    Get PDF
    corecore