1,035 research outputs found
Strongly universal string hashing is fast
We present fast strongly universal string hashing families: they can process
data at a rate of 0.2 CPU cycle per byte. Maybe surprisingly, we find that
these families---though they require a large buffer of random numbers---are
often faster than popular hash functions with weaker theoretical guarantees.
Moreover, conventional wisdom is that hash functions with fewer multiplications
are faster. Yet we find that they may fail to be faster due to operation
pipelining. We present experimental results on several processors including
low-powered processors. Our tests include hash functions designed for
processors with the Carry-Less Multiplication (CLMUL) instruction set. We also
prove, using accessible proofs, the strong universality of our families.Comment: Software is available at
http://code.google.com/p/variablelengthstringhashing/ and
https://github.com/lemire/StronglyUniversalStringHashin
Regular and almost universal hashing: an efficient implementation
Random hashing can provide guarantees regarding the performance of data
structures such as hash tables---even in an adversarial setting. Many existing
families of hash functions are universal: given two data objects, the
probability that they have the same hash value is low given that we pick hash
functions at random. However, universality fails to ensure that all hash
functions are well behaved. We further require regularity: when picking data
objects at random they should have a low probability of having the same hash
value, for any fixed hash function. We present the efficient implementation of
a family of non-cryptographic hash functions (PM+) offering good running times,
good memory usage as well as distinguishing theoretical guarantees: almost
universality and component-wise regularity. On a variety of platforms, our
implementations are comparable to the state of the art in performance. On
recent Intel processors, PM+ achieves a speed of 4.7 bytes per cycle for 32-bit
outputs and 3.3 bytes per cycle for 64-bit outputs. We review vectorization
through SIMD instructions (e.g., AVX2) and optimizations for superscalar
execution.Comment: accepted for publication in Software: Practice and Experience in
September 201
The universality of iterated hashing over variable-length strings
Iterated hash functions process strings recursively, one character at a time.
At each iteration, they compute a new hash value from the preceding hash value
and the next character. We prove that iterated hashing can be pairwise
independent, but never 3-wise independent. We show that it can be almost
universal over strings much longer than the number of hash values; we bound the
maximal string length given the collision probability
On perfect hashing of numbers with sparse digit representation via multiplication by a constant
Consider the set of vectors over a field having non-zero coefficients only in
a fixed sparse set and multiplication defined by convolution, or the set of
integers having non-zero digits (in some base ) in a fixed sparse set. We
show the existence of an optimal (resp. almost-optimal in the latter case)
`magic' multiplier constant that provides a perfect hash function which
transfers the information from the given sparse coefficients into consecutive
digits. Studying the convolution case we also obtain a result of non-degeneracy
for Schur functions as polynomials in the elementary symmetric functions in
positive characteristic.Comment: 5 page
Fast and Powerful Hashing using Tabulation
Randomized algorithms are often enjoyed for their simplicity, but the hash
functions employed to yield the desired probabilistic guarantees are often too
complicated to be practical. Here we survey recent results on how simple
hashing schemes based on tabulation provide unexpectedly strong guarantees.
Simple tabulation hashing dates back to Zobrist [1970]. Keys are viewed as
consisting of characters and we have precomputed character tables
mapping characters to random hash values. A key
is hashed to . This schemes is
very fast with character tables in cache. While simple tabulation is not even
4-independent, it does provide many of the guarantees that are normally
obtained via higher independence, e.g., linear probing and Cuckoo hashing.
Next we consider twisted tabulation where one input character is "twisted" in
a simple way. The resulting hash function has powerful distributional
properties: Chernoff-Hoeffding type tail bounds and a very small bias for
min-wise hashing. This also yields an extremely fast pseudo-random number
generator that is provably good for many classic randomized algorithms and
data-structures.
Finally, we consider double tabulation where we compose two simple tabulation
functions, applying one to the output of the other, and show that this yields
very high independence in the classic framework of Carter and Wegman [1977]. In
fact, w.h.p., for a given set of size proportional to that of the space
consumed, double tabulation gives fully-random hashing. We also mention some
more elaborate tabulation schemes getting near-optimal independence for given
time and space.
While these tabulation schemes are all easy to implement and use, their
analysis is not
Fast evaluation of union-intersection expressions
We show how to represent sets in a linear space data structure such that
expressions involving unions and intersections of sets can be computed in a
worst-case efficient way. This problem has applications in e.g. information
retrieval and database systems. We mainly consider the RAM model of
computation, and sets of machine words, but also state our results in the I/O
model. On a RAM with word size , a special case of our result is that the
intersection of (preprocessed) sets, containing elements in total, can
be computed in expected time , where is the
number of elements in the intersection. If the first of the two terms
dominates, this is a factor faster than the standard solution of
merging sorted lists. We show a cell probe lower bound of time , meaning that our upper bound is nearly
optimal for small . Our algorithm uses a novel combination of approximate
set representations and word-level parallelism
On randomness in Hash functions
In the talk, we shall discuss quality measures for hash functions used in data structures and algorithms, and survey positive and negative results. (This talk is not about cryptographic hash functions.) For the analysis of algorithms involving hash functions, it is often convenient to assume the hash functions used behave fully randomly; in some cases there is no analysis known that avoids this assumption. In practice, one needs to get by with weaker hash functions that can be generated by randomized algorithms. A well-studied range of applications concern realizations of dynamic dictionaries (linear probing, chained hashing, dynamic perfect hashing, cuckoo hashing and its generalizations) or Bloom filters and their variants. A particularly successful and useful means of classification are Carter and Wegman's universal or k-wise independent classes, introduced in 1977. A natural and widely used approach to analyzing an algorithm involving hash functions is to show that it works if a sufficiently strong universal class of hash functions is used, and to substitute one of the known constructions of such classes. This invites research into the question of just how much independence in the hash functions is necessary for an algorithm to work. Some recent analyses that gave impossibility results constructed rather artificial classes that would not work; other results pointed out natural, widely used hash classes that would not work in a particular application. Only recently it was shown that under certain assumptions on some entropy present in the set of keys even 2-wise independent hash classes will lead to strong randomness properties in the hash values. The negative results show that these results may not be taken as justification for using weak hash classes indiscriminately, in particular for key sets with structure. When stronger independence properties are needed for a theoretical analysis, one may resort to classic constructions. Only in 2003 it was found out how full randomness can be simulated using only linear space overhead (which is optimal). The "split-and-share" approach can be used to justify the full randomness assumption in some situations in which full randomness is needed for the analysis to go through, like in many applications involving multiple hash functions (e.g., generalized versions of cuckoo hashing with multiple hash functions or larger bucket sizes, load balancing, Bloom filters and variants, or minimal perfect hash function constructions). For practice, efficiency considerations beyond constant factors are important. It is not hard to construct very efficient 2-wise independent classes. Using k-wise independent classes for constant k bigger than 3 has become feasible in practice only by new constructions involving tabulation. This goes together well with the quite new result that linear probing works with 5-independent hash functions. Recent developments suggest that the classification of hash function constructions by their degree of independence alone may not be adequate in some cases. Thus, one may want to analyze the behavior of specific hash classes in specific applications, circumventing the concept of k-wise independence. Several such results were recently achieved concerning hash functions that utilize tabulation. In particular if the analysis of the application involves using randomness properties in graphs and hypergraphs (generalized cuckoo hashing, also in the version with a "stash", or load balancing), a hash class combining k-wise independence with tabulation has turned out to be very powerful
- …