8,072 research outputs found
Regular and almost universal hashing: an efficient implementation
Random hashing can provide guarantees regarding the performance of data
structures such as hash tables---even in an adversarial setting. Many existing
families of hash functions are universal: given two data objects, the
probability that they have the same hash value is low given that we pick hash
functions at random. However, universality fails to ensure that all hash
functions are well behaved. We further require regularity: when picking data
objects at random they should have a low probability of having the same hash
value, for any fixed hash function. We present the efficient implementation of
a family of non-cryptographic hash functions (PM+) offering good running times,
good memory usage as well as distinguishing theoretical guarantees: almost
universality and component-wise regularity. On a variety of platforms, our
implementations are comparable to the state of the art in performance. On
recent Intel processors, PM+ achieves a speed of 4.7 bytes per cycle for 32-bit
outputs and 3.3 bytes per cycle for 64-bit outputs. We review vectorization
through SIMD instructions (e.g., AVX2) and optimizations for superscalar
execution.Comment: accepted for publication in Software: Practice and Experience in
September 201
Simple, compact and robust approximate string dictionary
This paper is concerned with practical implementations of approximate string
dictionaries that allow edit errors. In this problem, we have as input a
dictionary of strings of total length over an alphabet of size
. Given a bound and a pattern of length , a query has to
return all the strings of the dictionary which are at edit distance at most
from , where the edit distance between two strings and is defined as
the minimum-cost sequence of edit operations that transform into . The
cost of a sequence of operations is defined as the sum of the costs of the
operations involved in the sequence. In this paper, we assume that each of
these operations has unit cost and consider only three operations: deletion of
one character, insertion of one character and substitution of a character by
another. We present a practical implementation of the data structure we
recently proposed and which works only for one error. We extend the scheme to
. Our implementation has many desirable properties: it has a very
fast and space-efficient building algorithm. The dictionary data structure is
compact and has fast and robust query time. Finally our data structure is
simple to implement as it only uses basic techniques from the literature,
mainly hashing (linear probing and hash signatures) and succinct data
structures (bitvectors supporting rank queries).Comment: Accepted to a journal (19 pages, 2 figures
Strongly universal string hashing is fast
We present fast strongly universal string hashing families: they can process
data at a rate of 0.2 CPU cycle per byte. Maybe surprisingly, we find that
these families---though they require a large buffer of random numbers---are
often faster than popular hash functions with weaker theoretical guarantees.
Moreover, conventional wisdom is that hash functions with fewer multiplications
are faster. Yet we find that they may fail to be faster due to operation
pipelining. We present experimental results on several processors including
low-powered processors. Our tests include hash functions designed for
processors with the Carry-Less Multiplication (CLMUL) instruction set. We also
prove, using accessible proofs, the strong universality of our families.Comment: Software is available at
http://code.google.com/p/variablelengthstringhashing/ and
https://github.com/lemire/StronglyUniversalStringHashin
Recommended from our members
GPERF : a perfect hash function generator
gperf is a widely available perfect hash function generator written in C++. It automates a common system software operation: keyword recognition. gperf translates an n element user-specified keyword list keyfile into source code containing a k element lookup table and a pair of functions, phash and in_word_set. phash uniquely maps keywords in keyfile onto the range 0 .. k - 1, where k >/= n. If k = n, then phash is considered a minimal perfect hash function. in_word_set uses phash to determine whether a particular string of characters str occurs in the keyfile, using at most one string comparison.This paper describes the user-interface, options, features, algorithm design and implementation strategies incorporated in gperf. It also presents the results from an empirical comparison between gperf-generated recognizers and other popular techniques for reserved word lookup
Postprocessing for quantum random number generators: entropy evaluation and randomness extraction
Quantum random-number generators (QRNGs) can offer a means to generate
information-theoretically provable random numbers, in principle. In practice,
unfortunately, the quantum randomness is inevitably mixed with classical
randomness due to classical noises. To distill this quantum randomness, one
needs to quantify the randomness of the source and apply a randomness
extractor. Here, we propose a generic framework for evaluating quantum
randomness of real-life QRNGs by min-entropy, and apply it to two different
existing quantum random-number systems in the literature. Moreover, we provide
a guideline of QRNG data postprocessing for which we implement two
information-theoretically provable randomness extractors: Toeplitz-hashing
extractor and Trevisan's extractor.Comment: 13 pages, 2 figure
The universality of iterated hashing over variable-length strings
Iterated hash functions process strings recursively, one character at a time.
At each iteration, they compute a new hash value from the preceding hash value
and the next character. We prove that iterated hashing can be pairwise
independent, but never 3-wise independent. We show that it can be almost
universal over strings much longer than the number of hash values; we bound the
maximal string length given the collision probability
- …