9,363 research outputs found
Fast and Powerful Hashing using Tabulation
Randomized algorithms are often enjoyed for their simplicity, but the hash
functions employed to yield the desired probabilistic guarantees are often too
complicated to be practical. Here we survey recent results on how simple
hashing schemes based on tabulation provide unexpectedly strong guarantees.
Simple tabulation hashing dates back to Zobrist [1970]. Keys are viewed as
consisting of characters and we have precomputed character tables
mapping characters to random hash values. A key
is hashed to . This schemes is
very fast with character tables in cache. While simple tabulation is not even
4-independent, it does provide many of the guarantees that are normally
obtained via higher independence, e.g., linear probing and Cuckoo hashing.
Next we consider twisted tabulation where one input character is "twisted" in
a simple way. The resulting hash function has powerful distributional
properties: Chernoff-Hoeffding type tail bounds and a very small bias for
min-wise hashing. This also yields an extremely fast pseudo-random number
generator that is provably good for many classic randomized algorithms and
data-structures.
Finally, we consider double tabulation where we compose two simple tabulation
functions, applying one to the output of the other, and show that this yields
very high independence in the classic framework of Carter and Wegman [1977]. In
fact, w.h.p., for a given set of size proportional to that of the space
consumed, double tabulation gives fully-random hashing. We also mention some
more elaborate tabulation schemes getting near-optimal independence for given
time and space.
While these tabulation schemes are all easy to implement and use, their
analysis is not
Fast Exact Search in Hamming Space with Multi-Index Hashing
There is growing interest in representing image data and feature descriptors
using compact binary codes for fast near neighbor search. Although binary codes
are motivated by their use as direct indices (addresses) into a hash table,
codes longer than 32 bits are not being used as such, as it was thought to be
ineffective. We introduce a rigorous way to build multiple hash tables on
binary code substrings that enables exact k-nearest neighbor search in Hamming
space. The approach is storage efficient and straightforward to implement.
Theoretical analysis shows that the algorithm exhibits sub-linear run-time
behavior for uniformly distributed codes. Empirical results show dramatic
speedups over a linear scan baseline for datasets of up to one billion codes of
64, 128, or 256 bits
More Analysis of Double Hashing for Balanced Allocations
With double hashing, for a key , one generates two hash values and
, and then uses combinations for
to generate multiple hash values in the range from the initial two.
For balanced allocations, keys are hashed into a hash table where each bucket
can hold multiple keys, and each key is placed in the least loaded of
choices. It has been shown previously that asymptotically the performance of
double hashing and fully random hashing is the same in the balanced allocation
paradigm using fluid limit methods. Here we extend a coupling argument used by
Lueker and Molodowitch to show that double hashing and ideal uniform hashing
are asymptotically equivalent in the setting of open address hash tables to the
balanced allocation setting, providing further insight into this phenomenon. We
also discuss the potential for and bottlenecks limiting the use this approach
for other multiple choice hashing schemes.Comment: 13 pages ; current draft ; will be submitted to conference shortl
Unconditional security from noisy quantum storage
We consider the implementation of two-party cryptographic primitives based on
the sole assumption that no large-scale reliable quantum storage is available
to the cheating party. We construct novel protocols for oblivious transfer and
bit commitment, and prove that realistic noise levels provide security even
against the most general attack. Such unconditional results were previously
only known in the so-called bounded-storage model which is a special case of
our setting. Our protocols can be implemented with present-day hardware used
for quantum key distribution. In particular, no quantum storage is required for
the honest parties.Comment: 25 pages (IEEE two column), 13 figures, v4: published version (to
appear in IEEE Transactions on Information Theory), including bit wise
min-entropy sampling. however, for experimental purposes block sampling can
be much more convenient, please see v3 arxiv version if needed. See
arXiv:0911.2302 for a companion paper addressing aspects of a practical
implementation using block samplin
Efficient Gauss Elimination for Near-Quadratic Matrices with One Short Random Block per Row, with Applications
In this paper we identify a new class of sparse near-quadratic random Boolean matrices that have full row rank over F_2 = {0,1} with high probability and can be transformed into echelon form in almost linear time by a simple version of Gauss elimination. The random matrix with dimensions n(1-epsilon) x n is generated as follows: In each row, identify a block of length L = O((log n)/epsilon) at a random position. The entries outside the block are 0, the entries inside the block are given by fair coin tosses. Sorting the rows according to the positions of the blocks transforms the matrix into a kind of band matrix, on which, as it turns out, Gauss elimination works very efficiently with high probability. For the proof, the effects of Gauss elimination are interpreted as a ("coin-flipping") variant of Robin Hood hashing, whose behaviour can be captured in terms of a simple Markov model from queuing theory. Bounds for expected construction time and high success probability follow from results in this area. They readily extend to larger finite fields in place of F_2.
By employing hashing, this matrix family leads to a new implementation of a retrieval data structure, which represents an arbitrary function f: S -> {0,1} for some set S of m = (1-epsilon)n keys. It requires m/(1-epsilon) bits of space, construction takes O(m/epsilon^2) expected time on a word RAM, while queries take O(1/epsilon) time and access only one contiguous segment of O((log m)/epsilon) bits in the representation (O(1/epsilon) consecutive words on a word RAM). The method is readily implemented and highly practical, and it is competitive with state-of-the-art methods. In a more theoretical variant, which works only for unrealistically large S, we can even achieve construction time O(m/epsilon) and query time O(1), accessing O(1) contiguous memory words for a query. By well-established methods the retrieval data structure leads to efficient constructions of (static) perfect hash functions and (static) Bloom filters with almost optimal space and very local storage access patterns for queries
- …