Sliding Block Hashing (Slick) -- Basic Algorithmic Ideas
We present Sliding Block Hashing (Slick), a simple hash table
data structure that combines high performance with very good space efficiency.
This preliminary report outlines avenues for analysis and implementation that
we intend to pursue.
ShockHash: Towards Optimal-Space Minimal Perfect Hashing Beyond Brute-Force
A minimal perfect hash function (MPHF) maps a set $S$ of $n$ keys to the
first $n$ integers without collisions. There is a lower bound of
$n \log_2 e - O(\log n)$ bits of space needed to represent an MPHF. A matching
upper bound is obtained using the brute-force algorithm that tries random hash
functions until stumbling on an MPHF and stores that function's seed. In
expectation, $e^n \mathrm{poly}(n)$ seeds need to be tested. The most
space-efficient previous algorithms for constructing MPHFs all use such a
brute-force approach as a basic building block.
In this paper, we introduce ShockHash - Small, heavily overloaded cuckoo hash
tables. ShockHash uses two hash functions $h_0$ and $h_1$, hoping for the
existence of a function $f : S \rightarrow \{0,1\}$ such that
$x \mapsto h_{f(x)}(x)$ is an MPHF on $S$. In graph terminology, ShockHash generates
$n$-edge random graphs until stumbling on a pseudoforest - a graph where each
component contains as many edges as nodes. Using cuckoo hashing, ShockHash then
derives an MPHF from the pseudoforest in linear time. It uses a 1-bit retrieval
data structure to store $f$ using $n + o(n)$ bits.
By carefully analyzing the probability that a random graph is a pseudoforest,
we show that ShockHash needs to try only $(e/2)^n \mathrm{poly}(n)$ hash
function seeds in expectation, reducing the space for storing the seed by
roughly $n$ bits. This makes ShockHash almost a factor $2^n$ faster than
brute-force, while maintaining the asymptotically optimal space consumption. An
implementation within the RecSplit framework yields the currently most space
efficient MPHFs, i.e., competing approaches need about two orders of magnitude
more work to achieve the same space.
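The seed search described in this abstract can be sketched in a few lines. The following is only an illustration under stated assumptions, not the authors' implementation: the seeded hash `h` (built on BLAKE2b), the function names, and the plain Python list that stands in for the 1-bit retrieval data structure are all hypothetical. It contrasts the brute-force baseline with a search that accepts a seed as soon as the graph with one edge (h_0(x), h_1(x)) per key is a pseudoforest, and then orients the edges with cuckoo-style insertion.

```python
import hashlib
from itertools import count


def h(key, seed, which, n):
    """Hypothetical seeded hash: slot in [0, n) for hash function number `which`."""
    data = f"{seed}:{which}:{key}".encode()
    return int.from_bytes(hashlib.blake2b(data, digest_size=8).digest(), "big") % n


def brute_force_seed(keys):
    """Baseline: try seeds until one hash function alone is a bijection."""
    n = len(keys)
    for seed in count():
        if len({h(k, seed, 0, n) for k in keys}) == n:
            return seed


def is_pseudoforest(edges, n):
    """True if no connected component has more edges than nodes (union-find)."""
    parent = list(range(n))
    nodes = [1] * n
    edge_count = [0] * n

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for u, v in edges:
        ru, rv = find(u), find(v)
        if ru != rv:
            parent[ru] = rv
            nodes[rv] += nodes[ru]
            edge_count[rv] += edge_count[ru]
        edge_count[rv] += 1
        if edge_count[rv] > nodes[rv]:
            return False
    return True


def orient(edges, n):
    """Cuckoo-style insertion: pick endpoint 0 or 1 per key so all slots differ.
    Only call this on a pseudoforest, otherwise the eviction loop may not end."""
    slot = [None] * n
    choice = [0] * len(edges)
    for i in range(len(edges)):
        cur, c = i, 0
        while cur is not None:
            p = edges[cur][c]
            evicted, slot[p], choice[cur] = slot[p], cur, c
            cur = evicted
            if cur is not None:
                c = 1 - choice[cur]  # evicted key moves to its other endpoint
    return choice


def shockhash_seed(keys):
    """Try seeds until the n-edge graph (h0(x), h1(x)) is a pseudoforest."""
    n = len(keys)
    for seed in count():
        edges = [(h(k, seed, 0, n), h(k, seed, 1, n)) for k in keys]
        if is_pseudoforest(edges, n):
            return seed, orient(edges, n)


keys = [f"key{i}" for i in range(8)]
seed, f = shockhash_seed(keys)
# f[i] plays the role of the retrieval data structure: one bit per key.
slots = [h(k, seed, f[i], len(keys)) for i, k in enumerate(keys)]
```

With this setup, any seed that succeeds for brute force also yields a pseudoforest (choosing h_0 for every key is a valid orientation), so the two-function search never needs a larger seed than the baseline.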
SicHash -- Small Irregular Cuckoo Tables for Perfect Hashing
A Perfect Hash Function (PHF) is a hash function that has no collisions on a
given input set. PHFs can be used for space efficient storage of data in an
array, or for determining a compact representative of each object in the set.
In this paper, we present the PHF construction algorithm SicHash - Small
Irregular Cuckoo Tables for Perfect Hashing. At its core, SicHash uses a known
technique: It places objects in a cuckoo hash table and then stores the final
hash function choice of each object in a retrieval data structure. We combine
the idea with irregular cuckoo hashing, where each object has a different
number of hash functions. Additionally, we use many small tables that we
overload beyond their asymptotic maximum load factor. The most space efficient
competitors often use brute force methods to determine the PHFs. SicHash
provides a more direct construction algorithm that only rarely needs to
recompute parts. Our implementation improves the state of the art in terms of
space usage versus construction time for a wide range of configurations. At the
same time, it provides very fast queries.
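The core technique named in this abstract (place objects in a small table, then store which hash function each object ended up using) can be sketched as follows. This is a rough illustration under assumptions, not SicHash itself: the hash `h`, the split into keys with 2 and keys with 4 candidate slots, and all function names are hypothetical. Placement uses an augmenting-path search in place of a real cuckoo insertion loop, and a plain list stands in for the retrieval data structure.

```python
import hashlib
from itertools import count


def h(key, seed, i, m):
    """Hypothetical seeded hash: the i-th candidate slot of `key` in [0, m)."""
    data = f"{seed}:{i}:{key}".encode()
    return int.from_bytes(hashlib.blake2b(data, digest_size=8).digest(), "big") % m


def num_choices(key, seed):
    """Illustrative irregularity: roughly half the keys get 2 candidates, the rest 4."""
    return 2 if h(key, seed, "class", 2) == 0 else 4


def build(keys, m, seed):
    """Return the chosen hash-function index per key, or None if placement fails."""
    cands = [[h(k, seed, i, m) for i in range(num_choices(k, seed))] for k in keys]
    owner = [None] * m  # owner[slot] = index of the key stored there

    def place(i, visited):
        # Augmenting-path search: take a free candidate slot, or move the
        # current occupant to one of its other candidate slots.
        for s in cands[i]:
            if s not in visited:
                visited.add(s)
                if owner[s] is None or place(owner[s], visited):
                    owner[s] = i
                    return True
        return False

    for i in range(len(keys)):
        if not place(i, set()):
            return None
    choice = [None] * len(keys)
    for s, i in enumerate(owner):
        if i is not None:
            choice[i] = cands[i].index(s)
    return choice


def build_with_retry(keys, m):
    """Retry with a new seed in the (ideally rare) case that placement fails."""
    for seed in count():
        choice = build(keys, m, seed)
        if choice is not None:
            return seed, choice


keys = [f"obj{i}" for i in range(20)]
m = 25  # table slightly larger than the key set, so this is a non-minimal PHF
seed, choice = build_with_retry(keys, m)
slots = [h(k, seed, choice[i], m) for i, k in enumerate(keys)]
```

In this sketch a key with 2 candidates needs 1 bit for its choice and a key with 4 candidates needs 2 bits, which is the kind of space trade-off that varying the number of hash functions per object allows.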
Bipartite ShockHash: Pruning ShockHash Search for Efficient Perfect Hashing
A minimal perfect hash function (MPHF) maps a set of n keys to the first n
integers without collisions. Representing this bijection needs at least
$\log_2(e) \approx 1.44$ bits per key, and there is a wide range of practical
implementations achieving about 2 bits per key. Minimal perfect hashing is a
key ingredient in many compact data structures such as updatable retrieval data
structures and approximate membership data structures.
A simple implementation reaching the space lower bound is to sample random
hash functions using brute-force, which needs about $e^n \approx 2.72^n$ tries
in expectation. ShockHash recently reduced that to about
$(e/2)^n \approx 1.36^n$ tries in expectation by sampling random graphs. With bipartite
ShockHash, we now sample random bipartite graphs. In this paper, we describe
the general algorithmic ideas of bipartite ShockHash and give an experimental
evaluation. The key insight is that we can try all combinations of two hash
functions, each mapping into one half of the output range. This reduces the
number of sampled hash functions to only about $(\sqrt{e/2})^n \approx 1.166^n$
in expectation. In itself, this does not reduce the asymptotic running time
much because all combinations still need to be tested. However, by filtering
the candidates before combining them, we can reduce this to less than $1.175^n$
combinations in expectation.
Our implementation of bipartite ShockHash is up to 3 orders of magnitude
faster than original ShockHash. Inside the RecSplit framework, bipartite
ShockHash-RS enables significantly larger base cases, leading to a construction
that is, depending on the allotted space budget, up to 20 times faster. In our
most extreme configuration, ShockHash-RS can build an MPHF for 10 million keys
with 1.489 bits per key (within 3.3% of the lower bound) in about half an hour,
pushing the limits of what is possible.
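The idea of combining two hash functions that each map into one half of the output range, and filtering candidates before combining them, can be illustrated with a small sketch. Everything named here is hypothetical: the hash `h`, the function names, and in particular the filter, which keeps a seed only if its hash hits every slot of a half. That condition is necessary (a slot that is never hit can never be filled), but it is an assumption chosen for illustration and not necessarily the filter used in the paper. The final orientation check uses augmenting paths rather than cuckoo insertion.

```python
import hashlib
from itertools import count


def h(key, seed, half):
    """Hypothetical seeded hash into [0, half)."""
    d = hashlib.blake2b(f"{seed}:{key}".encode(), digest_size=8).digest()
    return int.from_bytes(d, "big") % half


def orient(edges, n):
    """Assign each key to one endpoint of its edge so that all slots differ.
    Returns a list of 0/1 choices, or None if no such assignment exists."""
    owner = [None] * n

    def place(i, visited):
        for s in edges[i]:
            if s not in visited:
                visited.add(s)
                if owner[s] is None or place(owner[s], visited):
                    owner[s] = i
                    return True
        return False

    for i in range(len(edges)):
        if not place(i, set()):
            return None
    return [0 if owner[edges[i][0]] == i else 1 for i in range(len(edges))]


def bipartite_search(keys):
    """Return (seed for first half, seed for second half, choices, pairs tested)."""
    n = len(keys)
    assert n % 2 == 0, "this sketch assumes an even number of keys"
    half = n // 2
    candidates = []  # (seed, hash values) for seeds that passed the filter
    tested = 0
    for seed in count():
        vals = [h(k, seed, half) for k in keys]
        if len(set(vals)) < half:
            continue  # filter: some slot of the half is never hit
        for other, other_vals in candidates:
            tested += 1
            # Edge per key: one endpoint in the first half, one in the second.
            edges = [(a, b + half) for a, b in zip(other_vals, vals)]
            f = orient(edges, n)
            if f is not None:
                return other, seed, f, tested
        candidates.append((seed, vals))


keys = [f"key{i}" for i in range(8)]
s0, s1, f, tested = bipartite_search(keys)
half = len(keys) // 2
slots = [h(k, s0, half) if f[i] == 0 else h(k, s1, half) + half
         for i, k in enumerate(keys)]
```

Each hash function is sampled once and reused across many pairs, which is where the reduction in sampled hash functions comes from; the filter then limits how many pairs reach the more expensive orientation check.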
High Performance Construction of RecSplit Based Minimal Perfect Hash Functions
A minimal perfect hash function (MPHF) bijectively maps a set S of objects to the first |S| integers. It can be used as a building block in databases and data compression. RecSplit [Esposito et al., ALENEX'20] is currently the most space efficient practical minimal perfect hash function. It heavily relies on trying out hash functions in a brute force way.
We introduce rotation fitting, a new technique that makes the search more efficient by drastically reducing the number of tried hash functions. Additionally, we greatly improve the construction time of RecSplit by harnessing parallelism on the level of bits, vectors, cores, and GPUs.
In combination, the resulting improvements yield speedups of up to a factor of 239 on an 8-core CPU and up to a factor of 5438 using a GPU. The original single-threaded RecSplit implementation needs 1.5 hours to construct an MPHF for 5 million objects with 1.56 bits per object. On the GPU, we achieve the same space usage in just 5 seconds. Given that the speedups are larger than the increase in energy consumption, our implementation is more energy efficient than the original implementation.
Learned Monotone Minimal Perfect Hashing
A Monotone Minimal Perfect Hash Function (MMPHF) constructed on a set S of keys is a function that maps each key in S to its rank. On keys not in S, the function returns an arbitrary value. Applications range from databases and search engines to data encryption and pattern-matching algorithms.
In this paper, we describe LeMonHash, a new technique for constructing MMPHFs for integers. The core idea of LeMonHash is surprisingly simple and effective: we learn a monotone mapping from keys to their rank via an error-bounded piecewise linear model (the PGM-index), and then we solve the collisions that might arise among keys mapping to the same rank estimate by associating small integers with them in a retrieval data structure (BuRR). On synthetic random datasets, LeMonHash needs 34% less space than the next larger competitor, while achieving about 16 times faster queries. On real-world datasets, the space usage is very close to or much better than the best competitors, while achieving up to 19 times faster queries than the next larger competitor. As far as the construction of LeMonHash is concerned, we get an improvement by a factor of up to 2, compared to the competitor with the next best space usage.
We also investigate the case of keys being variable-length strings, introducing the so-called LeMonHash-VL: it needs space within 13% of the best competitors while achieving up to 3 times faster queries than the next larger competitor.
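The core idea described above (a monotone rank estimate plus small per-key integers that resolve collisions) can be sketched as follows. This is a simplified illustration under assumptions, not LeMonHash: a single linear segment stands in for the error-bounded piecewise linear model (PGM-index), a Python dict stands in for the retrieval data structure (BuRR), and all names are hypothetical.

```python
def build_mmphf(sorted_keys):
    """Build a rank function for distinct, sorted integer keys."""
    n = len(sorted_keys)
    lo = sorted_keys[0]
    span = sorted_keys[-1] - lo + 1

    def bucket_of(k):
        # Monotone rank estimate from one linear segment, clamped to [0, n - 1]
        # so that keys outside the original range do not cause an index error.
        return min(max((k - lo) * n // span, 0), n - 1)

    # Number of keys per bucket and the rank at which each bucket starts.
    counts = [0] * n
    for k in sorted_keys:
        counts[bucket_of(k)] += 1
    starts = [0] * n
    for b in range(1, n):
        starts[b] = starts[b - 1] + counts[b - 1]

    # Only keys that share a bucket need a stored offset (their local rank).
    offsets = {}
    for rank, k in enumerate(sorted_keys):
        b = bucket_of(k)
        if counts[b] > 1:
            offsets[k] = rank - starts[b]

    def rank_of(k):
        # For keys not in the set, the result is arbitrary, as the text allows.
        return starts[bucket_of(k)] + offsets.get(k, 0)

    return rank_of


keys = [3, 7, 8, 20, 21, 22, 100, 1000]
rank_of = build_mmphf(keys)
ranks = [rank_of(k) for k in keys]
```

On this deliberately skewed input, seven of the eight keys fall into the same bucket, so almost every key needs a stored offset. A model that follows the key distribution more closely leaves fewer and smaller collisions to resolve, which is the motivation for using a piecewise model instead of a single segment.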
Learned Monotone Minimal Perfect Hashing
A Monotone Minimal Perfect Hash Function (MMPHF) constructed on a set S of
keys is a function that maps each key in S to its rank. On keys not in S, the
function returns an arbitrary value. Applications range from databases and
search engines to data encryption and pattern-matching algorithms.
In this paper, we describe LeMonHash, a new technique for constructing MMPHFs
for integers. The core idea of LeMonHash is surprisingly simple and effective:
we learn a monotone mapping from keys to their rank via an error-bounded
piecewise linear model (the PGM-index), and then we solve the collisions that
might arise among keys mapping to the same rank estimate by associating small
integers with them in a retrieval data structure (BuRR). On synthetic random
datasets, LeMonHash needs 35% less space than the next best competitor, while
achieving about 16 times faster queries. On real-world datasets, the space
usage is very close to or much better than the best competitors, while
achieving up to 19 times faster queries than the next larger competitor. As far
as the construction of LeMonHash is concerned, we get an improvement by a
factor of up to 2, compared to the competitor with the next best space usage.
We also investigate the case of keys being variable-length strings,
introducing the so-called LeMonHash-VL: it needs space within 10% of the best
competitors while achieving up to 3 times faster queries.