Cache-Oblivious Peeling of Random Hypergraphs
The computation of a peeling order in a randomly generated hypergraph is the
most time-consuming step in a number of constructions, such as perfect hashing
schemes, random k-SAT solvers, error-correcting codes, and approximate set
encodings. While there exists a straightforward linear time algorithm, its poor
I/O performance makes it impractical for hypergraphs whose size exceeds the
available internal memory.
We show how to reduce the computation of a peeling order to a small number of
sequential scans and sorts, and analyze its I/O complexity in the
cache-oblivious model. The resulting algorithm requires O(sort(n))
I/Os and O(n log n) time to peel a random hypergraph with n edges.
We experimentally evaluate the performance of our implementation of this
algorithm in a real-world scenario by using the construction of minimal perfect
hash functions (MPHF) as our test case: our algorithm builds an MPHF of
7.6 billion keys in less than 21 hours on a single machine. The resulting data
structure is both more space-efficient and faster than that obtained with the
current state-of-the-art MPHF construction for large-scale key sets.
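The peeling step that this construction revolves around has a simple in-memory form: repeatedly remove a vertex of degree 1 together with its unique incident edge. A minimal Python sketch of this linear-time baseline (not the paper's cache-oblivious variant; all names are illustrative):

```python
from collections import defaultdict

def peel(edges):
    """Compute a peeling order of a hypergraph given as a list of
    vertex tuples: repeatedly remove a vertex of degree 1 together
    with its unique incident edge. Returns the list of peeled
    (vertex, edge id) pairs, or None if the hypergraph is not fully
    peelable."""
    incidence = defaultdict(set)          # vertex -> ids of incident edges
    for eid, edge in enumerate(edges):
        for v in edge:
            incidence[v].add(eid)
    stack = [v for v in incidence if len(incidence[v]) == 1]
    alive = set(range(len(edges)))
    order = []
    while stack:
        v = stack.pop()
        if len(incidence[v]) != 1:        # degree changed since pushed
            continue
        (eid,) = incidence[v]
        order.append((v, eid))
        alive.discard(eid)
        for u in edges[eid]:              # delete the edge everywhere
            incidence[u].discard(eid)
            if len(incidence[u]) == 1:
                stack.append(u)
    return order if not alive else None
```

For random 3-hypergraphs peeling succeeds with high probability once the number of vertices exceeds roughly 1.23 times the number of edges, which is why it underlies MWHC-style constructions; the poor locality of the incidence updates above is exactly what the paper's reduction to scans and sorts removes.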
Fast Scalable Construction of (Minimal Perfect Hash) Functions
Recent advances in random linear systems on finite fields have paved the way
for the construction of constant-time data structures representing static
functions and minimal perfect hash functions using less space than
existing techniques. The main obstruction to any practical application of
these results is the cubic-time Gaussian elimination required to solve these
linear systems: even though they can be made very small, the computation is
still too slow to be feasible.
In this paper we describe in detail a number of heuristics and programming
techniques to speed up the resolution of these systems by several orders of
magnitude, making the overall construction competitive with the standard and
widely used MWHC technique, which is based on hypergraph peeling. In
particular, we introduce broadword programming techniques for fast equation
manipulation and a lazy Gaussian elimination algorithm. We also describe a
number of technical improvements to the data structure which further reduce
space usage and improve lookup speed.
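As a flavour of the equation manipulation involved, here is a sketch of Gaussian elimination over GF(2) with each equation packed into an integer, so that adding two equations is a single XOR. This illustrates only the word-parallel idea; the paper's broadword techniques and lazy elimination are considerably more refined:

```python
def solve_gf2(rows, rhs):
    """Gaussian elimination over GF(2). rows[i] is a bitmask of the
    variables occurring in equation i, rhs[i] its right-hand bit;
    XORing two packed rows adds the equations. Returns one solution
    as a bitmask (free variables set to 0), or None if the system is
    inconsistent."""
    rows, rhs = list(rows), list(rhs)
    pivot_of = {}                        # pivot column -> equation index
    for i in range(len(rows)):
        # forward-eliminate known pivots, highest column first, so any
        # lower pivot bits introduced by a XOR are handled afterwards
        for col in sorted(pivot_of, reverse=True):
            if rows[i] >> col & 1:
                j = pivot_of[col]
                rows[i] ^= rows[j]
                rhs[i] ^= rhs[j]
        if rows[i] == 0:
            if rhs[i]:
                return None              # contradiction 0 = 1
            continue
        pivot_of[rows[i].bit_length() - 1] = i
    x = 0
    for col in sorted(pivot_of):         # back-substitute, low to high
        i = pivot_of[col]
        others = rows[i] & ~(1 << col)
        x |= (rhs[i] ^ (bin(others & x).count("1") & 1)) << col
    return x
```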
Our implementation of these techniques yields a minimal perfect hash function
data structure occupying 2.24 bits per element, compared to 2.68 for MWHC-based
ones, and a static function data structure which reduces the multiplicative
overhead from 1.23 to 1.03.
ShockHash: Towards Optimal-Space Minimal Perfect Hashing Beyond Brute-Force
A minimal perfect hash function (MPHF) maps a set S of n keys to the
first n integers without collisions. There is a lower bound of
n log2(e) ≈ 1.44n bits of space needed to represent an MPHF. A matching
upper bound is obtained using the brute-force algorithm that tries random hash
functions until stumbling on an MPHF and stores that function's seed. In
expectation, e^n ≈ 2.718^n seeds need to be tested. The most
space-efficient previous algorithms for constructing MPHFs all use such a
brute-force approach as a basic building block.
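The brute-force building block is easy to state in code: keep drawing seeded hash functions until one maps the keys bijectively onto 0..n-1, then remember the seed. A toy Python sketch (blake2b with the seed as salt is an illustrative stand-in, not the hash family used by these papers):

```python
import hashlib

def is_mphf(keys, seed):
    """True if the seeded hash maps the n keys (byte strings) onto
    0..n-1 without collisions, i.e. it is an MPHF on `keys`."""
    n = len(keys)
    slots = set()
    for key in keys:
        h = hashlib.blake2b(key, salt=seed.to_bytes(8, "little")).digest()
        slots.add(int.from_bytes(h, "little") % n)
    return len(slots) == n

def brute_force_mphf(keys, max_seeds=10**6):
    """Try seeds until one yields an MPHF. In expectation about e^n
    seeds are needed, so only tiny key sets are feasible; storing the
    winning seed costs log2(e^n) ~ 1.44n bits, matching the lower
    bound."""
    for seed in range(max_seeds):
        if is_mphf(keys, seed):
            return seed
    return None
```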
In this paper, we introduce ShockHash - Small, heavily overloaded cuckoo hash
tables. ShockHash uses two hash functions h0 and h1, hoping for the
existence of a function f such that x -> h_{f(x)}(x) is an MPHF on the input
set S. In graph terminology, ShockHash generates
n-edge random graphs until stumbling on a pseudoforest - a graph where each
component contains as many edges as nodes. Using cuckoo hashing, ShockHash then
derives an MPHF from the pseudoforest in linear time. It uses a 1-bit retrieval
data structure to store f using n + o(n) bits.
By carefully analyzing the probability that a random graph is a pseudoforest,
we show that ShockHash needs to try only (e/2)^n ≈ 1.359^n hash
function seeds in expectation, reducing the space for storing the seed by
roughly n bits. This makes ShockHash almost a factor 2^n faster than
brute-force, while maintaining the asymptotically optimal space consumption. An
implementation within the RecSplit framework yields the currently most
space-efficient MPHFs, i.e., competing approaches need about two orders of
magnitude more work to achieve the same space consumption.
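The pseudoforest condition itself is cheap to test: build the graph whose edges are the pairs of candidate positions of each key and check, for instance with union-find, that no connected component has more edges than nodes; with exactly n edges on n nodes this forces every component to contain exactly one cycle. A small illustrative Python check (not taken from the ShockHash code):

```python
from collections import defaultdict

class DSU:
    """Union-find with path halving."""
    def __init__(self, n):
        self.parent = list(range(n))
    def find(self, x):
        while self.parent[x] != x:
            self.parent[x] = self.parent[self.parent[x]]
            x = self.parent[x]
        return x
    def union(self, a, b):
        self.parent[self.find(a)] = self.find(b)

def is_pseudoforest(n, edges):
    """True iff no component of the graph on n nodes has more edges
    than nodes; when len(edges) == n this is exactly the condition
    under which cuckoo hashing can place every key."""
    dsu = DSU(n)
    for a, b in edges:
        dsu.union(a, b)
    node_cnt, edge_cnt = defaultdict(int), defaultdict(int)
    for v in range(n):
        node_cnt[dsu.find(v)] += 1
    for a, _ in edges:
        edge_cnt[dsu.find(a)] += 1
    return all(edge_cnt[r] <= node_cnt[r] for r in node_cnt)
```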
SicHash -- Small Irregular Cuckoo Tables for Perfect Hashing
A Perfect Hash Function (PHF) is a hash function that has no collisions on a
given input set. PHFs can be used for space efficient storage of data in an
array, or for determining a compact representative of each object in the set.
In this paper, we present the PHF construction algorithm SicHash - Small
Irregular Cuckoo Tables for Perfect Hashing. At its core, SicHash uses a known
technique: It places objects in a cuckoo hash table and then stores the final
hash function choice of each object in a retrieval data structure. We combine
the idea with irregular cuckoo hashing, where each object has a different
number of hash functions. Additionally, we use many small tables that we
overload beyond their asymptotic maximum load factor. The most space-efficient
competitors often use brute-force methods to determine the PHFs. SicHash
provides a more direct construction algorithm that only rarely needs to
recompute parts. Our implementation improves the state of the art in terms of
space usage versus construction time for a wide range of configurations. At the
same time, it provides very fast queries.
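The core mechanism, cuckoo placement followed by storing each object's final hash function choice, can be sketched as follows. A plain dict stands in for the compact retrieval data structure, and the regular two-choice scheme omits SicHash's irregular choices and overloaded small tables; all names are illustrative:

```python
def build_phf(keys, table_size, num_choices=2, max_kicks=500):
    """Cuckoo-style PHF construction sketch: every key has num_choices
    candidate slots; colliding keys evict each other. The surviving
    choice index per key is what a retrieval data structure would
    store compactly; here a plain dict stands in for it."""
    def slot(key, i):                      # i-th candidate slot of key
        return hash((key, i)) % table_size
    table, choice = {}, {}
    for key in keys:
        cur, i = key, 0
        for _ in range(max_kicks):
            s = slot(cur, i)
            occupant = table.get(s)
            table[s] = (cur, i)
            choice[cur] = i
            if occupant is None:
                break
            # evicted key retries with its next hash function choice
            cur, i = occupant[0], (occupant[1] + 1) % num_choices
        else:
            return None                    # give up; retry with new seeds
    return choice, slot

def lookup(key, choice, slot):
    """Collision-free position of key: evaluate its stored choice."""
    return slot(key, choice[key])
```

For a minimal PHF the table size would equal the number of keys; making that load factor feasible is where SicHash's irregular number of choices per object and its many independently retried small tables come in.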