GPERF: a perfect hash function generator
gperf is a widely available perfect hash function generator written in C++. It automates a common system software operation: keyword recognition. gperf translates an n-element user-specified keyword list (the keyfile) into source code containing a k-element lookup table and a pair of functions, phash and in_word_set. phash uniquely maps keywords in the keyfile onto the range 0..k-1, where k >= n. If k = n, then phash is considered a minimal perfect hash function. in_word_set uses phash to determine whether a particular string of characters str occurs in the keyfile, using at most one string comparison. This paper describes the user interface, options, features, algorithm design, and implementation strategies incorporated in gperf. It also presents the results from an empirical comparison between gperf-generated recognizers and other popular techniques for reserved word lookup.
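The following is a hand-written sketch (not actual gperf output) of the kind of recognizer the abstract describes: a k-element lookup table plus the pair of functions phash and in_word_set, where membership is decided with at most one string comparison. The keyword set, the namespace, and the association table below are invented for illustration.

    #include <array>
    #include <cstring>

    namespace demo {                   // hypothetical namespace, not part of gperf
    constexpr unsigned K = 4;          // here k = n = 4, so phash is minimal for this set
    const char* const wordlist[K] = { "if", "else", "while", "return" };

    // Association table: 'i'->0, 'e'->1, 'w'->2, 'r'->3, everything else -> K (out of range).
    // For this tiny keyword set, the first character alone already separates all keywords.
    inline const std::array<unsigned char, 256>& assoTable() {
        static const std::array<unsigned char, 256> asso = [] {
            std::array<unsigned char, 256> a{};
            a.fill(K);
            a[(unsigned char)'i'] = 0; a[(unsigned char)'e'] = 1;
            a[(unsigned char)'w'] = 2; a[(unsigned char)'r'] = 3;
            return a;
        }();
        return asso;
    }

    // phash: maps every keyword onto 0..k-1; non-keywords may land anywhere.
    inline unsigned phash(const char* str) { return assoTable()[(unsigned char)str[0]]; }

    // in_word_set: one table lookup and at most one string comparison, as in the abstract.
    inline const char* in_word_set(const char* str) {
        unsigned h = phash(str);
        if (h < K && std::strcmp(str, wordlist[h]) == 0) return wordlist[h];
        return nullptr;
    }
    } // namespace demo

Real gperf output computes the hash from the key length plus the association values of a few selected character positions, which lets it separate far larger keyword sets than a single character can.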
ShockHash: Towards Optimal-Space Minimal Perfect Hashing Beyond Brute-Force
A minimal perfect hash function (MPHF) maps a set S of n keys to the first n integers without collisions. There is a lower bound of n log_2(e) ≈ 1.44n bits of space needed to represent an MPHF. A matching upper bound is obtained using the brute-force algorithm that tries random hash functions until stumbling on an MPHF and stores that function's seed. In expectation, roughly e^n seeds need to be tested. The most space-efficient previous algorithms for constructing MPHFs all use such a brute-force approach as a basic building block.
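A minimal sketch of this brute-force baseline, under the assumption that a seeded string hash is available; the hash construction and all names below are placeholders, not any particular library's API.

    #include <cstdint>
    #include <functional>
    #include <string>
    #include <vector>

    // Toy seeded hash: a real implementation would use a proper seeded hash family.
    static uint64_t hashWithSeed(const std::string& key, uint64_t seed) {
        return std::hash<std::string>{}(key + '#' + std::to_string(seed));
    }

    // Try seeds until x -> hashWithSeed(x, seed) % n is a bijection on the key set;
    // only the returned seed needs to be stored. Expected trials are about e^n,
    // which is why this is only feasible for very small key sets.
    uint64_t findMphfSeed(const std::vector<std::string>& keys) {
        const uint64_t n = keys.size();
        for (uint64_t seed = 0;; ++seed) {
            std::vector<bool> used(n, false);
            bool injective = true;
            for (const auto& k : keys) {
                uint64_t slot = hashWithSeed(k, seed) % n;
                if (used[slot]) { injective = false; break; }
                used[slot] = true;
            }
            if (injective) return seed;   // all n keys occupy distinct slots 0..n-1
        }
    }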
In this paper, we introduce ShockHash - Small, heavily overloaded cuckoo hash tables. ShockHash uses two hash functions h_0 and h_1, hoping for the existence of a function f such that x -> h_{f(x)}(x) is an MPHF on the key set S. In graph terminology, ShockHash generates n-edge random graphs until stumbling on a pseudoforest - a graph where each component contains as many edges as nodes. Using cuckoo hashing, ShockHash then derives an MPHF from the pseudoforest in linear time. It uses a 1-bit retrieval data structure to store f using n + o(n) bits.
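A sketch of the seed test this paragraph describes, under the following assumptions: each key contributes one edge {h_0(x), h_1(x)} over n nodes, and a seed is accepted iff the resulting n-edge graph is a pseudoforest (no component with more edges than nodes), checked with a union-find structure. The cuckoo placement and the 1-bit retrieval structure for f are omitted, and all names are illustrative.

    #include <cstdint>
    #include <functional>
    #include <numeric>
    #include <string>
    #include <vector>

    // Placeholder for the two seeded hash functions h_0 and h_1.
    static uint64_t seededHash(const std::string& key, uint64_t seed, int which) {
        return std::hash<std::string>{}(key + char('0' + which) + '#' + std::to_string(seed));
    }

    // Union-find that tracks, per component, the number of nodes and edges.
    struct UnionFind {
        std::vector<uint64_t> parent, nodes, edges;
        explicit UnionFind(uint64_t n) : parent(n), nodes(n, 1), edges(n, 0) {
            std::iota(parent.begin(), parent.end(), 0);
        }
        uint64_t find(uint64_t x) {
            while (parent[x] != x) x = parent[x] = parent[parent[x]];
            return x;
        }
        // Adds edge {a, b}; returns false once some component has more edges than nodes.
        bool addEdge(uint64_t a, uint64_t b) {
            uint64_t ra = find(a), rb = find(b);
            if (ra != rb) {                       // merge the two components
                parent[rb] = ra;
                nodes[ra] += nodes[rb];
                edges[ra] += edges[rb];
            }
            edges[ra] += 1;
            return edges[ra] <= nodes[ra];        // pseudoforest condition
        }
    };

    // True iff this seed induces a pseudoforest, i.e. a cuckoo placement (and thus
    // an MPHF x -> h_{f(x)}(x)) exists for the key set.
    bool seedGivesPseudoforest(const std::vector<std::string>& keys, uint64_t seed) {
        const uint64_t n = keys.size();
        UnionFind uf(n);
        for (const auto& k : keys) {
            uint64_t u = seededHash(k, seed, 0) % n;
            uint64_t v = seededHash(k, seed, 1) % n;
            if (!uf.addEdge(u, v)) return false;
        }
        return true;
    }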
By carefully analyzing the probability that a random graph is a pseudoforest, we show that ShockHash needs to try only about (e/2)^n ≈ 1.36^n hash function seeds in expectation, reducing the space for storing the seed by roughly n bits. This makes ShockHash almost a factor 2^n faster than brute force, while maintaining the asymptotically optimal space consumption. An implementation within the RecSplit framework yields the currently most space-efficient MPHFs, i.e., competing approaches need about two orders of magnitude more work to achieve the same space.
Cache-Oblivious Peeling of Random Hypergraphs
The computation of a peeling order in a randomly generated hypergraph is the most time-consuming step in a number of constructions, such as perfect hashing schemes, random k-SAT solvers, error-correcting codes, and approximate set encodings. While there exists a straightforward linear-time algorithm, its poor I/O performance makes it impractical for hypergraphs whose size exceeds the available internal memory.
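For reference, a minimal sketch of that straightforward in-memory algorithm for 3-uniform hypergraphs, with an illustrative adjacency-list representation: repeatedly pick a vertex of degree 1 and remove its unique remaining edge; the order of removed edges is the peeling order. The scattered accesses to the incidence lists are exactly the I/O problem the abstract refers to.

    #include <array>
    #include <cstdint>
    #include <vector>

    using Edge = std::array<uint32_t, 3>;   // a 3-uniform hyperedge: three vertex ids

    // Returns the ids of the peeled edges in peeling order; the hypergraph is
    // completely peelable iff the result contains every edge.
    std::vector<uint32_t> peelingOrder(uint32_t numVertices, const std::vector<Edge>& edges) {
        std::vector<std::vector<uint32_t>> incident(numVertices);  // vertex -> incident edge ids
        std::vector<uint32_t> degree(numVertices, 0);
        for (uint32_t e = 0; e < edges.size(); ++e)
            for (uint32_t v : edges[e]) { incident[v].push_back(e); ++degree[v]; }

        std::vector<bool> removed(edges.size(), false);
        std::vector<uint32_t> stack, order;
        for (uint32_t v = 0; v < numVertices; ++v)
            if (degree[v] == 1) stack.push_back(v);

        while (!stack.empty()) {
            uint32_t v = stack.back(); stack.pop_back();
            if (degree[v] != 1) continue;              // may have dropped to 0 in the meantime
            uint32_t e = 0;
            for (uint32_t cand : incident[v])          // locate the unique remaining edge at v
                if (!removed[cand]) { e = cand; break; }
            removed[e] = true;
            order.push_back(e);
            for (uint32_t u : edges[e]) {              // removing e lowers its vertices' degrees
                --degree[u];
                if (degree[u] == 1) stack.push_back(u);
            }
        }
        return order;
    }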
We show how to reduce the computation of a peeling order to a small number of sequential scans and sorts, and analyze its I/O complexity in the cache-oblivious model. The resulting algorithm requires O(sort(n)) I/Os and O(n log n) time to peel a random hypergraph with n edges.
We experimentally evaluate the performance of our implementation of this algorithm in a real-world scenario by using the construction of minimal perfect hash functions (MPHF) as our test case: our algorithm builds an MPHF for billions of keys in a matter of hours on a single machine. The resulting data structure is both more space-efficient and faster than that obtained with the current state-of-the-art MPHF construction for large-scale key sets.
Fast Scalable Construction of (Minimal Perfect Hash) Functions
Recent advances in random linear systems on finite fields have paved the way for the construction of constant-time data structures representing static functions and minimal perfect hash functions using less space than existing techniques. The main obstruction to any practical application of these results is the cubic-time Gaussian elimination required to solve these linear systems: even though they can be made very small, the computation is still too slow to be feasible.
In this paper we describe in detail a number of heuristics and programming
techniques to speed up the resolution of these systems by several orders of
magnitude, making the overall construction competitive with the standard and
widely used MWHC technique, which is based on hypergraph peeling. In
particular, we introduce broadword programming techniques for fast equation
manipulation and a lazy Gaussian elimination algorithm. We also describe a
number of technical improvements to the data structure which further reduce
space usage and improve lookup speed.
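A minimal sketch of the broadword idea mentioned above, under illustrative assumptions about the system layout: each GF(2) equation is packed into 64-bit words, so adding one equation to another is a short run of XORs rather than a per-coefficient loop. Plain forward elimination is shown, not the paper's lazy variant.

    #include <cstdint>
    #include <utility>
    #include <vector>

    struct Gf2System {
        int numVars;
        std::vector<std::vector<uint64_t>> rows;  // rows[i]: packed coefficients of equation i
        std::vector<uint8_t> rhs;                 // right-hand side bit of each equation

        bool coeff(int r, int v) const { return (rows[r][v >> 6] >> (v & 63)) & 1; }

        // The broadword row operation: XOR equation `src` into equation `dst`.
        void addRow(int dst, int src) {
            for (size_t w = 0; w < rows[dst].size(); ++w) rows[dst][w] ^= rows[src][w];
            rhs[dst] ^= rhs[src];
        }

        // Forward elimination over GF(2); returns false if the system is inconsistent.
        // Back-substitution to read off a solution is omitted for brevity.
        bool eliminate() {
            int row = 0;
            for (int v = 0; v < numVars && row < (int)rows.size(); ++v) {
                int pivot = -1;
                for (int r = row; r < (int)rows.size(); ++r)
                    if (coeff(r, v)) { pivot = r; break; }
                if (pivot < 0) continue;               // no equation uses this variable here
                std::swap(rows[row], rows[pivot]);
                std::swap(rhs[row], rhs[pivot]);
                for (int r = row + 1; r < (int)rows.size(); ++r)
                    if (coeff(r, v)) addRow(r, row);   // clear the column below the pivot
                ++row;
            }
            for (int r = row; r < (int)rows.size(); ++r) {   // a zero row with rhs 1 means 0 = 1
                bool zero = true;
                for (uint64_t w : rows[r]) if (w) { zero = false; break; }
                if (zero && rhs[r]) return false;
            }
            return true;
        }
    };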
Our implementation of these techniques yields a minimal perfect hash function data structure occupying 2.24 bits per element, compared to 2.68 for MWHC-based ones, and a static function data structure which reduces the multiplicative overhead from 1.23 to 1.03.
SicHash -- Small Irregular Cuckoo Tables for Perfect Hashing
A Perfect Hash Function (PHF) is a hash function that has no collisions on a
given input set. PHFs can be used for space efficient storage of data in an
array, or for determining a compact representative of each object in the set.
In this paper, we present the PHF construction algorithm SicHash - Small
Irregular Cuckoo Tables for Perfect Hashing. At its core, SicHash uses a known
technique: It places objects in a cuckoo hash table and then stores the final
hash function choice of each object in a retrieval data structure. We combine
the idea with irregular cuckoo hashing, where each object has a different
number of hash functions. Additionally, we use many small tables that we
overload beyond their asymptotic maximum load factor. The most space-efficient competitors often use brute-force methods to determine the PHFs. SicHash provides a more direct construction algorithm that only rarely needs to recompute parts. Our implementation improves the state of the art in terms of space usage versus construction time for a wide range of configurations. At the same time, it provides very fast queries.
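A sketch of the query path for this kind of PHF, under the assumption that construction has already fixed, via cuckoo hashing, which candidate hash function each key ends up using. Here a std::unordered_map stands in for the succinct retrieval data structure and the seeded hash is a placeholder; neither is SicHash's actual implementation.

    #include <cstdint>
    #include <functional>
    #include <string>
    #include <unordered_map>

    struct ToyPerfectHash {
        uint64_t tableSize;                                // m slots, m >= number of keys
        std::unordered_map<std::string, uint8_t> choice;   // key -> index of its chosen hash fn

        // Candidate hash function number `which` for this key.
        uint64_t candidate(const std::string& key, uint8_t which) const {
            return std::hash<std::string>{}(key + char('A' + which)) % tableSize;
        }

        // Collision-free slot of a key from the construction set: evaluate the
        // one hash function that the cuckoo placement selected for it.
        uint64_t operator()(const std::string& key) const {
            return candidate(key, choice.at(key));
        }
    };

A real retrieval data structure stores only the few bits of each key's choice, without storing the keys themselves, which is where the space savings come from.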
Fast Algorithms for Parameterized Problems with Relaxed Disjointness Constraints
In parameterized complexity, it is a natural idea to consider different generalizations of classic problems. Usually, such generalizations are obtained by introducing a "relaxation" variable, where the original problem corresponds to setting this variable to a constant value. For instance, the problem of packing sets of bounded size into a given universe generalizes the Maximum Matching problem, which is recovered by setting the size bound to 2. Most often, the complexity of the problem increases with the relaxation variable, but very recently Abasi et al. have given a surprising example of a problem --- r-Simple k-Path --- that can be solved by a randomized algorithm whose running time decreases as the relaxation parameter r grows. In this paper we pursue further the direction sketched by Abasi et al. Our main contribution is a derandomization tool that provides a deterministic counterpart of the main technical result of Abasi et al.: the algorithm for (r, k)-Monomial Detection, which is the problem of finding a monomial of total degree k and individual degrees at most r in a polynomial given as an arithmetic circuit. Our technique works for a large class of circuits, and in particular it can be used to derandomize the result of Abasi et al. for r-Simple k-Path. On our way to this result we introduce the notion of representative sets for multisets, which may be of independent interest. Finally, we give two more examples of problems that were already studied in the literature, where the same relaxation phenomenon happens. The first one is a natural relaxation of the Set Packing problem, where we allow the packed sets to overlap at each element at most r times. The second one is Degree Bounded Spanning Tree, where we seek a spanning tree of the graph with small maximum degree.
Fast and Scalable Minimal Perfect Hashing for Massive Key Sets
Minimal perfect hash functions provide space-efficient and collision-free hashing on static sets. Existing algorithms and implementations that build such functions have practical limitations on the number of input elements they can process, due to high construction time, RAM or external memory usage. We revisit a simple algorithm and show that it is highly competitive with the state of the art, especially in terms of construction time and memory usage. We provide a parallel C++ implementation called BBhash. It is capable of creating a minimal perfect hash function of 10^{10} elements in less than 7 minutes using 8 threads and 5 GB of memory, and the resulting function uses 3.7 bits/element. To the best of our knowledge, this is also the first implementation that has been successfully tested on an input of cardinality 10^{12}.
Source code: https://github.com/rizkg/BBHas
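The abstract calls the underlying method "a simple algorithm" without spelling it out; the sketch below shows one simple multi-level construction in that spirit (not necessarily the exact algorithm behind BBhash): at each level, hash the remaining keys into a bit array, let keys that land alone keep their slot, and send colliding keys to the next level. The hash, the gamma parameter, and the naive rank computation are illustrative placeholders.

    #include <algorithm>
    #include <cstdint>
    #include <functional>
    #include <string>
    #include <vector>

    struct ToyMultiLevelMphf {
        std::vector<std::vector<bool>> levels;   // levels[l][i] == true: slot i of level l is owned by a key

        static uint64_t levelHash(const std::string& key, uint32_t level, uint64_t size) {
            return std::hash<std::string>{}(std::to_string(level) + '/' + key) % size;
        }

        // Keys are assumed distinct; gamma controls how oversized each level's bit array is.
        void build(std::vector<std::string> keys, double gamma = 2.0) {
            for (uint32_t level = 0; !keys.empty(); ++level) {
                uint64_t size = std::max<uint64_t>(1, (uint64_t)(gamma * keys.size()));
                std::vector<uint8_t> hits(size, 0);
                for (const auto& k : keys) {
                    uint64_t slot = levelHash(k, level, size);
                    hits[slot] = hits[slot] == 0 ? 1 : 2;              // 2 marks a collision
                }
                levels.emplace_back(size, false);
                std::vector<std::string> next;
                for (auto& k : keys) {
                    uint64_t slot = levelHash(k, level, size);
                    if (hits[slot] == 1) levels[level][slot] = true;   // k keeps this slot
                    else next.push_back(std::move(k));                 // retry at the next level
                }
                keys = std::move(next);
            }
        }

        // Index of a construction key: the rank of its set bit across all levels.
        // A real implementation would use a constant-time rank structure instead of these loops.
        uint64_t operator()(const std::string& key) const {
            uint64_t rank = 0;
            for (uint32_t level = 0; level < levels.size(); ++level) {
                uint64_t slot = levelHash(key, level, levels[level].size());
                if (levels[level][slot]) {
                    for (uint64_t i = 0; i < slot; ++i) rank += levels[level][i];
                    return rank;
                }
                for (bool b : levels[level]) rank += b;   // skip all slots taken at this level
            }
            return rank;   // not reached for keys that were part of the construction set
        }
    };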
Parallel Searching for a First Solution
A parallel algorithm for conducting a search for a first solution to the problem of generating minimal perfect hash functions is presented. A message-based distributed-memory computer is assumed as a model for parallel computations. A data structure, called the reverse trie (r-trie), was devised to carry out the search. The algorithm was implemented on a transputer network. The experiments showed that the algorithm exhibits a consistent and almost linear speed-up. The r-trie structure proved to be highly memory-efficient.