4,412 research outputs found

    ShockHash: Towards Optimal-Space Minimal Perfect Hashing Beyond Brute-Force

    Full text link
    A minimal perfect hash function (MPHF) maps a set $S$ of $n$ keys to the first $n$ integers without collisions. There is a lower bound of $n\log_2 e - O(\log n)$ bits of space needed to represent an MPHF. A matching upper bound is obtained using the brute-force algorithm that tries random hash functions until stumbling on an MPHF and stores that function's seed. In expectation, $e^n\textrm{poly}(n)$ seeds need to be tested. The most space-efficient previous algorithms for constructing MPHFs all use such a brute-force approach as a basic building block. In this paper, we introduce ShockHash - Small, heavily overloaded cuckoo hash tables. ShockHash uses two hash functions $h_0$ and $h_1$, hoping for the existence of a function $f : S \rightarrow \{0,1\}$ such that $x \mapsto h_{f(x)}(x)$ is an MPHF on $S$. In graph terminology, ShockHash generates $n$-edge random graphs until stumbling on a pseudoforest - a graph where each component contains as many edges as nodes. Using cuckoo hashing, ShockHash then derives an MPHF from the pseudoforest in linear time. It uses a 1-bit retrieval data structure to store $f$ using $n + o(n)$ bits. By carefully analyzing the probability that a random graph is a pseudoforest, we show that ShockHash needs to try only $(e/2)^n\textrm{poly}(n)$ hash function seeds in expectation, reducing the space for storing the seed by roughly $n$ bits. This makes ShockHash almost a factor $2^n$ faster than brute-force, while maintaining the asymptotically optimal space consumption. An implementation within the RecSplit framework yields the currently most space efficient MPHFs, i.e., competing approaches need about two orders of magnitude more work to achieve the same space.
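    To make the seed search concrete, here is a minimal Python sketch (the hash function h is a hypothetical stand-in for the seeded hash functions; the real implementation then derives the MPHF by cuckoo hashing and stores $f$ in a 1-bit retrieval structure rather than just returning the seed). A seed is accepted exactly when no component of the $n$-node, $n$-edge graph accumulates more edges than nodes, which here, since the totals are equal, means every component has exactly as many edges as nodes:

```python
import hashlib

def h(seed, which, key, n):
    # Hypothetical seeded hash mapping a key to a node in [0, n).
    data = f"{seed}:{which}:{key}".encode()
    return int.from_bytes(hashlib.blake2b(data, digest_size=8).digest(), "big") % n

def is_pseudoforest(n, edges):
    # Union-find over n nodes, tracking node and edge counts per
    # component; a pseudoforest has edges <= nodes in every component.
    parent = list(range(n))
    nodes = [1] * n
    nedges = [0] * n

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for u, v in edges:
        ru, rv = find(u), find(v)
        if ru == rv:
            nedges[ru] += 1              # edge closes a cycle
        else:
            parent[rv] = ru              # merge components
            nodes[ru] += nodes[rv]
            nedges[ru] += nedges[rv] + 1
        if nedges[ru] > nodes[ru]:       # second cycle in a component: reject
            return False
    return True

def find_seed(keys):
    # Brute-force over seeds; the paper shows (e/2)^n poly(n) expected trials.
    n = len(keys)
    seed = 0
    while True:
        edges = [(h(seed, 0, k, n), h(seed, 1, k, n)) for k in keys]
        if is_pseudoforest(n, edges):
            return seed
        seed += 1
```

    Each trial costs near-linear time, so the expected number of trials dominates the construction cost.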

    Cache-Oblivious Peeling of Random Hypergraphs

    Full text link
    The computation of a peeling order in a randomly generated hypergraph is the most time-consuming step in a number of constructions, such as perfect hashing schemes, random $r$-SAT solvers, error-correcting codes, and approximate set encodings. While there exists a straightforward linear-time algorithm, its poor I/O performance makes it impractical for hypergraphs whose size exceeds the available internal memory. We show how to reduce the computation of a peeling order to a small number of sequential scans and sorts, and analyze its I/O complexity in the cache-oblivious model. The resulting algorithm requires $O(\mathrm{sort}(n))$ I/Os and $O(n \log n)$ time to peel a random hypergraph with $n$ edges. We experimentally evaluate the performance of our implementation of this algorithm in a real-world scenario by using the construction of minimal perfect hash functions (MPHF) as our test case: our algorithm builds an MPHF of 7.6 billion keys in less than 21 hours on a single machine. The resulting data structure is both more space-efficient and faster than that obtained with the current state-of-the-art MPHF construction for large-scale key sets.
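    For context, the straightforward in-memory algorithm the abstract refers to looks roughly like the Python sketch below (not the paper's cache-oblivious version): repeatedly pick a vertex of degree 1 and peel the unique edge incident to it. The random accesses into the incidence lists are what destroy I/O locality once the hypergraph no longer fits in memory:

```python
from collections import defaultdict, deque

def peel(edges):
    # edges: list of vertex tuples. Returns the peeling order as
    # (edge index, peeled vertex) pairs, or None if a 2-core remains.
    incident = defaultdict(set)        # vertex -> indices of live edges
    for i, e in enumerate(edges):
        for v in e:
            incident[v].add(i)
    queue = deque(v for v, s in incident.items() if len(s) == 1)
    order = []
    while queue:
        v = queue.popleft()
        if len(incident[v]) != 1:      # degree changed since enqueued
            continue
        (i,) = incident[v]             # the unique edge still touching v
        order.append((i, v))
        for u in edges[i]:             # delete edge i everywhere
            incident[u].discard(i)
            if len(incident[u]) == 1:
                queue.append(u)
    return order if len(order) == len(edges) else None
```

    Constructions such as MWHC-style perfect hashing then process this order in reverse, fixing each edge's value at the vertex it was peeled with.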

    Fast Scalable Construction of (Minimal Perfect Hash) Functions

    Full text link
    Recent advances in random linear systems over finite fields have paved the way for the construction of constant-time data structures representing static functions and minimal perfect hash functions using less space than existing techniques. The main obstruction to any practical application of these results is the cubic-time Gaussian elimination required to solve these linear systems: although the systems can be made very small, the computation is still too slow to be feasible. In this paper we describe in detail a number of heuristics and programming techniques to speed up the resolution of these systems by several orders of magnitude, making the overall construction competitive with the standard and widely used MWHC technique, which is based on hypergraph peeling. In particular, we introduce broadword programming techniques for fast equation manipulation and a lazy Gaussian elimination algorithm. We also describe a number of technical improvements to the data structure which further reduce space usage and improve lookup speed. Our implementation of these techniques yields a minimal perfect hash function data structure occupying 2.24 bits per element, compared to 2.68 for MWHC-based ones, and a static function data structure which reduces the multiplicative overhead from 1.23 to 1.03.
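    The following Python sketch shows the word-parallel flavor of this approach on a toy scale: each GF(2) equation is packed into one integer, so eliminating a variable XORs every coefficient of a row in a single operation. This illustrates plain Gaussian elimination with broadword row operations, not the paper's lazy elimination strategy or its actual data layout:

```python
def solve_gf2(rows, rhs):
    # rows[i] is an integer bitmask of the coefficients of equation i
    # over GF(2); rhs[i] is its right-hand side bit. XORing two rows
    # eliminates a variable across the whole equation at once.
    pivots = {}  # leading column -> (row mask, rhs bit)
    for row, b in zip(rows, rhs):
        while row:
            c = row.bit_length() - 1       # leading variable of this row
            if c not in pivots:
                pivots[c] = (row, b)
                break
            prow, pb = pivots[c]
            row ^= prow                    # one XOR, many coefficients
            b ^= pb
        else:
            if b:
                return None                # 0 = 1: inconsistent system
    # Back-substitute from the lowest pivot column upward;
    # free variables default to 0.
    x = 0
    for c in sorted(pivots):
        row, b = pivots[c]
        below = x & row & ((1 << c) - 1)   # already-fixed variables in this row
        if b ^ (bin(below).count("1") & 1):
            x |= 1 << c
    return x
```

    For example, solve_gf2([0b11, 0b10], [1, 1]) returns 0b10, i.e., $x_1 = 1$, $x_0 = 0$.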

    SicHash -- Small Irregular Cuckoo Tables for Perfect Hashing

    Get PDF
    A Perfect Hash Function (PHF) is a hash function that has no collisions on a given input set. PHFs can be used for space-efficient storage of data in an array, or for determining a compact representative of each object in the set. In this paper, we present the PHF construction algorithm SicHash - Small Irregular Cuckoo Tables for Perfect Hashing. At its core, SicHash uses a known technique: It places objects in a cuckoo hash table and then stores the final hash function choice of each object in a retrieval data structure. We combine the idea with irregular cuckoo hashing, where each object has a different number of hash functions. Additionally, we use many small tables that we overload beyond their asymptotic maximum load factor. The most space-efficient competitors often use brute-force methods to determine the PHFs. SicHash provides a more direct construction algorithm that only rarely needs to recompute parts. Our implementation improves the state of the art in terms of space usage versus construction time for a wide range of configurations. At the same time, it provides very fast queries.
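    A rough Python sketch of the core placement step follows, with a hypothetical hash function hc, a simplified eviction rule, and a plain dict standing in for the compact retrieval structure. Each key has its own number of hash-function choices (the irregular part), keys are placed by cuckoo-style eviction, and the index of the choice that finally succeeded is what a query later retrieves:

```python
import hashlib

def hc(key, j, m):
    # Hypothetical hash: the j-th candidate slot for `key` in a table of size m.
    d = hashlib.blake2b(f"{key}:{j}".encode(), digest_size=8).digest()
    return int.from_bytes(d, "big") % m

def cuckoo_place(keys, num_choices, m, max_kicks=100_000):
    # num_choices[key] is how many hash functions that key may use.
    # Returns {key: winning choice j} on success, None on failure.
    table = [None] * m  # slot -> (key, j) currently occupying it
    for key in keys:
        cur, j = key, 0
        for _ in range(max_kicks):
            slot = hc(cur, j, m)
            if table[slot] is None:
                table[slot] = (cur, j)
                break
            # Evict the current occupant and send it to its next choice.
            table[slot], (cur, j) = (cur, j), table[slot]
            j = (j + 1) % num_choices[cur]
        else:
            return None  # eviction chain too long; rebuild with new seeds
    return {key: j for (key, j) in (e for e in table if e is not None)}
```

    A query then evaluates hc(key, j, m) with j read back from the retrieval structure; only the choice indices are stored, not the keys. SicHash additionally splits the input across many small, deliberately overloaded tables, which this sketch omits.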

    Fast Algorithms for Parameterized Problems with Relaxed Disjointness Constraints

    Full text link
    In parameterized complexity, it is a natural idea to consider different generalizations of classic problems. Usually, such generalizations are obtained by introducing a "relaxation" variable, where the original problem corresponds to setting this variable to a constant value. For instance, the problem of packing sets of size at most $p$ into a given universe generalizes the Maximum Matching problem, which is recovered by taking $p = 2$. Most often, the complexity of the problem increases with the relaxation variable, but very recently Abasi et al. have given a surprising example of a problem, $r$-Simple $k$-Path, that can be solved by a randomized algorithm with running time $O^*(2^{O(k \frac{\log r}{r})})$. That is, the complexity of the problem decreases with $r$. In this paper we pursue further the direction sketched by Abasi et al. Our main contribution is a derandomization tool that provides a deterministic counterpart of the main technical result of Abasi et al.: the $O^*(2^{O(k \frac{\log r}{r})})$ algorithm for $(r,k)$-Monomial Detection, which is the problem of finding a monomial of total degree $k$ and individual degrees at most $r$ in a polynomial given as an arithmetic circuit. Our technique works for a large class of circuits, and in particular it can be used to derandomize the result of Abasi et al. for $r$-Simple $k$-Path. On our way to this result we introduce the notion of representative sets for multisets, which may be of independent interest. Finally, we give two more examples of problems, already studied in the literature, where the same relaxation phenomenon happens. The first is a natural relaxation of the Set Packing problem, where we allow the packed sets to overlap at each element at most $r$ times. The second is Degree Bounded Spanning Tree, where we seek a spanning tree of the graph with a small maximum degree.

    Fast and Scalable Minimal Perfect Hashing for Massive Key Sets

    Get PDF
    Minimal perfect hash functions provide space-efficient and collision-free hashing on static sets. Existing algorithms and implementations that build such functions have practical limitations on the number of input elements they can process, due to high construction time, RAM or external memory usage. We revisit a simple algorithm and show that it is highly competitive with the state of the art, especially in terms of construction time and memory usage. We provide a parallel C++ implementation called BBhash. It is capable of creating a minimal perfect hash function of $10^{10}$ elements in less than 7 minutes using 8 threads and 5 GB of memory, and the resulting function uses 3.7 bits/element. To the best of our knowledge, this is also the first implementation that has been successfully tested on an input of cardinality $10^{12}$. Source code: https://github.com/rizkg/BBHash
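    The simple algorithm BBhash revisits can be summarized as a cascade of collision bitmaps; the Python sketch below (with a hypothetical per-level hash hb, a naive rank in place of a constant-time rank structure, and details treated as approximate) shows the idea: keys that hash to a slot alone are finalized at that level, colliding keys fall through to a smaller next level, and the MPHF value of a key is the rank of its bit across all levels:

```python
import hashlib

def hb(key, level, m):
    # Hypothetical per-level hash into [0, m).
    d = hashlib.blake2b(f"{level}:{key}".encode(), digest_size=8).digest()
    return int.from_bytes(d, "big") % m

def build(keys, gamma=2.0, max_levels=64):
    # One bit array per level; a set bit means "exactly one key landed
    # here at this level". gamma controls per-level space vs. fallthrough.
    levels = []
    remaining = list(keys)
    for level in range(max_levels):
        m = max(1, int(gamma * len(remaining)))
        counts = [0] * m
        for k in remaining:
            counts[hb(k, level, m)] += 1
        levels.append([1 if c == 1 else 0 for c in counts])
        remaining = [k for k in remaining if counts[hb(k, level, m)] != 1]
        if not remaining:
            return levels
    raise RuntimeError("keys left after max_levels; retry with other hashes")

def query(levels, key):
    # MPHF value = rank of the key's set bit over the concatenated levels.
    offset = 0
    for level, bits in enumerate(levels):
        pos = hb(key, level, len(bits))
        if bits[pos]:
            return offset + sum(bits[:pos])  # real code uses O(1) rank
        offset += sum(bits)
    raise KeyError(key)
```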

    Parallel Searching for a First Solution

    Get PDF
    A parallel algorithm for conducting a search for a first solution to the problem of generating minimal perfect hash functions is presented. A message-based distributed-memory computer is assumed as a model for parallel computation. A data structure, called the reverse trie (r-trie), was devised to carry out the search. The algorithm was implemented on a transputer network. The experiments showed that the algorithm exhibits a consistent and almost linear speed-up. The r-trie structure proved to be highly memory-efficient.