ENGINEERING COMPRESSED STATIC FUNCTIONS AND MINIMAL PERFECT HASH FUNCTIONS

M. Genuzio

ENGINEERING COMPRESSED STATIC FUNCTIONS AND MINIMAL PERFECT HASH FUNCTIONS

Authors: M. Genuzio
Publication date: 27 February 2018
Publisher: Università degli Studi di Milano
Doi

Abstract

\emph{Static functions} are data structures meant to store arbitrary mappings from finite sets to integers; that is, given universe of items

U

, a set of

n \in \mathbb{N}

pairs

(k_i,v_i)

where

k_i \in S \subset U, |S|=n

, and

v_i \in \{0, 1, \ldots, m-1\} , m \in \mathbb{N}

, a static function will retrieve

v_i

given

k_i

(usually, in constant time). When every key is mapped into a different value this function is called \emph{perfect hash function} and when

n=m

the data structure yields an injective numbering

S\to \lbrace0,1, \ldots n-1 \rbrace

; this mapping is called a \emph{minimal perfect hash function}. Big data brought back one of the most critical challenges that computer scientists have been tackling during the last fifty years, that is, analyzing big amounts of data that do not fit in main memory. While for small keysets these mappings can be easily implemented using hash tables, this solution does not scale well for bigger sets. Static functions and MPHFs break the information-theoretical lower bound of storing the set

S

because they are allowed to return \emph{any} value if the queried key is not in the original keyset. The classical constructions technique for static functions can achieve just

O(nb)

bits space, where

b=\log(m)

, and the one for MPHFs

O(n)

bits of space (always with constant access time). All these features make static functions and MPHFs powerful techniques when handling, for instance, large sets of strings, and they are essential building blocks of space-efficient data structures such as (compressed) full-text indexes, monotone MPHFs, Bloom filter-like data structures, and prefix-search data structures. The biggest challenge of this construction technique involves lowering the multiplicative constants hidden inside the asymptotic space bounds while keeping feasible construction times. In this thesis, we take advantage of the recent result in random linear systems theory regarding the ratio between the number of variables and number of the equations, and in perfect hash data structures, to achieve practical static functions with the lowest space bounds so far, and construction time comparable with widely used techniques. The new results, however, require solving linear systems that require more than a simple triangulation process, as it happens in current state-of-the-art solutions. The main challenge in making such structures usable is mitigating the cubic running time of Gaussian elimination at construction time. To this purpose, we introduce novel techniques based on \emph{broadword programming} and a heuristic derived from \emph{structured Gaussian elimination}. We obtained data structures that are significantly smaller than commonly used hypergraph-based constructions while maintaining or improving the lookup times and providing still feasible construction.We then apply these improvements to another kind of structures: \emph{compressed static hash functions}. The theoretical construction technique for this kind of data structure uses prefix-free codes with variable length to encode the set of values. Adopting this solution, we can reduce the\n space usage of each element to (essentially) the entropy of the list of output values of the function.Indeed, we need to solve an even bigger linear system of equations, and the time required to build the structure increases. In this thesis, we present the first engineered implementation of compressed hash functions. For example, we were able to store a function with geometrically distributed output, with parameter

p=0.5

in just

2.28

bit per key, independently of the key set, with a construction time double with respect to that of a state-of-the-art non-compressed function, which requires

\approx\log \log n

bits per key, where

n

is the number of keys, and similar lookup time. We can also store a function with an output distributed following a Zipfian distribution with parameter