ENGINEERING COMPRESSED STATIC FUNCTIONS AND MINIMAL PERFECT HASH FUNCTIONS

Author: M. Genuzio
Publication venue: Università degli Studi di Milano
Publication date: 27/02/2018

\emph{Static functions} are data structures meant to store arbitrary mappings from finite sets to integers; that is, given a universe of items $U$, a set of $n \in \mathbb{N}$ pairs $(k_i, v_i)$ where $k_i \in S \subset U$, $|S| = n$, and $v_i \in \{0, 1, \ldots, m-1\}$, $m \in \mathbb{N}$, a static function will retrieve $v_i$ given $k_i$ (usually, in constant time). When every key is mapped into a different value this function is called a \emph{perfect hash function}, and when $n = m$ the data structure yields an injective numbering $S \to \{0, 1, \ldots, n-1\}$; this mapping is called a \emph{minimal perfect hash function} (MPHF). Big data brought back one of the most critical challenges that computer scientists have been tackling during the last fifty years, that is, analyzing large amounts of data that do not fit in main memory. While for small key sets these mappings can be easily implemented using hash tables, this solution does not scale well to bigger sets. Static functions and MPHFs break the information-theoretical lower bound of storing the set $S$ because they are allowed to return \emph{any} value if the queried key is not in the original key set. The classical construction techniques for static functions can achieve just $O(nb)$ bits of space, where $b = \log(m)$, and the one for MPHFs $O(n)$ bits of space (always with constant access time). All these features make static functions and MPHFs powerful techniques when handling, for instance, large sets of strings, and they are essential building blocks of space-efficient data structures such as (compressed) full-text indexes, monotone MPHFs, Bloom filter-like data structures, and prefix-search data structures. The biggest challenge of these construction techniques is lowering the multiplicative constants hidden inside the asymptotic space bounds while keeping construction times feasible. In this thesis, we take advantage of recent results in random linear systems theory regarding the ratio between the number of variables and the number of equations, and in perfect hash data structures, to achieve practical static functions with the lowest space bounds so far and construction times comparable with widely used techniques. The new results, however, require solving linear systems that need more than the simple triangulation process used in current state-of-the-art solutions. The main challenge in making such structures usable is mitigating the cubic running time of Gaussian elimination at construction time. To this purpose, we introduce novel techniques based on \emph{broadword programming} and a heuristic derived from \emph{structured Gaussian elimination}. We obtained data structures that are significantly smaller than commonly used hypergraph-based constructions while maintaining or improving the lookup times and still providing feasible construction times. We then apply these improvements to another kind of structure: \emph{compressed static hash functions}. The theoretical construction technique for this kind of data structure uses variable-length prefix-free codes to encode the set of values.
Adopting this solution, we can reduce the space usage of each element to (essentially) the entropy of the list of output values of the function. However, we need to solve an even bigger linear system of equations, and the time required to build the structure increases. In this thesis, we present the first engineered implementation of compressed hash functions. For example, we were able to store a function with geometrically distributed output, with parameter $p = 0.5$, in just $2.28$ bits per key, independently of the key set, with a construction time twice that of a state-of-the-art non-compressed function, which requires $\approx \log \log n$ bits per key, where $n$ is the number of keys, and with similar lookup time. We can also store a function with output following a Zipfian distribution with parameters $s = 2$ and $N = 10^6$ in just $2.75$ bits per key, whereas a non-compressed function would require more than $20$, with a threefold increase in construction time and significantly faster lookups.
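
The space figures quoted above can be put next to the entropy of the two output distributions, which is the target a compressed static function aims for. The following is only a rough sketch: the function names are hypothetical, and it assumes that $N$ denotes the number of distinct output values of the Zipfian distribution, which the abstract does not state explicitly.

```python
import math


def geometric_entropy_bits(p):
    """Entropy, in bits, of a geometric distribution with success probability p."""
    q = 1.0 - p
    return (-q * math.log2(q) - p * math.log2(p)) / p


def zipf_entropy_bits(s, n):
    """Entropy, in bits, of a Zipfian distribution over ranks 1..n with exponent s.

    Assumption: P(k) is proportional to k**(-s) for k = 1, ..., n.
    """
    weights = [k ** -s for k in range(1, n + 1)]
    z = math.fsum(weights)  # normalisation constant
    return -math.fsum((w / z) * math.log2(w / z) for w in weights)


if __name__ == "__main__":
    print("geometric, p = 0.5:", geometric_entropy_bits(0.5), "bits")
    print("Zipfian, s = 2, N = 10^6:", zipf_entropy_bits(2, 10**6), "bits")
    print("fixed-width cost for 10^6 values:", math.log2(10**6), "bits")
```

Under these assumptions the geometric case has an entropy of exactly 2 bits and the Zipfian case about 2.36 bits, so the reported 2.28 and 2.75 bits per key sit a fraction of a bit above the entropy.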

AIR Università degli Studi di Milano

Compressing Labels of Dynamic XML Data using Base-9 Scheme and Fibonacci Encoding

Author: Al Zadjali Hanaa
Publication venue: 'University of Sheffield Conference Proceedings'
Publication date: 01/11/2017

The flexibility and self-describing nature of XML have made it the most common mark-up language used for data representation over the Web. XML data is naturally modelled as a tree, where the structural tree information can be encoded into labels via an XML labelling scheme in order to permit answers to queries without the need to access the original XML files. As the transmission of XML data over the Internet has grown, it has also become necessary to have an XML labelling scheme that supports dynamic XML data. For a large-scale and frequently updated XML document, existing dynamic XML labelling schemes still suffer from high growth rates in terms of their label size, which can result in overflow problems and/or ambiguous data/query retrievals. This thesis considers the compression of XML labels. A novel XML labelling scheme, named “Base-9”, has been developed to generate labels that are as compact as possible and yet provide efficient support for queries to both static and dynamic XML data. A Fibonacci prefix-encoding method has been used for the first time to store Base-9’s XML labels in a compressed format, with the intention of minimising the storage space without degrading XML querying performance. The thesis also investigates the compression of XML labels using various existing prefix-encoding methods. This investigation has resulted in the proposal of a novel prefix-encoding method named “Elias-Fibonacci of order 3”, which has achieved the fastest encoding time of all prefix-encoding methods studied in this thesis, whereas Fibonacci encoding was found to require the minimum storage. Unlike current XML labelling schemes, the new Base-9 labelling scheme ensures the generation of short labels even after large, frequent, skewed insertions.
The experimental results reported herein support the advantages, in terms of storing, updating, retrieving and querying XML data, of the short labels generated by combining the Base-9 scheme with Fibonacci encoding.
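
For readers unfamiliar with Fibonacci encoding, the sketch below shows the standard Fibonacci code for positive integers: a Zeckendorf representation written lowest term first and terminated by an extra `1`, so that every codeword ends in `11`. It is an illustration only; the function names are hypothetical, and how the thesis maps Base-9 labels onto integers before encoding is not described in the abstract.

```python
def fibonacci_encode(n):
    """Return the Fibonacci codeword of a positive integer n as a bit string."""
    if n < 1:
        raise ValueError("Fibonacci coding is defined for positive integers only")
    fibs = [1, 2]
    while fibs[-1] <= n:
        fibs.append(fibs[-1] + fibs[-2])
    fibs.pop()  # drop the first Fibonacci number greater than n
    bits = []
    for f in reversed(fibs):  # greedy Zeckendorf decomposition, largest term first
        if f <= n:
            bits.append("1")
            n -= f
        else:
            bits.append("0")
    # Codewords list the smallest term first and end with a terminating 1.
    return "".join(reversed(bits)) + "1"


def fibonacci_decode(bits):
    """Decode a concatenation of Fibonacci codewords into a list of integers."""
    values, total, prev = [], 0, "0"
    weight, next_weight = 1, 2
    for bit in bits:
        if bit == "1" and prev == "1":  # "11" marks the end of a codeword
            values.append(total)
            total, prev = 0, "0"
            weight, next_weight = 1, 2
            continue
        if bit == "1":
            total += weight
        weight, next_weight = next_weight, weight + next_weight
        prev = bit
    return values


if __name__ == "__main__":
    stream = "".join(fibonacci_encode(v) for v in [1, 2, 3, 4, 11])
    print(stream, fibonacci_decode(stream))
```

Because no codeword contains `11` except at its end, codewords can be concatenated without separators and still be decoded unambiguously, which is the prefix-free property the abstract relies on.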

White Rose E-theses Online

Efficient and Compact Representations of Prefix Codes

Authors: Alberto Ordonez, Gonzalo Navarro, Travis Gagie, Yakov Nekrich
Publication venue: 'Institute of Electrical and Electronics Engineers (IEEE)'
Publication date

Crossref