Fast Scalable Construction of (Minimal Perfect Hash) Functions
Recent advances in random linear systems on finite fields have paved the way
for the construction of constant-time data structures representing static
functions and minimal perfect hash functions using less space with respect to
existing techniques. The main obstruction for any practical application of
these results is the cubic-time Gaussian elimination required to solve these
linear systems: although they can be made very small, the computation is still
too slow to be feasible.
In this paper we describe in detail a number of heuristics and programming
techniques to speed up the solution of these systems by several orders of
magnitude, making the overall construction competitive with the standard and
widely used MWHC technique, which is based on hypergraph peeling. In
particular, we introduce broadword programming techniques for fast equation
manipulation and a lazy Gaussian elimination algorithm. We also describe a
number of technical improvements to the data structure which further reduce
space usage and improve lookup speed.
Our implementation of these techniques yields a minimal perfect hash function
data structure occupying 2.24 bits per element, compared to 2.68 for MWHC-based
ones, and a static function data structure which reduces the multiplicative
overhead from 1.23 to 1.03.
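
As a rough illustration of why word-level operations help with equation
manipulation, the sketch below solves a small linear system over GF(2),
storing each equation as an integer bitmask so that adding one equation to
another is a single XOR. It is ordinary Gauss-Jordan elimination, not the lazy
variant the abstract refers to; the name `solve_gf2` and the bitmask layout are
assumptions made for this example only.

```python
def solve_gf2(rows, rhs, num_vars):
    """Solve A x = b over GF(2).

    rows[i] is an int whose bit j is the coefficient of x_j in equation i;
    rhs[i] is 0 or 1. Returns a list of 0/1 values (free variables set to 0),
    or None if the system is inconsistent.
    """
    # Pack the right-hand side into bit `num_vars` so one XOR updates both sides.
    eqs = [r | (b << num_vars) for r, b in zip(rows, rhs)]
    pivots = []  # (column, equation index) pairs
    used = 0     # number of equations that already hold a pivot
    for col in range(num_vars):
        bit = 1 << col
        # Find a not-yet-used equation with a 1 in this column.
        p = next((i for i in range(used, len(eqs)) if eqs[i] & bit), None)
        if p is None:
            continue  # no pivot here: x_col is a free variable
        eqs[used], eqs[p] = eqs[p], eqs[used]
        # Eliminate this column from every other equation.
        for i in range(len(eqs)):
            if i != used and eqs[i] & bit:
                eqs[i] ^= eqs[used]
        pivots.append((col, used))
        used += 1
    # A leftover equation reading "0 = 1" means there is no solution.
    if any(e == (1 << num_vars) for e in eqs[used:]):
        return None
    x = [0] * num_vars
    for col, i in pivots:
        x[col] = (eqs[i] >> num_vars) & 1
    return x
```

For example, the system x_0 + x_1 = 1, x_1 + x_2 = 0, x_0 + x_2 = 1 is passed as
`solve_gf2([0b011, 0b110, 0b101], [1, 0, 1], 3)`. Python integers are
arbitrary-precision, so one XOR covers a whole equation; a C or Java
implementation would loop over 64-bit words instead.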
Fast evaluation of union-intersection expressions
We show how to represent sets in a linear space data structure such that
expressions involving unions and intersections of sets can be computed in a
worst-case efficient way. This problem has applications in e.g. information
retrieval and database systems. We mainly consider the RAM model of
computation, and sets of machine words, but also state our results in the I/O
model. On a RAM with word size $w$, a special case of our result is that the
intersection of $m$ (preprocessed) sets, containing $n$ elements in total, can
be computed in expected time $O(n (\log w)^2 / w + km)$, where $k$ is the
number of elements in the intersection. If the first of the two terms
dominates, this is a factor $w^{1-o(1)}$ faster than the standard solution of
merging sorted lists. We show a cell probe lower bound of time
$\Omega(n/(w m \log m) + (1 - \tfrac{\log k}{w}) k)$, meaning that our upper
bound is nearly optimal for small $m$. Our algorithm uses a novel combination
of approximate set representations and word-level parallelism.
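
For reference, the baseline that the abstract compares against, intersecting
sets by merging sorted lists, might look like the sketch below. This is the
standard technique, not the paper's algorithm; the function name
`intersect_sorted` is hypothetical, and the lists are assumed to be sorted and
free of duplicates.

```python
def intersect_sorted(lists):
    """Intersect sorted lists of distinct integers by repeated two-way merging."""
    if not lists:
        return []
    result = list(lists[0])
    for other in lists[1:]:
        merged, i, j = [], 0, 0
        # Advance two cursors; keep an element only when both lists contain it.
        while i < len(result) and j < len(other):
            if result[i] == other[j]:
                merged.append(result[i])
                i += 1
                j += 1
            elif result[i] < other[j]:
                i += 1
            else:
                j += 1
        result = merged
    return result
```

Each merge step touches every element of its two inputs at most once, so the
running time is linear in the total input size with no benefit from word-level
parallelism.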
Secondary Indexing in One Dimension: Beyond B-trees and Bitmap Indexes
Let S be a finite, ordered alphabet, and let x = x_1 x_2 ... x_n be a string
over S. A "secondary index" for x answers alphabet range queries of the form:
Given a range [a_l,a_r] over S, return the set I_{[a_l;a_r]} = {i |x_i \in
[a_l; a_r]}. Secondary indexes are heavily used in relational databases and
scientific data analysis. It is well-known that the obvious solution, storing a
dictionary for the position set associated with each character, does not always
give optimal query time. In this paper we give the first theoretically optimal
data structure for the secondary indexing problem. In the I/O model, the amount
of data read when answering a query is within a constant factor of the minimum
space needed to represent I_{[a_l;a_r]}, assuming that the size of internal
memory is (|S| log n)^{delta} blocks, for some constant delta > 0. The space
usage of the data structure is O(n log |S|) bits in the worst case, and we
further show how to bound the size of the data structure in terms of the 0-th
order entropy of x. We show how to support updates achieving various time-space
trade-offs.
We also consider an approximate version of the basic secondary indexing
problem where a query reports a superset of I_{[a_l;a_r]} containing each
element not in I_{[a_l;a_r]} with probability at most epsilon, where epsilon >
0 is the false positive probability. For this problem the amount of data that
needs to be read by the query algorithm is reduced to O(|I_{[a_l;a_r]}|
log(1/epsilon)) bits.
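
The "obvious solution" mentioned in this abstract, a dictionary holding the
position set of each character, can be sketched as follows. This is the naive
baseline, not the optimal structure from the paper; the class name
`NaiveSecondaryIndex` and its methods are hypothetical, and positions are
1-based to match x_1 ... x_n.

```python
from bisect import bisect_left, bisect_right
from collections import defaultdict


class NaiveSecondaryIndex:
    """One sorted position list per character of the string x."""

    def __init__(self, x):
        positions = defaultdict(list)
        for i, ch in enumerate(x, start=1):
            positions[ch].append(i)
        self.alphabet = sorted(positions)  # characters that occur in x
        self.positions = dict(positions)

    def query(self, a_l, a_r):
        """Return all positions i with a_l <= x_i <= a_r, in increasing order."""
        lo = bisect_left(self.alphabet, a_l)
        hi = bisect_right(self.alphabet, a_r)
        out = []
        for ch in self.alphabet[lo:hi]:
            out.extend(self.positions[ch])
        return sorted(out)
```

A query over a wide alphabet range has to visit and merge one list per
character in the range, which is one way to see why this layout does not
always give optimal query time.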
Approximate Range Emptiness in Constant Time and Optimal Space
This paper studies the $\varepsilon$-approximate range emptiness problem,
where the task is to represent a set $S$ of $n$ points from $\{0,\ldots,U-1\}$
and answer emptiness queries of the form "$[a;b] \cap S \neq \emptyset$?" with
a probability of false positives allowed. This generalizes the functionality
of Bloom filters from single point queries to any interval length $L$. Setting
the false positive rate to $\varepsilon/L$ and performing $L$ queries, Bloom
filters yield a solution to this problem with space $O(n \lg(L/\varepsilon))$
bits, false positive probability bounded by $\varepsilon$ for intervals of
length up to $L$, using query time $O(L \lg(L/\varepsilon))$. Our first
contribution is to show that the space/error trade-off cannot be improved
asymptotically: any data structure for answering approximate range emptiness
queries on intervals of length up to $L$ with false positive probability
$\varepsilon$ must use space $\Omega(n \lg(L/\varepsilon)) - O(n)$ bits. On the
positive side we show that the query time can be improved greatly, to constant
time, while matching our space lower bound up to a lower order additive term.
This result is achieved through a succinct data structure for
(non-approximate 1d) range emptiness/reporting queries, which may be of
independent interest.
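
The Bloom-filter baseline described in this abstract can be sketched roughly
as below: store the points in an ordinary Bloom filter whose per-point false
positive target is the overall error divided by the maximum interval length,
then answer a range query by probing every point in the interval (a union
bound over the probes gives the overall error). This is only the slow
baseline, not the constant-time structure from the paper. The class name
`BloomRangeFilter`, the BLAKE2b-based double hashing, and the textbook sizing
formulas are all assumptions chosen for the example.

```python
import hashlib
import math


class BloomRangeFilter:
    """Approximate range emptiness via point probes on a Bloom filter."""

    def __init__(self, points, max_len, eps):
        points = list(points)
        n = max(1, len(points))
        p = eps / max_len  # per-point target so that max_len probes stay within eps
        # Textbook Bloom filter sizing: m bits and k hash functions.
        self.m = max(1, math.ceil(-n * math.log(p) / (math.log(2) ** 2)))
        self.k = max(1, round(self.m / n * math.log(2)))
        self.max_len = max_len
        self.bits = bytearray((self.m + 7) // 8)
        for x in points:
            for pos in self._positions(x):
                self.bits[pos >> 3] |= 1 << (pos & 7)

    def _positions(self, x):
        # Double hashing: derive k positions from two halves of one digest.
        d = hashlib.blake2b(str(x).encode(), digest_size=16).digest()
        h1 = int.from_bytes(d[:8], "little")
        h2 = int.from_bytes(d[8:], "little") | 1
        return [(h1 + i * h2) % self.m for i in range(self.k)]

    def _contains(self, x):
        return all((self.bits[p >> 3] >> (p & 7)) & 1 for p in self._positions(x))

    def maybe_nonempty(self, a, b):
        """False means [a, b] certainly holds no stored point; True may be a false positive."""
        if b - a + 1 > self.max_len:
            raise ValueError("interval longer than max_len")
        return any(self._contains(x) for x in range(a, b + 1))
```

Query time grows linearly with the interval length because every point in the
interval is probed separately, which is the cost the paper's constant-time
result removes.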
Bloom Filters in Adversarial Environments
Many efficient data structures use randomness, allowing them to improve upon
deterministic ones. Usually, their efficiency and correctness are analyzed
using probabilistic tools under the assumption that the inputs and queries are
independent of the internal randomness of the data structure. In this work, we
consider data structures in a more robust model, which we call the adversarial
model. Roughly speaking, this model allows an adversary to choose inputs and
queries adaptively according to previous responses. Specifically, we consider a
data structure known as "Bloom filter" and prove a tight connection between
Bloom filters in this model and cryptography.
A Bloom filter represents a set $S$ of elements approximately, by using fewer
bits than a precise representation. The price for succinctness is allowing some
errors: for any $x \in S$ it should always answer `Yes', and for any
$x \notin S$ it should answer `Yes' only with small probability.
In the adversarial model, we consider both efficient adversaries (that run in
polynomial time) and computationally unbounded adversaries that are only
bounded in the number of queries they can make. For computationally bounded
adversaries, we show that non-trivial (memory-wise) Bloom filters exist if and
only if one-way functions exist. For unbounded adversaries we show that there
exists a Bloom filter for sets of size $n$ and error $\varepsilon$ that is
secure against $t$ queries and uses only $O(n \log(1/\varepsilon) + t)$
bits of memory. In comparison, $n \log(1/\varepsilon)$ is the best
possible under a non-adaptive adversary.