9,985 research outputs found

    The Case for Learned Index Structures

    Full text link
    Indexes are models: a B-Tree-Index can be seen as a model to map a key to the position of a record within a sorted array, a Hash-Index as a model to map a key to a position of a record within an unsorted array, and a BitMap-Index as a model to indicate if a data record exists or not. In this exploratory research paper, we start from this premise and posit that all existing index structures can be replaced with other types of models, including deep-learning models, which we term learned indexes. The key idea is that a model can learn the sort order or structure of lookup keys and use this signal to effectively predict the position or existence of records. We theoretically analyze under which conditions learned indexes outperform traditional index structures and describe the main challenges in designing learned index structures. Our initial results show, that by using neural nets we are able to outperform cache-optimized B-Trees by up to 70% in speed while saving an order-of-magnitude in memory over several real-world data sets. More importantly though, we believe that the idea of replacing core components of a data management system through learned models has far reaching implications for future systems designs and that this work just provides a glimpse of what might be possible

    Approximate Range Emptiness in Constant Time and Optimal Space

    No full text
    This paper studies the \emph{ε\varepsilon-approximate range emptiness} problem, where the task is to represent a set SS of nn points from {0,,U1}\{0,\ldots,U-1\} and answer emptiness queries of the form "[a;b]S[a ; b]\cap S \neq \emptyset ?" with a probability of \emph{false positives} allowed. This generalizes the functionality of \emph{Bloom filters} from single point queries to any interval length LL. Setting the false positive rate to ε/L\varepsilon/L and performing LL queries, Bloom filters yield a solution to this problem with space O(nlg(L/ε))O(n \lg(L/\varepsilon)) bits, false positive probability bounded by ε\varepsilon for intervals of length up to LL, using query time O(Llg(L/ε))O(L \lg(L/\varepsilon)). Our first contribution is to show that the space/error trade-off cannot be improved asymptotically: Any data structure for answering approximate range emptiness queries on intervals of length up to LL with false positive probability ε\varepsilon, must use space Ω(nlg(L/ε))O(n)\Omega(n \lg(L/\varepsilon)) - O(n) bits. On the positive side we show that the query time can be improved greatly, to constant time, while matching our space lower bound up to a lower order additive term. This result is achieved through a succinct data structure for (non-approximate 1d) range emptiness/reporting queries, which may be of independent interest

    Bloom Filters in Adversarial Environments

    Get PDF
    Many efficient data structures use randomness, allowing them to improve upon deterministic ones. Usually, their efficiency and correctness are analyzed using probabilistic tools under the assumption that the inputs and queries are independent of the internal randomness of the data structure. In this work, we consider data structures in a more robust model, which we call the adversarial model. Roughly speaking, this model allows an adversary to choose inputs and queries adaptively according to previous responses. Specifically, we consider a data structure known as "Bloom filter" and prove a tight connection between Bloom filters in this model and cryptography. A Bloom filter represents a set SS of elements approximately, by using fewer bits than a precise representation. The price for succinctness is allowing some errors: for any xSx \in S it should always answer `Yes', and for any xSx \notin S it should answer `Yes' only with small probability. In the adversarial model, we consider both efficient adversaries (that run in polynomial time) and computationally unbounded adversaries that are only bounded in the number of queries they can make. For computationally bounded adversaries, we show that non-trivial (memory-wise) Bloom filters exist if and only if one-way functions exist. For unbounded adversaries we show that there exists a Bloom filter for sets of size nn and error ε\varepsilon, that is secure against tt queries and uses only O(nlog1ε+t)O(n \log{\frac{1}{\varepsilon}}+t) bits of memory. In comparison, nlog1εn\log{\frac{1}{\varepsilon}} is the best possible under a non-adaptive adversary
    corecore