Learning-Augmented Skip Lists

Authors: Fu Chunkai, Seo Jung Hoon, Zhou Samson
Publication date: 16/02/2024

We study the integration of machine learning advice into the design of skip lists to improve upon traditional data structure design. Given access to a possibly erroneous oracle that outputs estimated fractional frequencies for search queries on a set of items, we construct a skip list that provably provides the optimal expected search time, within nearly a factor of two. In fact, our learning-augmented skip list is still optimal up to a constant factor, even if the oracle is only accurate within a constant factor. We show that if the search queries follow the ubiquitous Zipfian distribution, then the expected search time for an item by our skip list is only a constant, independent of the total number $n$ of items, i.e., $\mathcal{O}(1)$, whereas a traditional skip list will have an expected search time of $\mathcal{O}(\log n)$. We also demonstrate robustness by showing that our data structure achieves an expected search time that is within a constant factor of an oblivious skip list construction even when the predictions are arbitrarily incorrect. Finally, we empirically show that our learning-augmented skip list outperforms traditional skip lists on both synthetic and real-world datasets.
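The idea of letting predicted query frequencies shape a skip list can be illustrated with a minimal sketch. The height rule below is an assumption made for illustration and is not the construction from the paper: every key gets the usual coin-flip height, and a key with predicted frequency `p` is additionally raised to at least `floor(log2(n * p)) + 1` levels. The names `AdvisedSkipList` and `advised_height` are hypothetical.

```python
import math
import random


class Node:
    def __init__(self, key, height):
        self.key = key
        self.next = [None] * height  # next[i] is the successor on level i


def advised_height(p, n, rng, max_height):
    """Hypothetical rule: classic coin-flip height, raised for frequent keys."""
    h = 1
    while h < max_height and rng.random() < 0.5:
        h += 1
    # Assumed boost: keys predicted to be frequent get at least this many levels.
    boost = math.floor(math.log2(max(n * p, 1.0))) + 1
    return max(h, min(boost, max_height))


class AdvisedSkipList:
    def __init__(self, predicted, seed=0):
        # predicted maps each key to its estimated fraction of all queries.
        rng = random.Random(seed)
        n = len(predicted)
        self.max_height = math.floor(math.log2(max(n, 2))) + 1
        self.head = Node(None, self.max_height)
        last = [self.head] * self.max_height  # rightmost node seen per level
        for key in sorted(predicted):
            height = advised_height(predicted[key], n, rng, self.max_height)
            node = Node(key, height)
            for level in range(height):
                last[level].next[level] = node
                last[level] = node

    def search(self, key):
        """Return (found, number of key comparisons)."""
        node, comparisons = self.head, 0
        for level in reversed(range(self.max_height)):
            while node.next[level] is not None:
                comparisons += 1
                if node.next[level].key < key:
                    node = node.next[level]
                else:
                    break
        candidate = node.next[0]
        return (candidate is not None and candidate.key == key), comparisons


if __name__ == "__main__":
    n = 1000
    harmonic = sum(1.0 / i for i in range(1, n + 1))
    zipf = {i: (1.0 / i) / harmonic for i in range(1, n + 1)}
    skip_list = AdvisedSkipList(zipf, seed=1)
    for key in (1, 2, 500, 1000):
        print(key, skip_list.search(key))
```

Because frequent keys sit on the upper levels, a search for them tends to stop after fewer comparisons than a search for a rare key. The sketch is static (built once from sorted keys) and does not attempt to reproduce the paper's guarantees.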

arXiv.org e-Print Archive

Learning-Augmented B-Trees

Authors: Cao Xinyuan, Chen Jingbang, Chen Li, Lambert Chris, Peng Richard, Sleator Daniel
Publication date: 24/07/2023

We study learning-augmented binary search trees (BSTs) and B-Trees via Treaps with composite priorities. The result is a simple search tree where the depth of each item is determined by its predicted weight $w_x$. To achieve the result, each item $x$ has its composite priority $-\lfloor\log\log(1/w_x)\rfloor + U(0, 1)$, where $U(0, 1)$ is the uniform random variable. This generalizes the recent learning-augmented BSTs [Lin-Luo-Woodruff ICML'22], which only work for Zipfian distributions, to arbitrary inputs and predictions. It also gives the first B-Tree data structure that can provably take advantage of localities in the access sequence via online self-reorganization. The data structure is robust to prediction errors and handles insertions, deletions, as well as prediction updates. Comment: 25 pages
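The composite priority from the abstract can be tried out in a small treap sketch. Several details here are assumptions, since the abstract does not state them: logarithms are taken base 2, weights satisfy 0 < w < 1, and the treap is a max-heap on priority so that heavier items end up nearer the root. The names `composite_priority`, `TreapNode`, `insert`, and `depth` are hypothetical.

```python
import math
import random


def composite_priority(w, rng):
    """-floor(log log(1/w)) + U(0, 1), assuming base-2 logs and 0 < w < 1."""
    return -math.floor(math.log2(math.log2(1.0 / w))) + rng.random()


class TreapNode:
    def __init__(self, key, priority):
        self.key = key
        self.priority = priority
        self.left = None
        self.right = None


def rotate_right(node):
    left = node.left
    node.left = left.right
    left.right = node
    return left


def rotate_left(node):
    right = node.right
    node.right = right.left
    right.left = node
    return right


def insert(root, key, priority):
    """Standard treap insertion; larger priorities are rotated toward the root."""
    if root is None:
        return TreapNode(key, priority)
    if key < root.key:
        root.left = insert(root.left, key, priority)
        if root.left.priority > root.priority:
            root = rotate_right(root)
    else:
        root.right = insert(root.right, key, priority)
        if root.right.priority > root.priority:
            root = rotate_left(root)
    return root


def depth(root, key):
    """Number of edges from the root to key, or -1 if key is absent."""
    d = 0
    while root is not None and root.key != key:
        root = root.left if key < root.key else root.right
        d += 1
    return d if root is not None else -1


if __name__ == "__main__":
    rng = random.Random(0)
    # Illustrative weights: one hot key takes half of the predicted mass.
    weights = {k: 0.5 / 99 for k in range(100)}
    weights[42] = 0.5
    root = None
    for k, w in weights.items():
        root = insert(root, k, composite_priority(w, rng))
    print("depth of hot key:", depth(root, 42))
    print("depth of a cold key:", depth(root, 7))
```

The integer part of the priority groups items into tiers by weight, while the uniform term breaks ties randomly within a tier, as in an ordinary treap.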

arXiv.org e-Print Archive

Bridging the gap between algorithmic and learned index structures

Author: Hadian Ali
Publication venue: Computing, Imperial College London
Publication date: 01/07/2022

Index structures such as B-trees and Bloom filters are the well-established petrol engines of database systems. However, these structures do not fully exploit patterns in data distribution. To address this, researchers have suggested using machine learning models as electric engines that can entirely replace index structures. Such a paradigm shift in data system design, however, opens many unsolved design challenges. More research is needed to understand the theoretical guarantees and design efficient support for insertion and deletion. In this thesis, we adopt a different position: index algorithms are good enough, and instead of going back to the drawing board to fit data systems with learned models, we should develop lightweight hybrid engines that build on the benefits of both algorithmic and learned index structures. The indexes that we suggest provide the theoretical performance guarantees and updatability of algorithmic indexes while using position prediction models to leverage the data distributions and thereby improve the performance of the index structure. We investigate the potential for minimal modifications to algorithmic indexes such that they can leverage data distribution similar to how learned indexes work. In this regard, we propose and explore the use of helping models that boost classical index performance using techniques from machine learning. Our suggested approach inherits performance guarantees from its algorithmic baseline index, but at the same time it considers the data distribution to improve performance considerably. We study single-dimensional range indexes, spatial indexes, and stream indexing, and show that the suggested approach results in range indexes that outperform the algorithmic indexes and have comparable performance to the read-only, fully learned indexes and hence can be reliably used as a default index structure in a database engine.
We also consider the updatability of the indexes and suggest solutions for updating the index, notably when the data distribution drastically changes over time (e.g., for indexing data streams). In particular, we propose a specific learning-augmented index for indexing a sliding window with timestamps in a data stream. Additionally, we highlight the limitations of learned indexes for low-latency lookup on real-world data distributions. To tackle this issue, we suggest adding an algorithmic enhancement layer to a learned model to correct the prediction error with a small memory latency. This approach enables efficient modelling of the data distribution and resolves the local biases of a learned model at the cost of roughly one memory lookup. Open Access
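The general pattern of pairing a position-prediction model with an algorithmic correction step can be sketched as follows. This is a generic illustration under stated assumptions, not the design from the thesis: a least-squares line maps keys to positions in a sorted array, the largest prediction error seen at build time is recorded, and a lookup binary-searches only inside that error window. The class name `LinearHelperIndex` is hypothetical, and keys are assumed to be distinct numbers.

```python
import bisect


class LinearHelperIndex:
    def __init__(self, keys):
        self.keys = sorted(keys)  # assumes distinct numeric keys
        n = len(self.keys)
        mean_key = sum(self.keys) / n
        mean_pos = (n - 1) / 2
        var = sum((k - mean_key) ** 2 for k in self.keys)
        cov = sum((k - mean_key) * (i - mean_pos)
                  for i, k in enumerate(self.keys))
        # Least-squares fit of position as a linear function of the key.
        self.slope = cov / var if var else 0.0
        self.intercept = mean_pos - self.slope * mean_key
        # Worst-case error of the model over the indexed keys.
        self.max_err = max(abs(i - self._predict(k))
                           for i, k in enumerate(self.keys))

    def _predict(self, key):
        """Predicted position, clamped to the valid index range."""
        guess = int(round(self.slope * key + self.intercept))
        return min(max(guess, 0), len(self.keys) - 1)

    def lookup(self, key):
        """Return the position of key, or -1 if it is not indexed."""
        n = len(self.keys)
        guess = self._predict(key)
        lo = max(guess - self.max_err, 0)
        hi = min(guess + self.max_err + 1, n)
        # Correction step: binary search restricted to the error window.
        i = bisect.bisect_left(self.keys, key, lo, hi)
        return i if i < n and self.keys[i] == key else -1


if __name__ == "__main__":
    index = LinearHelperIndex([x * x for x in range(200)])
    print(index.lookup(144), index.lookup(145), index.max_err)
```

Because any indexed key lies within `max_err` positions of its prediction, the restricted search always finds it, so correctness comes from the algorithmic step while the model only narrows the range.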

Spiral - Imperial College Digital Repository