The Case for Learned Index Structures
Indexes are models: a B-Tree-Index can be seen as a model to map a key to the
position of a record within a sorted array, a Hash-Index as a model to map a
key to a position of a record within an unsorted array, and a BitMap-Index as a
model to indicate if a data record exists or not. In this exploratory research
paper, we start from this premise and posit that all existing index structures
can be replaced with other types of models, including deep-learning models,
which we term learned indexes. The key idea is that a model can learn the sort
order or structure of lookup keys and use this signal to effectively predict
the position or existence of records. We theoretically analyze under which
conditions learned indexes outperform traditional index structures and describe
the main challenges in designing learned index structures. Our initial results
show that by using neural nets we are able to outperform cache-optimized
B-Trees by up to 70% in speed while saving an order-of-magnitude in memory over
several real-world data sets. More importantly though, we believe that the idea
of replacing core components of a data management system through learned models
has far reaching implications for future systems designs and that this work
just provides a glimpse of what might be possible.
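The "index as a model" idea can be illustrated with a minimal sketch. It is not the paper's neural-net design: it assumes a single least-squares line over distinct numeric keys, and the class name `LinearLearnedIndex` and its methods are hypothetical. The model predicts a position in the sorted array, and the worst prediction error seen at build time bounds a short local search.

```python
import bisect


class LinearLearnedIndex:
    """Hypothetical single-model learned index over distinct numeric keys."""

    def __init__(self, keys):
        self.keys = sorted(keys)  # assumed non-empty and free of duplicates
        n = len(self.keys)
        # Least-squares fit of: position ~ slope * key + intercept.
        mean_k = sum(self.keys) / n
        mean_p = (n - 1) / 2
        var = sum((k - mean_k) ** 2 for k in self.keys)
        cov = sum((k - mean_k) * (p - mean_p) for p, k in enumerate(self.keys))
        self.slope = cov / var if var else 0.0
        self.intercept = mean_p - self.slope * mean_k
        # Worst-case prediction error over the stored keys bounds every lookup.
        self.max_err = max(abs(p - self._predict(k)) for p, k in enumerate(self.keys))

    def _predict(self, key):
        return int(round(self.slope * key + self.intercept))

    def lookup(self, key):
        """Return the position of key in the sorted array, or None if absent."""
        n = len(self.keys)
        guess = self._predict(key)
        # Clamp the search window [lo, hi) to the array bounds.
        lo = min(max(0, guess - self.max_err), n)
        hi = max(lo, min(n, guess + self.max_err + 1))
        i = bisect.bisect_left(self.keys, key, lo, hi)
        if i < n and self.keys[i] == key:
            return i
        return None
```

The search window has width at most `2 * max_err + 1`, so the lookup cost depends on how well the model fits the key distribution rather than on the array size alone.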
Balanced Allocations and Double Hashing
Double hashing has recently found more common usage in schemes that use
multiple hash functions. In double hashing, for an item x, one generates two
hash values f(x) and g(x), and then uses combinations (f(x) + k g(x)) mod n for
k = 0, 1, 2, ... to generate multiple hash values from the initial two. We
first perform an empirical study showing that, surprisingly, the performance
difference between double hashing and fully random hashing appears negligible
in the standard balanced allocation paradigm, where each item is placed in the
least loaded of d choices, as well as several related variants. We then
provide theoretical results that explain the behavior of double hashing in this
context.
Comment: Further updated, small improvements/typos fixed
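A small simulation sketch of the comparison described above, under stated assumptions: hash values are stood in for by pseudo-random draws, the two base values are called `f` and `g`, the k-th choice takes the usual double hashing form (f + k*g) mod n, and the function names are hypothetical. Choosing a prime number of bins and a nonzero `g` keeps the d choices distinct.

```python
import random


def double_hash_choices(f, g, n, d):
    """Return the d bin choices (f + k*g) mod n for k = 0..d-1."""
    return [(f + k * g) % n for k in range(d)]


def max_load(n_items, n_bins, d, use_double_hashing, seed=0):
    """Place each item in the least loaded of its d choices; return the max load."""
    rng = random.Random(seed)
    loads = [0] * n_bins
    for _ in range(n_items):
        if use_double_hashing:
            f = rng.randrange(n_bins)
            g = rng.randrange(1, n_bins)  # nonzero stride (assumes n_bins is prime)
            choices = double_hash_choices(f, g, n_bins, d)
        else:
            # Fully random hashing: d independent choices.
            choices = [rng.randrange(n_bins) for _ in range(d)]
        target = min(choices, key=lambda b: loads[b])
        loads[target] += 1
    return max(loads)
```

Calling `max_load(n, n, d, True)` and `max_load(n, n, d, False)` for the same prime `n` gives a quick way to compare the two schemes empirically; the sketch makes no claim about the resulting numbers.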
Dynamic Space Efficient Hashing
We consider space efficient hash tables that can grow and shrink dynamically and are always highly space efficient, i.e., their space consumption is always close to the lower bound even while growing and when taking into account storage that is only needed temporarily. None of the traditionally used hash tables have this property. We show how known approaches like linear probing and bucket cuckoo hashing can be adapted to this scenario by subdividing them into many subtables or using virtual memory overcommitting. However, these rather straightforward solutions suffer from slow amortized insertion times due to frequent reallocation in small increments.
Our main result is DySECT (Dynamic Space Efficient Cuckoo Table), which avoids these problems. DySECT consists of many subtables which grow by doubling their size. The resulting inhomogeneity in subtable sizes is equalized by the flexibility available in bucket cuckoo hashing, where each element can go to several buckets, each of which contains several cells. Experiments indicate that DySECT works well with load factors up to 98%, with up to 2.7 times better performance than the next best solution.
Towards Application of Cuckoo Filters in Network Security Monitoring
In this paper, we study the feasibility of applying the recently proposed cuckoo filters to improve space efficiency for set membership testing in Network Security Monitoring, focusing on the example of Threat Intelligence matching. We present conceptual insights for the practical application of cuckoo filters and provide a cuckoo filter implementation that allows runtime configuration. To evaluate the practical applicability of cuckoo filters, we integrate our implementation into the Bro Network Security Monitor, compare it to traditional data structures, and conduct a brief operational evaluation. We find that cuckoo filters allow remarkable memory savings, while potential performance trade-offs, caused by introducing false positives, have to be carefully evaluated on a case-by-case basis.
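For readers unfamiliar with the data structure, the following is a minimal sketch of a cuckoo filter using standard partial-key cuckoo hashing. It is not the implementation evaluated in the paper: the class name, parameters, and the use of SHA-256 as the hash are assumptions for illustration, and a real implementation would pack fingerprints into fixed-size arrays rather than Python lists.

```python
import hashlib
import random


class CuckooFilter:
    """Hypothetical cuckoo filter storing short fingerprints in small buckets."""

    def __init__(self, num_buckets=1024, bucket_size=4, fp_bits=8,
                 max_kicks=500, seed=0):
        # A power-of-two bucket count makes the XOR-based alternate index
        # an involution: alt(alt(i, fp), fp) == i.
        assert num_buckets & (num_buckets - 1) == 0
        self.m = num_buckets
        self.bucket_size = bucket_size
        self.fp_mask = (1 << fp_bits) - 1
        self.max_kicks = max_kicks
        self.buckets = [[] for _ in range(num_buckets)]
        self.rng = random.Random(seed)

    @staticmethod
    def _h(data):
        return int.from_bytes(hashlib.sha256(data).digest()[:8], "big")

    def _fingerprint(self, item):
        return self._h(b"fp" + item) & self.fp_mask

    def _alt(self, index, fp):
        # The partner bucket depends only on the current bucket and fingerprint.
        return (index ^ self._h(fp.to_bytes(4, "big"))) % self.m

    def insert(self, item):
        fp = self._fingerprint(item)
        i1 = self._h(item) % self.m
        i2 = self._alt(i1, fp)
        for i in (i1, i2):
            if len(self.buckets[i]) < self.bucket_size:
                self.buckets[i].append(fp)
                return True
        # Both buckets are full: evict a random fingerprint and relocate it.
        i = self.rng.choice((i1, i2))
        for _ in range(self.max_kicks):
            slot = self.rng.randrange(self.bucket_size)
            fp, self.buckets[i][slot] = self.buckets[i][slot], fp
            i = self._alt(i, fp)
            if len(self.buckets[i]) < self.bucket_size:
                self.buckets[i].append(fp)
                return True
        return False  # filter considered full; the last evicted fingerprint is lost

    def contains(self, item):
        fp = self._fingerprint(item)
        i1 = self._h(item) % self.m
        return fp in self.buckets[i1] or fp in self.buckets[self._alt(i1, fp)]
```

Because only fingerprints are stored, `contains` can return true for an item that was never inserted. That is the false-positive trade-off the abstract refers to; shorter fingerprints save memory but raise that rate.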
Quantum attacks on Bitcoin, and how to protect against them
The key cryptographic protocols used to secure the internet and financial
transactions of today are all susceptible to attack by the development of a
sufficiently large quantum computer. One particular area at risk is
cryptocurrencies, a market currently worth over 150 billion USD. We investigate
the risk of Bitcoin, and other cryptocurrencies, to attacks by quantum
computers. We find that the proof-of-work used by Bitcoin is relatively
resistant to substantial speedup by quantum computers in the next 10 years,
mainly because specialized ASIC miners are extremely fast compared to the
estimated clock speed of near-term quantum computers. On the other hand, the
elliptic curve signature scheme used by Bitcoin is much more at risk, and could
be completely broken by a quantum computer as early as 2027, by the most
optimistic estimates. We analyze an alternative proof-of-work called Momentum,
based on finding collisions in a hash function, that is even more resistant to
speedup by a quantum computer. We also review the available post-quantum
signature schemes to see which one would best meet the security and efficiency
requirements of blockchain applications.
Comment: 21 pages, 6 figures. For a rough update on the progress of quantum
devices and prognostications on the time from now to break digital signatures,
see https://www.quantumcryptopocalypse.com/quantum-moores-law
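The collision-finding step behind a proof-of-work such as Momentum can be sketched classically as a birthday search. This is an illustrative sketch only: the function names, the use of SHA-256, the nonce encoding, and the `bits` parameter are assumptions, and any further difficulty check the real scheme imposes is omitted.

```python
import hashlib


def truncated_hash(header, nonce, bits):
    """Top `bits` bits of SHA-256(header || nonce), as an integer."""
    digest = hashlib.sha256(header + nonce.to_bytes(8, "big")).digest()
    return int.from_bytes(digest, "big") >> (256 - bits)


def find_collision(header, bits=20):
    """Return two distinct nonces whose truncated hashes are equal."""
    seen = {}  # truncated hash -> first nonce that produced it
    nonce = 0
    while True:
        h = truncated_hash(header, nonce, bits)
        if h in seen:
            return seen[h], nonce
        seen[h] = nonce
        nonce += 1
```

By the birthday bound, the search needs on the order of 2^(bits/2) hash evaluations and a comparable amount of memory for the table of seen values.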
Scalable Hash Tables
The term scalability has two meanings in this dissertation: it means
taking the best possible advantage of the provided resources (both computational
and memory resources), and it also means scaling data structures in the literal sense,
i.e., growing the capacity by “rescaling” the table.
Scaling well to computational resources implies constructing the fastest,
best-performing algorithms and data structures. On today’s many-core machines, the best
performance is immediately associated with parallelism. Since CPU frequencies
stopped growing about 10-15 years ago, parallelism is the only way to take
advantage of growing computational resources. But for data structures in general, and
hash tables in particular, performance is not only linked to faster computations.
Most execution time is actually spent waiting for memory. Thus, optimizing data
structures to reduce the number of memory accesses or to take better advantage of
the memory hierarchy, especially through predictable access patterns and
prefetching, is just as important.
In terms of scaling the size of hash tables, we have identified three domains where
scaling of hash-based data structures has previously been lacking, i.e., space-efficient
growing, concurrent hash tables, and Approximate Membership Query data
structures (AMQ-filters). Throughout this dissertation, we describe the problems
in these areas and develop efficient solutions. We highlight three different libraries
that we have developed over the course of this dissertation, each containing
multiple implementations that have proven throughout our testing to be among the
best implementations in their respective domains. Together, they offer
a comprehensive toolbox that can be used to solve many kinds of hashing-related
problems or to develop individual solutions for further ones.
DySECT is a library for space-efficient hash tables, specifically growing
space-efficient hash tables that scale with their input size. It contains the namesake DySECT
data structure in addition to a number of different probing- and cuckoo-based
implementations. Growt is a library for highly efficient concurrent hash tables. It
contains a very fast base table and a number of extensions to adapt this table to
match any purpose. All extensions can be combined to create a variety of different
interfaces. In our extensive experimental evaluation, each adaptation has proven
to be among the best hash tables for its specific purpose. Lpqfilter is a library
for concurrent approximate membership query (AMQ) data structures. It contains
some original data structures, like the linear probing quotient filter, as well as some
novel approaches to dynamically sized quotient filters.