109 research outputs found
Fast and Simple Compact Hashing via Bucketing
Compact hash tables store a set S of n key-value pairs, where the keys are from the universe U = {0, ..., u - 1}, and the values are v-bit integers, in close to B(u, n) + nv bits of space, where B(u, n) = log2 ((u)(n)) is the information-theoretic lower bound for representing the set of keys in S, and support operations insert, delete and lookup on S. Compact hash tables have received significant attention in recent years, and approaches dating back to Cleary [IEEE T. Comput, 1984], as well as more recent ones have been implemented and used in a number of applications. However, the wins on space usage of these approaches are outweighed by their slowness relative to conventional hash tables. In this paper, we demonstrate that compact hash tables based upon a simple idea of bucketing practically outperform existing compact hash table implementations in terms of memory usage and construction time, and existing fast hash table implementations in terms of memory usage (and sometimes also in terms of construction time), while having competitive query times. A related notion is that of a compact hash ID map, which stores a set (S) over cap of n keys from U, and implicitly associates each key in (S) over cap with a unique value (its ID), chosen by the data structure itself, which is an integer of magnitude O(n), and supports inserts and lookups on S, while using space close to B(u, n) bits. One of our approaches is suitable for use as a compact hash ID map.Peer reviewe
DeepLSH: Deep Locality-Sensitive Hash Learning for Fast and Efficient Near-Duplicate Crash Report Detection
Automatic crash bucketing is a crucial phase in the software development
process for efficiently triaging bug reports. It generally consists in grouping
similar reports through clustering techniques. However, with real-time
streaming bug collection, systems are needed to quickly answer the question:
What are the most similar bugs to a new one?, that is, efficiently find
near-duplicates. It is thus natural to consider nearest neighbors search to
tackle this problem and especially the well-known locality-sensitive hashing
(LSH) to deal with large datasets due to its sublinear performance and
theoretical guarantees on the similarity search accuracy. Surprisingly, LSH has
not been considered in the crash bucketing literature. It is indeed not trivial
to derive hash functions that satisfy the so-called locality-sensitive property
for the most advanced crash bucketing metrics. Consequently, we study in this
paper how to leverage LSH for this task. To be able to consider the most
relevant metrics used in the literature, we introduce DeepLSH, a Siamese DNN
architecture with an original loss function, that perfectly approximates the
locality-sensitivity property even for Jaccard and Cosine metrics for which
exact LSH solutions exist. We support this claim with a series of experiments
on an original dataset, which we make available
Succinct Indexable Dictionaries with Applications to Encoding -ary Trees, Prefix Sums and Multisets
We consider the {\it indexable dictionary} problem, which consists of storing
a set for some integer , while supporting the
operations of \Rank(x), which returns the number of elements in that are
less than if , and -1 otherwise; and \Select(i) which returns
the -th smallest element in . We give a data structure that supports both
operations in O(1) time on the RAM model and requires bits to store a set of size , where {\cal B}(n,m) = \ceil{\lg
{m \choose n}} is the minimum number of bits required to store any -element
subset from a universe of size . Previous dictionaries taking this space
only supported (yes/no) membership queries in O(1) time. In the cell probe
model we can remove the additive term in the space bound,
answering a question raised by Fich and Miltersen, and Pagh.
We present extensions and applications of our indexable dictionary data
structure, including:
An information-theoretically optimal representation of a -ary cardinal
tree that supports standard operations in constant time,
A representation of a multiset of size from in bits that supports (appropriate generalizations of) \Rank
and \Select operations in constant time, and
A representation of a sequence of non-negative integers summing up to
in bits that supports prefix sum queries in constant
time.Comment: Final version of SODA 2002 paper; supersedes Leicester Tech report
2002/1
Learned Monotone Minimal Perfect Hashing
A Monotone Minimal Perfect Hash Function (MMPHF) constructed on a set S of
keys is a function that maps each key in S to its rank. On keys not in S, the
function returns an arbitrary value. Applications range from databases, search
engines, data encryption, to pattern-matching algorithms.
In this paper, we describe LeMonHash, a new technique for constructing MMPHFs
for integers. The core idea of LeMonHash is surprisingly simple and effective:
we learn a monotone mapping from keys to their rank via an error-bounded
piecewise linear model (the PGM-index), and then we solve the collisions that
might arise among keys mapping to the same rank estimate by associating small
integers with them in a retrieval data structure (BuRR). On synthetic random
datasets, LeMonHash needs 35% less space than the next best competitor, while
achieving about 16 times faster queries. On real-world datasets, the space
usage is very close to or much better than the best competitors, while
achieving up to 19 times faster queries than the next larger competitor. As far
as the construction of LeMonHash is concerned, we get an improvement by a
factor of up to 2, compared to the competitor with the next best space usage.
We also investigate the case of keys being variable-length strings,
introducing the so-called LeMonHash-VL: it needs space within 10% of the best
competitors while achieving up to 3 times faster queries
Learned Monotone Minimal Perfect Hashing
A Monotone Minimal Perfect Hash Function (MMPHF) constructed on a set S of keys is a function that maps each key in S to its rank. On keys not in S, the function returns an arbitrary value. Applications range from databases, search engines, data encryption, to pattern-matching algorithms.
In this paper, we describe LeMonHash, a new technique for constructing MMPHFs for integers. The core idea of LeMonHash is surprisingly simple and effective: we learn a monotone mapping from keys to their rank via an error-bounded piecewise linear model (the PGM-index), and then we solve the collisions that might arise among keys mapping to the same rank estimate by associating small integers with them in a retrieval data structure (BuRR). On synthetic random datasets, LeMonHash needs 34% less space than the next larger competitor, while achieving about 16 times faster queries. On real-world datasets, the space usage is very close to or much better than the best competitors, while achieving up to 19 times faster queries than the next larger competitor. As far as the construction of LeMonHash is concerned, we get an improvement by a factor of up to 2, compared to the competitor with the next best space usage.
We also investigate the case of keys being variable-length strings, introducing the so-called LeMonHash-VL: it needs space within 13% of the best competitors while achieving up to 3 times faster queries than the next larger competitor
Learned Monotone Minimal Perfect Hashing
A Monotone Minimal Perfect Hash Function (MMPHF) constructed on a set S of keys is a function that maps each key in S to its rank. On keys not in S, the function returns an arbitrary value. Applications range from databases, search engines, data encryption, to pattern-matching algorithms.
In this paper, we describe LeMonHash, a new technique for constructing MMPHFs for integers. The core idea of LeMonHash is surprisingly simple and effective: we learn a monotone mapping from keys to their rank via an error-bounded piecewise linear model (the PGM-index), and then we solve the collisions that might arise among keys mapping to the same rank estimate by associating small integers with them in a retrieval data structure (BuRR). On synthetic random datasets, LeMonHash needs 34% less space than the next larger competitor, while achieving about 16 times faster queries. On real-world datasets, the space usage is very close to or much better than the best competitors, while achieving up to 19 times faster queries than the next larger competitor. As far as the construction of LeMonHash is concerned, we get an improvement by a factor of up to 2, compared to the competitor with the next best space usage.
We also investigate the case of keys being variable-length strings, introducing the so-called LeMonHash-VL: it needs space within 13% of the best competitors while achieving up to 3 times faster queries than the next larger competitor
- …