Using cascading Bloom filters to improve the memory usage for de Bruijn graphs
De Bruijn graphs are widely used in bioinformatics for processing
next-generation sequencing (NGS) data. Due to the very large size of NGS datasets, it
is essential to represent de Bruijn graphs compactly, and several approaches to
this problem have been proposed recently. In this work, we show how to reduce
the memory required by the algorithm of [3] that represents de Bruijn graphs
using Bloom filters. Our method requires 30% to 40% less memory than
the method of [3], with insignificant impact on construction time. At the same
time, our experiments showed a better query time compared to [3]. This is, to
our knowledge, the best practical representation for de Bruijn graphs. Comment: 12 pages, submitte
Fast Succinct Retrieval and Approximate Membership Using Ribbon
A retrieval data structure for a static function f: S → {0,1}^r supports queries that return f(x) for any x ∈ S. Retrieval data structures can be used to implement a static approximate membership query data structure (AMQ), i.e., a Bloom filter alternative, with false positive rate 2^{-r}. The information-theoretic lower bound for both tasks is r|S| bits. While succinct theoretical constructions using (1+o(1))r|S| bits were known, these could not achieve very small overheads in practice because they have an unfavorable space-time tradeoff hidden in the asymptotic costs or because small overheads would only be reached for physically impossible input sizes. With bumped ribbon retrieval (BuRR), we present the first practical succinct retrieval data structure. In an extensive experimental evaluation BuRR achieves space overheads well below 1% while being faster than most previously used retrieval data structures (typically with space overheads at least an order of magnitude larger) and faster than classical Bloom filters (with space overhead ≥ 44%). This efficiency, including favorable constants, stems from a combination of simplicity, word parallelism, and high locality.
We additionally describe homogeneous ribbon filter AMQs, which are even simpler and faster at the price of slightly larger space overhead.
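To put the quoted overheads in perspective, the short sketch below compares the r|S| lower bound with the space of a classical Bloom filter tuned to the same false positive rate 2^{-r}. It assumes the textbook result that an optimally configured Bloom filter needs r / ln 2 bits per key. The function names and the sample values of r and |S| are illustrative and are not taken from the paper.

```python
import math

def lower_bound_bits(num_keys, r):
    """Information-theoretic lower bound: r bits per key."""
    return num_keys * r

def bloom_filter_bits(num_keys, r):
    """Space of an optimally tuned classical Bloom filter with
    false positive rate 2^-r (textbook formula: r / ln 2 bits per key)."""
    return num_keys * r / math.log(2)

def overhead(actual_bits, num_keys, r):
    """Relative space overhead over the r|S| lower bound."""
    return actual_bits / lower_bound_bits(num_keys, r) - 1.0

# Illustrative values (assumptions, not from the paper).
num_keys, r = 1_000_000, 8
bloom_overhead = overhead(bloom_filter_bits(num_keys, r), num_keys, r)
print(f"Bloom filter overhead: {bloom_overhead:.1%}")  # about 44%
```

The overhead is independent of both r and |S|, since it equals 1/ln 2 - 1, which matches the "≥ 44%" figure quoted for classical Bloom filters.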
Fast Partitioned Learned Bloom Filter
A Bloom filter is a memory-efficient data structure for approximate
membership queries used in numerous fields of computer science. Recently,
learned Bloom filters that achieve better memory efficiency using machine
learning models have attracted attention. One such filter, the partitioned
learned Bloom filter (PLBF), achieves excellent memory efficiency. However,
PLBF requires a time complexity of O(N^3 k) to construct the data structure,
where N and k are the hyperparameters of PLBF. One can improve memory
efficiency by increasing N, but the construction time becomes extremely long.
Thus, we propose two methods that can reduce the construction time while
maintaining the memory efficiency of PLBF. First, we propose fast PLBF, which
can construct the same data structure as PLBF with a smaller time complexity
of O(N^2 k). Second, we propose fast PLBF++, which can construct the data
structure with even smaller time complexity O(N k log N + N k^2). Fast PLBF++
does not necessarily construct the same data structure as PLBF. Still, it is
almost as memory efficient as PLBF, and it is proved that fast PLBF++ has the
same data structure as PLBF when the distribution satisfies a certain
constraint. Our experimental results from real-world datasets show that (i)
fast PLBF and fast PLBF++ can construct the data structure up to 233 and 761
times faster than PLBF, (ii) fast PLBF can achieve the same memory efficiency
as PLBF, and (iii) fast PLBF++ can achieve almost the same memory efficiency as
PLBF. Comment: NeurIPS 202
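For reference, the classical (non-learned) Bloom filter that the abstract above takes as its starting point can be sketched as follows. This is not PLBF or fast PLBF. The class name, the SHA-256 based double hashing, and the parameter values are assumptions made for illustration only.

```python
import hashlib

class BloomFilter:
    """Minimal classical Bloom filter: no false negatives, some false positives."""

    def __init__(self, num_bits, num_hashes):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, item):
        # Derive all probe positions from one SHA-256 digest via double
        # hashing (an illustrative choice, not prescribed by the source).
        digest = hashlib.sha256(item.encode("utf-8")).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1  # force an odd step
        return [(h1 + i * h2) % self.num_bits for i in range(self.num_hashes)]

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item):
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(item))

# Usage: every inserted key is reported as present.
bf = BloomFilter(num_bits=10_000, num_hashes=7)
keys = [f"key-{i}" for i in range(500)]
for key in keys:
    bf.add(key)
print(all(key in bf for key in keys))
```

A query for a key that was never inserted may still return True. That false positive rate is the quantity that learned variants such as PLBF try to reduce for a given memory budget.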
Efficient Gauss Elimination for Near-Quadratic Matrices with One Short Random Block per Row, with Applications
In this paper we identify a new class of sparse near-quadratic random Boolean matrices that have full row rank over F_2 = {0,1} with high probability and can be transformed into echelon form in almost linear time by a simple version of Gauss elimination. The random matrix with dimensions n(1-epsilon) x n is generated as follows: In each row, identify a block of length L = O((log n)/epsilon) at a random position. The entries outside the block are 0, the entries inside the block are given by fair coin tosses. Sorting the rows according to the positions of the blocks transforms the matrix into a kind of band matrix, on which, as it turns out, Gauss elimination works very efficiently with high probability. For the proof, the effects of Gauss elimination are interpreted as a ("coin-flipping") variant of Robin Hood hashing, whose behaviour can be captured in terms of a simple Markov model from queuing theory. Bounds for expected construction time and high success probability follow from results in this area. They readily extend to larger finite fields in place of F_2.
By employing hashing, this matrix family leads to a new implementation of a retrieval data structure, which represents an arbitrary function f: S -> {0,1} for some set S of m = (1-epsilon)n keys. It requires m/(1-epsilon) bits of space, construction takes O(m/epsilon^2) expected time on a word RAM, while queries take O(1/epsilon) time and access only one contiguous segment of O((log m)/epsilon) bits in the representation (O(1/epsilon) consecutive words on a word RAM). The method is readily implemented and highly practical, and it is competitive with state-of-the-art methods. In a more theoretical variant, which works only for unrealistically large S, we can even achieve construction time O(m/epsilon) and query time O(1), accessing O(1) contiguous memory words for a query. By well-established methods the retrieval data structure leads to efficient constructions of (static) perfect hash functions and (static) Bloom filters with almost optimal space and very local storage access patterns for queries.
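The matrix model described above (one random block of length L per row, rows sorted by block position, elimination over F_2) can be sketched as below. This is a minimal illustration under stated assumptions, not the authors' implementation: rows are stored as Python integers, the elimination is a generic XOR-based rank computation, and the function names, the block length 64, and the sizes n and epsilon are hypothetical choices.

```python
import random

def random_block_rows(num_rows, num_cols, block_len, rng):
    """Generate rows as (start, bits): a block of block_len fair coin
    tosses placed at a random column offset, zeros elsewhere."""
    rows = []
    for _ in range(num_rows):
        start = rng.randrange(num_cols - block_len + 1)
        bits = rng.getrandbits(block_len)
        rows.append((start, bits))
    return rows

def rank_gf2(rows):
    """Rank over F_2 of the matrix given by (start, bits) rows.
    Rows are processed in order of block position; bit i of the
    integer `row` stands for column i."""
    pivots = {}  # pivot column -> reduced row whose lowest set bit is that column
    for start, bits in sorted(rows):
        row = bits << start
        while row:
            col = (row & -row).bit_length() - 1  # index of lowest set bit
            if col not in pivots:
                pivots[col] = row
                break
            row ^= pivots[col]  # eliminate the leading entry
    return len(pivots)

# Illustrative parameters (assumptions): n columns, m = (1 - eps) * n rows.
rng = random.Random(0)
n, eps, block_len = 2000, 0.1, 64
m = int(n * (1 - eps))
rows = random_block_rows(m, n, block_len, rng)
print(rank_gf2(rows), "of", m, "rows are linearly independent")
```

Full row rank corresponds to the printed rank being equal to m. According to the abstract this happens with high probability when L = O((log n)/epsilon), which is why the sketch leaves the block length as a tunable parameter.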
Neural Distributed Autoassociative Memories: A Survey
Introduction. Neural network models of autoassociative, distributed memory
allow storage and retrieval of many items (vectors) where the number of stored
items can exceed the vector dimension (the number of neurons in the network).
This opens the possibility of a sublinear time search (in the number of stored
items) for approximate nearest neighbors among vectors of high dimension. The
purpose of this paper is to review models of autoassociative, distributed
memory that can be naturally implemented by neural networks (mainly with local
learning rules and iterative dynamics based on information locally available to
neurons). Scope. The survey is focused mainly on the networks of Hopfield,
Willshaw and Potts, that have connections between pairs of neurons and operate
on sparse binary vectors. We discuss not only autoassociative memory, but also
the generalization properties of these networks. We also consider neural
networks with higher-order connections and networks with a bipartite graph
structure for non-binary data with linear constraints. Conclusions. In
conclusion we discuss the relations to similarity search, advantages and
drawbacks of these techniques, and topics for further research. An interesting
and still not completely resolved question is whether neural autoassociative
memories can search for approximate nearest neighbors faster than other index
structures for similarity search, in particular for the case of very high
dimensional vectors. Comment: 31 page
Space-efficient data sketching algorithms for network applications
Sketching techniques are widely adopted in network applications. Sketching algorithms “encode” data into succinct data structures that can later be accessed and “decoded” for various purposes, such as network measurement, accounting, and anomaly detection. Bloom filters and counter braids are two well-known representatives in this category. These sketching algorithms usually need to strike a tradeoff between performance (how much information can be revealed and how fast) and cost (storage, transmission and computation). This dissertation is dedicated to the
research and development of several sketching techniques including improved forms of stateful Bloom Filters, Statistical Counter Arrays and Error Estimating Codes. A Bloom filter is a space-efficient randomized data structure for approximately representing a set in order to support membership queries. The Bloom filter and its variants have found widespread use in many networking applications, where it is important to minimize the cost of storing and communicating network data. In this thesis, we propose a family of Bloom filter variants augmented by a rank-indexing method. We will show that such augmentation can bring a significant reduction in space and in the number of memory accesses, especially when deletions of set elements from the Bloom filter need to be supported. An exact active counter array is another important building block in many sketching algorithms, where the storage cost of the array is of paramount concern. Previous approaches reduce the storage costs while either losing accuracy or supporting only passive measurements. In this thesis, we propose an exact statistics counter array architecture that can support active measurements (real-time read and write). It also leverages the aforementioned rank-indexing method and exploits statistical multiplexing to minimize the storage
costs of the counter array. Error estimating coding (EEC) has recently been established as an important tool to estimate bit error rates in the transmission of packets over wireless links. In essence, the EEC problem is also a sketching problem, since the EEC codes can be viewed as a sketch of the packet sent, which is decoded by the receiver to estimate the bit error rate. In this thesis, we will first investigate the asymptotic bound of error estimating coding by viewing the problem from a two-party computation perspective and then investigate its coding/decoding efficiency using Fisher information analysis. Further, we develop several sketching techniques including the Enhanced tug-of-war (EToW) sketch and the generalized EEC (gEEC) sketch family, which can achieve around 70% reduction of sketch size with similar estimation accuracies. For all solutions proposed above, we will use theoretical tools such as information theory and communication complexity to investigate how far our proposed solutions are from the theoretical optimum. We will show that the proposed techniques are asymptotically or empirically very close to the theoretical bounds. PhD. Committee Chair: Xu, Jun; Committee Member: Feamster, Nick; Committee Member: Li, Baochun; Committee Member: Romberg, Justin; Committee Member: Zegura, Ellen W
The IPAC Image Subtraction and Discovery Pipeline for the intermediate Palomar Transient Factory
We describe the near real-time transient-source discovery engine for the
intermediate Palomar Transient Factory (iPTF), currently in operation at the
Infrared Processing and Analysis Center (IPAC), Caltech. We call this system
the IPAC/iPTF Discovery Engine (or IDE). We review the algorithms used for
PSF-matching, image subtraction, detection, photometry, and machine-learned
(ML) vetting of extracted transient candidates. We also review the performance
of our ML classifier. For a limiting signal-to-noise ratio of 4 in relatively
unconfused regions, "bogus" candidates from processing artifacts and imperfect
image subtractions outnumber real transients by ~ 10:1. This can be
considerably higher for image data with inaccurate astrometric and/or
PSF-matching solutions. Despite this occasionally high contamination rate, the
ML classifier is able to identify real transients with an efficiency (or
completeness) of ~ 97% for a maximum tolerable false-positive rate of 1% when
classifying raw candidates. All subtraction-image metrics, source features, ML
probability-based real-bogus scores, contextual metadata from other surveys,
and possible associations with known Solar System objects are stored in a
relational database for retrieval by the various science working groups. We
review our efforts in mitigating false-positives and our experience in
optimizing the overall system in response to the multitude of science projects
underway with iPTF. Comment: 66 pages, 21 figures, 7 tables, accepted by PAS