A Case for Partitioned Bloom Filters
In a partitioned Bloom filter the bit vector is split into disjoint, equally
sized parts, one per hash function. In contrast to hardware designs, where
they prevail, software implementations mostly adopt standard Bloom filters,
considering partitioned filters marginally worse due to their slightly larger
false positive rate (FPR). In this paper, by performing an in-depth analysis,
first we show that the FPR advantage of standard Bloom filters is smaller than
thought; more importantly, by studying the per-element FPR, we show that
standard Bloom filters have weak spots in the domain: elements which will be
tested as false positives much more frequently than expected. This is relevant
in scenarios where an element is tested against many filters, e.g., in packet
forwarding. Moreover, standard Bloom filters are prone to exhibit extremely
weak spots if naive double hashing is used, something occurring in several,
even mainstream, libraries. Partitioned Bloom filters exhibit a uniform
distribution of the FPR over the domain and are robust to the naive use of
double hashing, having no weak spots. Finally, by surveying several usages
other than testing set membership, we point out the many advantages of having
disjoint parts: they can be individually sampled, extracted, added or retired,
leading to superior designs for, e.g., SIMD usage, size reduction, test of set
disjointness, or duplicate detection in streams. Partitioned Bloom filters are
better, and should replace the standard form, both in general purpose libraries
and as the base for novel designs.
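As an illustration of the structure this abstract argues for, here is a minimal Python sketch of a partitioned Bloom filter: the bit vector is split into k disjoint, equally sized parts and each hash function is confined to its own part. Class and parameter names are illustrative, not taken from the paper.

```python
import hashlib

class PartitionedBloomFilter:
    """Minimal sketch: m bits split into k disjoint, equally sized parts;
    hash function i sets/tests exactly one bit inside part i."""

    def __init__(self, total_bits: int = 8192, num_hashes: int = 4):
        assert total_bits % num_hashes == 0
        self.k = num_hashes
        self.part_size = total_bits // num_hashes
        self.parts = [[False] * self.part_size for _ in range(self.k)]

    def _positions(self, item: bytes):
        # one independent position per partition, derived by prefixing the
        # partition index to the hashed data
        for i in range(self.k):
            digest = hashlib.blake2b(bytes([i]) + item, digest_size=8).digest()
            yield i, int.from_bytes(digest, "big") % self.part_size

    def add(self, item: bytes) -> None:
        for part, pos in self._positions(item):
            self.parts[part][pos] = True

    def __contains__(self, item: bytes) -> bool:
        return all(self.parts[part][pos] for part, pos in self._positions(item))
```

Because the parts are disjoint, any single part can be sampled, extracted, added or retired on its own, which is what enables the SIMD, size-reduction, set-disjointness and duplicate-detection uses mentioned above.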
Retouched Bloom Filters: Allowing Networked Applications to Flexibly Trade Off False Positives Against False Negatives
Where distributed agents must share voluminous set membership information,
Bloom filters provide a compact, though lossy, way for them to do so. Numerous
recent networking papers have examined the trade-offs between the bandwidth
consumed by the transmission of Bloom filters, and the error rate, which takes
the form of false positives, and which rises the more the filters are
compressed. In this paper, we introduce the retouched Bloom filter (RBF), an
extension that makes the Bloom filter more flexible by permitting the removal
of selected false positives at the expense of generating random false
negatives. We analytically show that RBFs created through a random process
maintain an overall error rate, expressed as a combination of the false
positive rate and the false negative rate, that is equal to the false positive
rate of the corresponding Bloom filters. We further provide some simple
heuristics and improved algorithms that decrease the false positive rate more
than the corresponding increase in the false negative rate when creating
RBFs. Finally, we demonstrate the advantages of an RBF over a Bloom filter in a
distributed network topology measurement application, where information about
large stop sets must be shared among route tracing monitors.
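A hedged sketch of the retouching idea described above: starting from a standard Bloom filter, an unwanted false positive is removed by clearing one of the bits it maps to, at the cost of possible false negatives. The bit-selection heuristics analysed in the paper are omitted; the naive rule below (clear the first set bit) is purely illustrative.

```python
import hashlib

class RetouchedBloomFilter:
    """Standard Bloom filter plus a retouching step that clears bits to
    suppress selected false positives (may introduce false negatives)."""

    def __init__(self, num_bits: int = 8192, num_hashes: int = 4):
        self.m, self.k = num_bits, num_hashes
        self.bits = [False] * num_bits

    def _positions(self, item: bytes):
        for i in range(self.k):
            digest = hashlib.blake2b(bytes([i]) + item, digest_size=8).digest()
            yield int.from_bytes(digest, "big") % self.m

    def add(self, item: bytes) -> None:
        for pos in self._positions(item):
            self.bits[pos] = True

    def __contains__(self, item: bytes) -> bool:
        return all(self.bits[pos] for pos in self._positions(item))

    def remove_false_positive(self, item: bytes) -> None:
        # clearing any one of the item's bits makes it test negative;
        # smarter selection rules reduce the false negatives this creates
        for pos in self._positions(item):
            if self.bits[pos]:
                self.bits[pos] = False
                return
```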
Towards Lifelong Reasoning with Sparse and Compressive Memory Systems
Humans have a remarkable ability to remember information over long time horizons. When reading a book, we build up a compressed representation of the past narrative, such as the characters and events that have shaped the story so far. We can do this even if they are separated from the current text by thousands of words, or by long stretches of time between readings. During our lives, we build up and retain memories that tell us where we live, what we have experienced, and who we are. Adding memory to artificial neural networks has been transformative in machine learning, allowing models to extract structure from temporal data and to model the future more accurately. However, the capacity for long-range reasoning in current memory-augmented neural networks is considerably limited in comparison to humans, despite access to powerful modern computers. This thesis explores two prominent approaches towards scaling artificial memories to lifelong capacity: sparse access and compressive memory structures. With sparse access, only a very small subset of pertinent memory is inspected, retrieved, and updated. Sparse memory access is found to be beneficial for learning, allowing for improved data efficiency and improved generalisation. From a computational perspective, sparsity allows scaling to memories with millions of entities on a simple CPU-based machine. Memory systems that compress the past to a smaller set of representations are shown to reduce redundancy, to speed up the learning of rare classes, and to improve upon classical data structures in database systems. Compressive memory architectures are also devised for sequence prediction tasks and are observed to significantly advance the state of the art in modelling natural language.
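As a rough illustration of the sparse-access idea (a generic sketch, not the specific architecture from the thesis), the snippet below scores a query against all memory slots but reads from only the top-k of them; `memory`, `query` and `k` are assumed names.

```python
import numpy as np

def sparse_read(memory: np.ndarray, query: np.ndarray, k: int = 8) -> np.ndarray:
    """Attend over only the k most relevant memory slots (k <= num_slots).

    memory: (num_slots, dim) matrix of stored representations
    query:  (dim,) vector to match against memory
    """
    scores = memory @ query                      # similarity per slot
    top = np.argpartition(scores, -k)[-k:]       # indices of the k best slots
    weights = np.exp(scores[top] - scores[top].max())
    weights /= weights.sum()                     # softmax over the k slots only
    return weights @ memory[top]                 # sparse weighted read vector
```

Only the k selected slots need to be touched on each read or update, which is what allows such memories to scale to millions of entries on a CPU-based machine.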
Efficient algorithms for passive network measurement
Network monitoring has become a necessity to aid in the management and operation of large networks. Passive network monitoring consists of extracting metrics (or any information of interest) by analyzing the traffic that traverses one or more network links. Extracting information from a high-speed network link is challenging, given the great data volumes and short packet inter-arrival times. These difficulties can be alleviated by using extremely efficient algorithms or by sampling the incoming traffic. This work improves the state of the art in both these approaches.
For one-way packet delay measurement, we propose a series of improvements over a recently proposed technique called the Lossy Difference Aggregator. A main limitation of this technique is that it does not provide per-flow measurements. We propose a data structure called the Lossy Difference Sketch that is capable of providing such per-flow delay measurements and, unlike recent related works, does not rely on any model of packet delays.
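For context, a minimal Python sketch of the baseline Lossy Difference Aggregator idea (not the per-flow Lossy Difference Sketch proposed here): sender and receiver keep identical bucket arrays of timestamp sums and packet counters, and only buckets whose counters match on both sides, i.e. buckets unaffected by loss, contribute to the delay estimate.

```python
import hashlib

class LdaEndpoint:
    """One endpoint (sender or receiver) of a Lossy Difference Aggregator."""

    def __init__(self, num_buckets: int = 1024):
        self.m = num_buckets
        self.sums = [0.0] * num_buckets    # sum of packet timestamps per bucket
        self.counts = [0] * num_buckets    # packets hashed to each bucket

    def record(self, packet_id: bytes, timestamp: float) -> None:
        b = int.from_bytes(hashlib.blake2b(packet_id, digest_size=8).digest(),
                           "big") % self.m
        self.sums[b] += timestamp
        self.counts[b] += 1

def average_delay(sender: LdaEndpoint, receiver: LdaEndpoint) -> float:
    """Estimate mean one-way delay from buckets untouched by packet loss."""
    delay_sum, packets = 0.0, 0
    for ss, sc, rs, rc in zip(sender.sums, sender.counts,
                              receiver.sums, receiver.counts):
        if sc == rc and sc > 0:            # bucket saw no loss
            delay_sum += rs - ss
            packets += sc
    return delay_sum / packets if packets else float("nan")
```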
In the problem of collecting measurements under the sliding-window model, we focus on the estimation of the number of active flows and on traffic filtering. Using a common approach, we propose one algorithm for each problem that achieves high accuracy with significant resource savings.
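One generic way to estimate active flows under a sliding window (not necessarily the algorithm proposed in this work) is a timestamp vector: each bucket keeps the last time any flow hashing to it was seen, and the number of recently touched buckets feeds a linear-counting estimate.

```python
import hashlib
import math

class TimestampVector:
    """Sliding-window active-flow estimator (illustrative sketch)."""

    def __init__(self, num_buckets: int = 4096, window: float = 60.0):
        self.m = num_buckets
        self.window = window
        self.last_seen = [float("-inf")] * num_buckets

    def observe(self, flow_key: bytes, now: float) -> None:
        b = int.from_bytes(hashlib.blake2b(flow_key, digest_size=8).digest(),
                           "big") % self.m
        self.last_seen[b] = now

    def active_flows(self, now: float) -> float:
        # buckets not updated within the window count as empty
        empty = sum(1 for t in self.last_seen if now - t > self.window)
        if empty == 0:
            return float("inf")            # table saturated; use more buckets
        return self.m * math.log(self.m / empty)   # linear-counting estimate
```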
In the traffic sampling area, the selection of the sampling rate is a crucial aspect. The most sensible approach involves dynamically adjusting sampling rates according to network traffic conditions, which is known as adaptive sampling. We propose an algorithm called Cuckoo Sampling that can operate with a fixed memory budget and perform adaptive flow-wise packet sampling. It is based on a very simple data structure and is computationally extremely lightweight.
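Since the abstract does not describe Cuckoo Sampling's internals, the sketch below shows only the generic idea of flow-wise adaptive sampling under a fixed memory budget: a flow is selected iff the hash of its key falls below a threshold (so all of its packets are kept), and the threshold is halved whenever the flow table outgrows its budget.

```python
import hashlib

class FlowWiseSampler:
    """Hash-threshold flow sampling with a fixed flow-table budget
    (illustrative only; not the Cuckoo Sampling algorithm itself)."""

    def __init__(self, max_flows: int = 10000, initial_rate: float = 1.0):
        self.max_flows = max_flows
        self.rate = initial_rate           # current sampling probability
        self.flows = {}                    # flow_key -> sampled packet count

    def _flow_hash(self, flow_key: bytes) -> float:
        h = int.from_bytes(hashlib.blake2b(flow_key, digest_size=8).digest(), "big")
        return h / 2 ** 64                 # uniform value in [0, 1)

    def process(self, flow_key: bytes) -> bool:
        """Return True if the packet's flow is currently being sampled."""
        if self._flow_hash(flow_key) >= self.rate:
            return False                   # flow not selected at current rate
        self.flows[flow_key] = self.flows.get(flow_key, 0) + 1
        if len(self.flows) > self.max_flows:
            self.rate /= 2                 # adapt to the traffic load
            self.flows = {k: v for k, v in self.flows.items()
                          if self._flow_hash(k) < self.rate}
        return True
```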
The techniques presented in this work are thoroughly evaluated through a combination of theoretical and experimental analysis.