
    A Case for Partitioned Bloom Filters

    In a partitioned Bloom filter the m-bit vector is split into k disjoint parts of size m/k, one per hash function. Contrary to hardware designs, where they prevail, software implementations mostly adopt standard Bloom filters, considering partitioned filters slightly worse due to their marginally larger false positive rate (FPR). In this paper, by performing an in-depth analysis, we first show that the FPR advantage of standard Bloom filters is smaller than commonly thought; more importantly, by studying the per-element FPR, we show that standard Bloom filters have weak spots in the domain: elements that will be tested as false positives much more frequently than expected. This is relevant in scenarios where an element is tested against many filters, e.g., in packet forwarding. Moreover, standard Bloom filters are prone to exhibit extremely weak spots if naive double hashing is used, something that occurs in several, even mainstream, libraries. Partitioned Bloom filters exhibit a uniform distribution of the FPR over the domain and are robust to the naive use of double hashing, having no weak spots. Finally, by surveying several usages other than testing set membership, we point out the many advantages of having disjoint parts: they can be individually sampled, extracted, added, or retired, leading to superior designs for, e.g., SIMD usage, size reduction, tests of set disjointness, or duplicate detection in streams. Partitioned Bloom filters are better and should replace the standard form, both in general-purpose libraries and as the base for novel designs.
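
    The partitioned layout is simple to implement. Below is a minimal illustrative sketch, not the paper's reference code: the class name, the use of SHA-256, and the double-hashing scheme h_i(x) = h1(x) + i*h2(x) are assumptions made here for concreteness.

```python
# Illustrative sketch of a partitioned Bloom filter; not the paper's code.
import hashlib


class PartitionedBloomFilter:
    def __init__(self, m: int, k: int):
        self.k = k
        self.part_size = m // k      # each hash function owns its own m/k bits
        self.parts = [0] * k         # one bit vector (stored as an int) per part

    def _indexes(self, item: str):
        # Double hashing, h_i = h1 + i*h2 (mod m/k), derived from one digest.
        digest = hashlib.sha256(item.encode()).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1   # force odd to avoid h2 = 0
        return [(h1 + i * h2) % self.part_size for i in range(self.k)]

    def add(self, item: str):
        for i, idx in enumerate(self._indexes(item)):
            self.parts[i] |= 1 << idx        # set exactly one bit per disjoint part

    def __contains__(self, item: str) -> bool:
        # Membership requires the bit to be set in *every* part.
        return all(self.parts[i] >> idx & 1
                   for i, idx in enumerate(self._indexes(item)))
```

    Because each hash function owns its own m/k-bit slice, the parts can be sampled, extracted, or retired independently, which is the structural property the paper exploits.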

    Idempotent distributed counters using a Forgetful Bloom Filter

    Distributed key-value stores power the backend of high-performance web services and cloud computing applications. Key-value stores such as Cassandra rely heavily on counters to keep track of the occurrences of various kinds of events. However, today's implementations of counters do not provide exactly-once semantics. A typical scenario is that a client requests a counter increment, times out waiting for a response, and creates a duplicate request, thus resulting in a double increment on the server side. In this thesis, we address this problem by presenting, analyzing, and evaluating a novel server-side data structure called the Forgetful Bloom Filter (FBF). Like a traditional Bloom filter, an FBF is a compact representation of a set of elements (e.g., client requests). However, an FBF is more powerful than a Bloom filter in two aspects: i) it can forget older elements (e.g., requests that are too old to apply), and ii) it is self-adapting under a varying workload. We also present an adaptive variant of FBF that adapts itself to meet a desired false positive rate -- thus the error achieved in the counter can be bounded even as the workload changes. We present experimental results from a prototype implementation of FBFs and discuss the implications for a key-value store such as Cassandra. Our results show that the FBF is highly accurate in maintaining correct counter values.
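
    To make the forgetting mechanism concrete, here is a hedged sketch of approximate membership with expiry via rotating sub-filters; the thesis's actual FBF (its number of constituent filters, membership rule, and self-adaptation policy) is more refined, so the class name and rotation scheme below are illustrative assumptions only.

```python
# Simplified "forgetting" via rotating sub-filters, in the spirit of an FBF.
import hashlib


class RotatingFilter:
    """Approximate set membership that forgets elements older than
    `generations` rotations."""

    def __init__(self, m: int, k: int, generations: int = 3):
        self.m, self.k = m, k                # k <= 8 with a 32-byte digest below
        self.filters = [0] * generations     # newest first, each an m-bit int

    def _indexes(self, item: str):
        digest = hashlib.sha256(item.encode()).digest()
        return [int.from_bytes(digest[4 * i:4 * i + 4], "big") % self.m
                for i in range(self.k)]

    def add(self, item: str):
        for idx in self._indexes(item):
            self.filters[0] |= 1 << idx      # insert into the newest generation

    def __contains__(self, item: str) -> bool:
        idxs = self._indexes(item)
        # Present if any single generation holds all k bits.
        return any(all(f >> i & 1 for i in idxs) for f in self.filters)

    def rotate(self):
        # Called periodically: start a fresh filter, drop the oldest one.
        self.filters = [0] + self.filters[:-1]
```

    A server could then test `request_id in fbf` before applying an increment, deduplicating retried requests within the retention window.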

    Practical Target-Based Synchronization Strategies for Immutable Time-Series Data Tables

    As the Internet of Things and industrial monitoring of utilities grow, efficiently synchronizing immutable time-series data streams between databases becomes a pressing issue. Extracting data from critical production databases demands careful consideration of the stress imposed on the machines, so synchronization strategies are required to minimize the transfer of duplicate data and the load imposed on remote sources. Literature on the synchronization problem is generalized to arbitrary tables and does not consider the characteristics of time-series data streams, so research was required to investigate methods to quickly synchronize source and target time-series data tables. This thesis examines immutable time-series scenarios and synchronization strategies to answer the following question: given several scenarios, which target-based immutable time-series synchronization strategies best optimize run-time, bandwidth, and accuracy? The strategies explored in this research are implemented into the Meerschaum system, a project intended to leverage these time-series concepts for production deployments. As a practical demonstration, these strategies are used to continuously cache Clemson University’s utilities data.
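
    As an illustration of the target-based idea, the sketch below implements the simplest such strategy, fetching only rows newer than the target's latest timestamp; the table and column names are invented here, and Meerschaum's production strategies are more involved.

```python
# Hedged sketch of a simple target-based sync for immutable time-series rows.
import sqlite3


def sync_new_rows(source: sqlite3.Connection, target: sqlite3.Connection,
                  table: str = "readings") -> int:
    """Copy rows from source to target that are newer than the target's
    latest timestamp. Immutability means already-synced rows never change,
    so nothing older needs to be re-checked."""
    latest = target.execute(f"SELECT MAX(ts) FROM {table}").fetchone()[0]
    if latest is None:                   # target is empty: full backfill
        rows = source.execute(f"SELECT ts, value FROM {table}").fetchall()
    else:                                # otherwise fetch only newer rows
        rows = source.execute(
            f"SELECT ts, value FROM {table} WHERE ts > ?", (latest,)
        ).fetchall()
    target.executemany(f"INSERT INTO {table} VALUES (?, ?)", rows)
    target.commit()
    return len(rows)                     # rows transferred this pass
```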