Distinct counting with a self-learning bitmap
Counting the number of distinct elements (cardinality) in a dataset is a
fundamental problem in database management. In recent years, due to many of its
modern applications, there has been significant interest in addressing the
distinct counting problem in a data stream setting, where each incoming data
item can be seen only once and cannot be stored for long periods of time. Many
probabilistic approaches based on either sampling or sketching have been
proposed in the computer science literature that require only limited
computing and memory resources. However, the performance of these methods is
not scale-invariant, in the sense that their relative root mean square
estimation errors (RRMSE) depend on the unknown cardinalities. This is not
desirable in many applications where cardinalities can be very dynamic or
inhomogeneous and many cardinalities need to be estimated. In this paper, we
develop a novel approach, called the self-learning bitmap (S-bitmap), which is
scale-invariant for cardinalities in a specified range. S-bitmap uses a binary
vector whose entries are updated from 0 to 1 by an adaptive sampling process
for inferring the unknown cardinality, where the sampling rates are reduced
sequentially as more and more entries change from 0 to 1. We prove rigorously
that the S-bitmap estimate is not only unbiased but also scale-invariant. We
demonstrate that, to achieve a given small RRMSE, our approach requires
significantly less memory and performs a similar or smaller number of
operations than state-of-the-art methods at many cardinality scales common in
practice. Both simulation and experimental studies are reported. Comment:
Journal of the American Statistical Association (accepted).
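To make the mechanism concrete, the following is a minimal Python sketch of an adaptive-sampling bitmap in the spirit of S-bitmap. It is not the paper's construction: the geometric rate schedule `decay ** filled` and the crude estimator are hypothetical stand-ins for the carefully derived rate sequence that makes the real estimator unbiased and scale-invariant.

```python
import hashlib

class AdaptiveBitmap:
    """Illustrative adaptive-sampling bitmap for distinct counting.

    The sampling rate drops as more bits fill, which is the idea
    behind S-bitmap's scale-invariance; the exact schedule here is
    an assumption for illustration.
    """

    def __init__(self, m=4096, decay=0.999):
        self.m = m            # bitmap size
        self.bits = [0] * m
        self.filled = 0       # number of entries flipped to 1 so far
        self.decay = decay    # hypothetical rate-decay parameter

    def _hash01(self, item, salt):
        # Deterministic pseudo-uniform value in [0, 1) per (item, salt).
        h = hashlib.sha256(f"{salt}:{item}".encode()).digest()
        return int.from_bytes(h[:8], "big") / 2**64

    def add(self, item):
        # Sampling rate decreases as more entries change from 0 to 1.
        rate = self.decay ** self.filled
        # A duplicate hashes to the same value and bucket, and rates only
        # decrease over time, so a repeated item never fills a second bit.
        if self._hash01(item, "sample") < rate:
            bucket = int(self._hash01(item, "bucket") * self.m)
            if not self.bits[bucket]:
                self.bits[bucket] = 1
                self.filled += 1

    def estimate(self):
        # Rough inversion: a bit filled at level j represents ~1/rate_j
        # distinct items. Ignores bucket collisions; illustration only.
        return sum(1.0 / (self.decay ** j) for j in range(self.filled))
```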
ALGORITHMS FOR MASSIVE, EXPENSIVE, OR OTHERWISE INCONVENIENT GRAPHS
A long-standing assumption in algorithm design is that any part of the input is accessible at any time for unit cost. However, as we work with increasingly large data sets, or as we build smaller devices, we must revisit this assumption. In this thesis, I present some of my work on graph algorithms designed for circumstances where traditional assumptions about inputs do not apply.
1. Classical graph algorithms require direct access to the input graph, which is not feasible when the graph is too large to fit in memory. For computation on massive graphs we consider the dynamic streaming graph model: given an input graph defined as a stream of edge insertions and deletions, the goal is to approximate properties of this graph using space that is sublinear in the size of the stream. I present algorithms for approximating vertex connectivity, hypergraph edge connectivity, maximum coverage, unique coverage, and temporal connectivity in graph streams (a sketch of a basic building block for such insert/delete sketches follows this abstract).
2. In certain applications the input graph is not explicitly represented, but its edges may be discovered via queries that require costly computation or measurement. I present two open-source systems that solve real-world problems via graph algorithms which may access their inputs only through costly edge queries. MESH is a memory manager that compacts memory efficiently by finding an approximate graph matching subject to stringent time and edge-query restrictions. PathCache is an efficiently scalable network measurement platform that outperforms the current state of the art.
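As background for the dynamic streaming graph model mentioned above, here is a minimal sketch of 1-sparse recovery, a standard building block behind insert/delete graph sketches such as those used for streaming connectivity. This is illustrative Python under simplifying assumptions, not code from the thesis.

```python
class OneSparseRecovery:
    """Recover the single surviving item from a stream of insertions
    and deletions, using O(1) space.

    Real sketches add a random fingerprint to *verify* 1-sparsity;
    this simplified version trusts the counter.
    """

    def __init__(self):
        self.count = 0      # net number of surviving items
        self.weighted = 0   # net sum of surviving item ids

    def update(self, item_id, delta):
        # delta = +1 for an insertion, -1 for a deletion
        self.count += delta
        self.weighted += delta * item_id

    def recover(self):
        # Valid only when exactly one item survives.
        return self.weighted if self.count == 1 else None

# Edges inserted, then all but one deleted: the survivor is recovered.
s = OneSparseRecovery()
for edge_id, delta in [(7, +1), (3, +1), (3, -1)]:
    s.update(edge_id, delta)
print(s.recover())  # -> 7
```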
IMAX: incremental maintenance of schema-based XML statistics
Current approaches for estimating the cardinality of XML queries are applicable to a static scenario wherein the underlying XML data does not change after statistics are collected on the repository. However, in practice, many XML-based applications are dynamic and involve frequent updates to the data. In this paper, we investigate efficient strategies for incrementally maintaining statistical summaries as updates are applied to the data. Specifically, we propose algorithms that handle both the addition of new documents and random insertions in the existing document trees. We also show, through a detailed performance evaluation, that our incremental techniques are significantly faster than the naive recomputation approach, and that estimation accuracy can be maintained even with a fixed memory budget.
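To make the incremental idea concrete, here is a minimal sketch that maintains a root-to-node path-count summary under document additions, instead of recomputing statistics over the whole repository. The tuple-based tree encoding and the function name are hypothetical, not the IMAX data structures.

```python
from collections import Counter

def add_paths(counts, node, prefix=""):
    """Fold one (tag, children) tree into a path-count summary in place."""
    tag, children = node
    path = f"{prefix}/{tag}"
    counts[path] += 1
    for child in children:
        add_paths(counts, child, path)

summary = Counter()
doc1 = ("book", [("title", []), ("author", [])])
add_paths(summary, doc1)

# Incremental maintenance: a newly added document updates the existing
# summary; previously collected documents need not be rescanned.
doc2 = ("book", [("title", []), ("author", []), ("author", [])])
add_paths(summary, doc2)
print(summary["/book/author"])  # -> 3
```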
Streaming Algorithms Via Reductions
In the streaming algorithms model of computation we must process data in order and without enough memory to remember the entire input. We study reductions between problems in the streaming model with an eye to using reductions as an algorithm design technique. Our contributions include:
* Linear Transformation reductions, which compose with existing linear sketch techniques. We use these for small-space algorithms for numeric measurements of distance-from-periodicity, finding the period of a numeric stream, and detecting cyclic shifts.
* The first streaming graph algorithms in the 'sliding window' model, where we must consider only the most recent L elements for some fixed threshold L. We develop basic algorithms for connectivity and unweighted maximum matching, then develop a variety of other algorithms via reductions to these problems.
* A new reduction from maximum weighted matching to maximum unweighted matching (a simplified sketch follows this list). This reduction immediately yields improved approximation guarantees for maximum weighted matching in the semi-streaming, sliding window, and MapReduce models, and extends to the more general problem of finding maximum independent sets in p-systems.
* Algorithms in a stream-of-samples model which exhibit clear sample vs. space tradeoffs. These algorithms are also inspired by examining reductions. We provide algorithms for calculating F_k frequency moments and graph connectivity.
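As a simplified sketch of the weighted-to-unweighted matching reduction listed above: edges are bucketed into geometric weight classes, each class is matched without weights, and the per-class matchings are combined greedily from heaviest to lightest. A greedy maximal matching stands in for the streaming unweighted subroutine, and the combining rule is illustrative rather than the thesis's construction.

```python
import math

def greedy_matching(edges):
    """Stand-in for the unweighted matching subroutine (greedy maximal)."""
    matched, result = set(), []
    for u, v in edges:
        if u not in matched and v not in matched:
            result.append((u, v))
            matched.update((u, v))
    return result

def weighted_via_unweighted(weighted_edges, eps=0.5):
    """Reduce weighted matching to unweighted matching per weight class."""
    buckets = {}
    for u, v, w in weighted_edges:
        cls = int(math.log(w, 1 + eps))  # geometric weight class
        buckets.setdefault(cls, []).append((u, v))
    matched, result = set(), []
    # Combine class matchings greedily, heaviest class first.
    for cls in sorted(buckets, reverse=True):
        for u, v in greedy_matching(buckets[cls]):
            if u not in matched and v not in matched:
                result.append((u, v))
                matched.update((u, v))
    return result

# -> [(1, 2)]: the heavy edge wins over the light edge sharing vertex 2.
print(weighted_via_unweighted([(1, 2, 10.0), (2, 3, 1.0)]))
```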
A Parallel Space Saving Algorithm For Frequent Items and the Hurwitz zeta distribution
We present a message-passing based parallel version of the Space Saving
algorithm designed to solve the k-majority problem. The algorithm determines,
in parallel, the frequent items, i.e., those whose frequency exceeds a given
threshold, and is therefore useful for iceberg queries and many other
contexts. We apply our algorithm to the detection of frequent items in both
real and synthetic datasets whose probability distribution functions are a
Hurwitz zeta and a Zipf distribution, respectively. Also, we compare its
parallel performance and accuracy against a parallel algorithm recently
proposed for merging summaries derived by the Space Saving or Frequent
algorithms. Comment: accepted for publication; to appear in Information
Sciences, Elsevier. http://www.sciencedirect.com/science/article/pii/S002002551500657
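For context, here is a minimal sketch of the sequential Space Saving core that the paper parallelizes; the message-passing parallelization and the summary-merging comparison are the paper's contributions and are not shown here.

```python
def space_saving(stream, k):
    """Track at most k counters; any item whose true frequency exceeds
    n/k (n = stream length) is guaranteed to end up among them."""
    counters = {}
    for x in stream:
        if x in counters:
            counters[x] += 1
        elif len(counters) < k:
            counters[x] = 1
        else:
            # Evict the minimum counter and inherit its count, so
            # estimates overestimate true frequencies by at most n/k.
            victim = min(counters, key=counters.get)
            counters[x] = counters.pop(victim) + 1
    return counters

stream = ["a", "a", "a", "b", "a", "c", "a", "a"]
print(space_saving(stream, k=2))  # -> {'a': 6, 'c': 2}
```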