Distinct counting with a self-learning bitmap
Counting the number of distinct elements (cardinality) in a dataset is a
fundamental problem in database management. In recent years, due to many of its
modern applications, there has been significant interest in addressing the
distinct counting problem in a data stream setting, where each incoming data
item can be seen only once and cannot be stored for long periods of time. Many
probabilistic approaches based on either sampling or sketching have been
proposed in the computer science literature that require only limited
computing and memory resources. However, the performance of these methods is
not scale-invariant, in the sense that their relative root mean square
estimation errors (RRMSE) depend on the unknown cardinalities. This is not
desirable in many applications where cardinalities can be very dynamic or
inhomogeneous and many cardinalities need to be estimated. In this paper, we
develop a novel approach, called the self-learning bitmap (S-bitmap), which is
scale-invariant for cardinalities in a specified range. S-bitmap uses a binary
vector whose entries are updated from 0 to 1 by an adaptive sampling process
for inferring the unknown cardinality, where the sampling rates are reduced
sequentially as more and more entries change from 0 to 1. We prove rigorously
that the S-bitmap estimate is not only unbiased but also scale-invariant. We
demonstrate that, to achieve a given small RRMSE, our approach requires
significantly less memory and performs a similar or smaller number of
operations than state-of-the-art methods at many cardinality scales common in
practice. Both simulation and experimental studies are reported. Comment:
Journal of the American Statistical Association (accepted).
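To make the mechanism concrete, the following is a minimal Python sketch of an adaptive-sampling bitmap in the spirit of S-bitmap. It is not the paper's construction: the geometric rate schedule `decay ** filled` and the crude estimator are hypothetical stand-ins for the carefully derived rate sequence that makes the real estimator unbiased and scale-invariant.

```python
import hashlib

class AdaptiveBitmap:
    """Illustrative adaptive-sampling bitmap for distinct counting.

    The sampling rate drops as more bits fill, which is the idea
    behind S-bitmap's scale-invariance; the exact schedule here is
    an assumption for illustration.
    """

    def __init__(self, m=4096, decay=0.999):
        self.m = m            # bitmap size
        self.bits = [0] * m
        self.filled = 0       # number of entries flipped to 1 so far
        self.decay = decay    # hypothetical rate-decay parameter

    def _hash01(self, item, salt):
        # Deterministic pseudo-uniform value in [0, 1) per (item, salt).
        h = hashlib.sha256(f"{salt}:{item}".encode()).digest()
        return int.from_bytes(h[:8], "big") / 2**64

    def add(self, item):
        # Sampling rate decreases as more entries change from 0 to 1.
        rate = self.decay ** self.filled
        # A duplicate hashes to the same value and bucket, and rates only
        # decrease over time, so a repeated item never fills a second bit.
        if self._hash01(item, "sample") < rate:
            bucket = int(self._hash01(item, "bucket") * self.m)
            if not self.bits[bucket]:
                self.bits[bucket] = 1
                self.filled += 1

    def estimate(self):
        # Rough inversion: a bit filled at level j represents ~1/rate_j
        # distinct items. Ignores bucket collisions; illustration only.
        return sum(1.0 / (self.decay ** j) for j in range(self.filled))
```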
ALGORITHMS FOR MASSIVE, EXPENSIVE, OR OTHERWISE INCONVENIENT GRAPHS
A long-standing assumption in algorithm design is that any part of the input is accessible at any time for unit cost. However, as we work with increasingly large data sets, or as we build smaller devices, we must revisit this assumption. In this thesis, I present some of my work on graph algorithms designed for circumstances where traditional assumptions about inputs do not apply.
1. Classical graph algorithms require direct access to the input graph, which is not feasible when the graph is too large to fit in memory. For computation on massive graphs we consider the dynamic streaming graph model: given an input graph defined as a stream of edge insertions and deletions, the goal is to approximate properties of this graph using space that is sublinear in the size of the stream. I present algorithms for approximating vertex connectivity, hypergraph edge connectivity, maximum coverage, unique coverage, and temporal connectivity in graph streams (a sketch of a basic building block for such insert/delete sketches follows this abstract).
2. In certain applications the input graph is not explicitly represented, but its edges may be discovered via queries that require costly computation or measurement. I present two open-source systems that solve real-world problems via graph algorithms which may access their inputs only through costly edge queries. MESH is a memory manager that compacts memory efficiently by finding an approximate graph matching subject to stringent time and edge-query restrictions. PathCache is an efficiently scalable network measurement platform that outperforms the current state of the art.
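As background for the dynamic streaming graph model mentioned above, here is a minimal sketch of 1-sparse recovery, a standard building block behind insert/delete graph sketches such as those used for streaming connectivity. This is illustrative Python under simplifying assumptions, not code from the thesis.

```python
class OneSparseRecovery:
    """Recover the single surviving item from a stream of insertions
    and deletions, using O(1) space.

    Real sketches add a random fingerprint to *verify* 1-sparsity;
    this simplified version trusts the counter.
    """

    def __init__(self):
        self.count = 0      # net number of surviving items
        self.weighted = 0   # net sum of surviving item ids

    def update(self, item_id, delta):
        # delta = +1 for an insertion, -1 for a deletion
        self.count += delta
        self.weighted += delta * item_id

    def recover(self):
        # Valid only when exactly one item survives.
        return self.weighted if self.count == 1 else None

# Edges inserted, then all but one deleted: the survivor is recovered.
s = OneSparseRecovery()
for edge_id, delta in [(7, +1), (3, +1), (3, -1)]:
    s.update(edge_id, delta)
print(s.recover())  # -> 7
```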
IMAX: incremental maintenance of schema-based XML statistics
Current approaches for estimating the cardinality of XML queries are applicable to a static scenario wherein the underlying XML data does not change after statistics are collected on the repository. However, in practice, many XML-based applications are dynamic and involve frequent updates to the data. In this paper, we investigate efficient strategies for incrementally maintaining statistical summaries as updates are applied to the data. Specifically, we propose algorithms that handle both the addition of new documents and random insertions in the existing document trees. We also show, through a detailed performance evaluation, that our incremental techniques are significantly faster than the naive recomputation approach, and that estimation accuracy can be maintained even with a fixed memory budget.
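To make the incremental idea concrete, here is a minimal sketch that maintains a root-to-node path-count summary under document additions, instead of recomputing statistics over the whole repository. The tuple-based tree encoding and the function name are hypothetical, not the IMAX data structures.

```python
from collections import Counter

def add_paths(counts, node, prefix=""):
    """Fold one (tag, children) tree into a path-count summary in place."""
    tag, children = node
    path = f"{prefix}/{tag}"
    counts[path] += 1
    for child in children:
        add_paths(counts, child, path)

summary = Counter()
doc1 = ("book", [("title", []), ("author", [])])
add_paths(summary, doc1)

# Incremental maintenance: a newly added document updates the existing
# summary; previously collected documents need not be rescanned.
doc2 = ("book", [("title", []), ("author", []), ("author", [])])
add_paths(summary, doc2)
print(summary["/book/author"])  # -> 3
```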
Streaming Algorithms Via Reductions
In the streaming algorithms model of computation we must process data in order and without enough memory to remember the entire input. We study reductions between problems in the streaming model with an eye to using reductions as an algorithm design technique. Our contributions include:
* Linear Transformation reductions, which compose with existing linear sketch techniques. We use these for small-space algorithms for numeric measurements of distance-from-periodicity, finding the period of a numeric stream, and detecting cyclic shifts.
* The first streaming graph algorithms in the 'sliding window' model, where we must consider only the most recent L elements for some fixed threshold L. We develop basic algorithms for connectivity and unweighted maximum matching, then develop a variety of other algorithms via reductions to these problems.
* A new reduction from maximum weighted matching to maximum unweighted matching (a simplified sketch follows this list). This reduction immediately yields improved approximation guarantees for maximum weighted matching in the semi-streaming, sliding window, and MapReduce models, and extends to the more general problem of finding maximum independent sets in p-systems.
* Algorithms in a stream-of-samples model which exhibit clear sample vs. space tradeoffs. These algorithms are also inspired by examining reductions. We provide algorithms for calculating F_k frequency moments and graph connectivity.
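As a simplified sketch of the weighted-to-unweighted matching reduction listed above: edges are bucketed into geometric weight classes, each class is matched without weights, and the per-class matchings are combined greedily from heaviest to lightest. A greedy maximal matching stands in for the streaming unweighted subroutine, and the combining rule is illustrative rather than the thesis's construction.

```python
import math

def greedy_matching(edges):
    """Stand-in for the unweighted matching subroutine (greedy maximal)."""
    matched, result = set(), []
    for u, v in edges:
        if u not in matched and v not in matched:
            result.append((u, v))
            matched.update((u, v))
    return result

def weighted_via_unweighted(weighted_edges, eps=0.5):
    """Reduce weighted matching to unweighted matching per weight class."""
    buckets = {}
    for u, v, w in weighted_edges:
        cls = int(math.log(w, 1 + eps))  # geometric weight class
        buckets.setdefault(cls, []).append((u, v))
    matched, result = set(), []
    # Combine class matchings greedily, heaviest class first.
    for cls in sorted(buckets, reverse=True):
        for u, v in greedy_matching(buckets[cls]):
            if u not in matched and v not in matched:
                result.append((u, v))
                matched.update((u, v))
    return result

# -> [(1, 2)]: the heavy edge wins over the light edge sharing vertex 2.
print(weighted_via_unweighted([(1, 2, 10.0), (2, 3, 1.0)]))
```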
A Parallel Space Saving Algorithm For Frequent Items and the Hurwitz zeta distribution
We present a message-passing based parallel version of the Space Saving
algorithm designed to solve the k-majority problem. The algorithm determines,
in parallel, the frequent items, i.e., those whose frequency exceeds a given
threshold, and is therefore useful for iceberg queries and many other
contexts. We apply our algorithm to the detection of frequent items in both
real and synthetic datasets whose probability distribution functions are a
Hurwitz zeta and a Zipf distribution, respectively. Also, we compare its
parallel performance and accuracy against a parallel algorithm recently
proposed for merging summaries derived by the Space Saving or Frequent
algorithms. Comment: accepted for publication; to appear in Information
Sciences, Elsevier. http://www.sciencedirect.com/science/article/pii/S002002551500657
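For context, here is a minimal sketch of the sequential Space Saving core that the paper parallelizes; the message-passing parallelization and the summary-merging comparison are the paper's contributions and are not shown here.

```python
def space_saving(stream, k):
    """Track at most k counters; any item whose true frequency exceeds
    n/k (n = stream length) is guaranteed to end up among them."""
    counters = {}
    for x in stream:
        if x in counters:
            counters[x] += 1
        elif len(counters) < k:
            counters[x] = 1
        else:
            # Evict the minimum counter and inherit its count, so
            # estimates overestimate true frequencies by at most n/k.
            victim = min(counters, key=counters.get)
            counters[x] = counters.pop(victim) + 1
    return counters

stream = ["a", "a", "a", "b", "a", "c", "a", "a"]
print(space_saving(stream, k=2))  # -> {'a': 6, 'c': 2}
```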