39 research outputs found
Continuous Monitoring of Distributed Data Streams over a Time-based Sliding Window
The past decade has witnessed many interesting algorithms for maintaining
statistics over a data stream. This paper initiates a theoretical study of
algorithms for monitoring distributed data streams over a time-based sliding
window (which contains a variable number of items and possibly out-of-order
items). The concern is how to minimize the communication between individual
streams and the root, while allowing the root, at any time, to be able to
report the global statistics of all streams within a given error bound. This
paper presents communication-efficient algorithms for three classical
statistics, namely, basic counting, frequent items and quantiles. The
worst-case communication cost over a window is bits for basic counting and words for the remainings, where is the number of distributed
data streams, is the total number of items in the streams that arrive or
expire in the window, and is the desired error bound. Matching
and nearly matching lower bounds are also obtained.Comment: 12 pages, to appear in the 27th International Symposium on
Theoretical Aspects of Computer Science (STACS), 201
Efficient Summing over Sliding Windows
This paper considers the problem of maintaining statistic aggregates over the
last W elements of a data stream. First, the problem of counting the number of
1's in the last W bits of a binary stream is considered. A lower bound of
{\Omega}(1/{\epsilon} + log W) memory bits for W{\epsilon}-additive
approximations is derived. This is followed by an algorithm whose memory
consumption is O(1/{\epsilon} + log W) bits, indicating that the algorithm is
optimal and that the bound is tight. Next, the more general problem of
maintaining a sum of the last W integers, each in the range of {0,1,...,R}, is
addressed. The paper shows that approximating the sum within an additive error
of RW{\epsilon} can also be done using {\Theta}(1/{\epsilon} + log W) bits for
{\epsilon}={\Omega}(1/W). For {\epsilon}=o(1/W), we present a succinct
algorithm which uses B(1 + o(1)) bits, where B={\Theta}(Wlog(1/W{\epsilon})) is
the derived lower bound. We show that all lower bounds generalize to randomized
algorithms as well. All algorithms process new elements and answer queries in
O(1) worst-case time.Comment: A shorter version appears in SWAT 201
An evaluation of streaming algorithms for distinct counting over a sliding window
Counting the number of distinct elements in a data stream (distinct counting) is a fundamental aggregation task in database query processing, query optimization, and network monitoring. On a stream of elements, it is commonly needed to compute an aggregate over only the most recent elements, leading to the problem of distinct counting over a “sliding window” of the stream. We present a detailed experimental study of the performance of different algorithms for distinct counting over a sliding window. We observe that the performance of an algorithm depends on the basic method used, as well as aspects such as the hash function, the mix of query and updates, and the method used to boost accuracy. We compare the performance of prominent algorithms and evaluate the influence of these factors, leading to practical recommendations for implementation. To the best of our knowledge, this is the first detailed experimental study of distinct counting over a sliding window
Randomized Algorithms for Tracking Distributed Count, Frequencies, and Ranks
We show that randomization can lead to significant improvements for a few
fundamental problems in distributed tracking. Our basis is the {\em
count-tracking} problem, where there are players, each holding a counter
that gets incremented over time, and the goal is to track an
\eps-approximation of their sum continuously at all times,
using minimum communication. While the deterministic communication complexity
of the problem is \Theta(k/\eps \cdot \log N), where is the final value
of when the tracking finishes, we show that with randomization, the
communication cost can be reduced to \Theta(\sqrt{k}/\eps \cdot \log N). Our
algorithm is simple and uses only O(1) space at each player, while the lower
bound holds even assuming each player has infinite computing power. Then, we
extend our techniques to two related distributed tracking problems: {\em
frequency-tracking} and {\em rank-tracking}, and obtain similar improvements
over previous deterministic algorithms. Both problems are of central importance
in large data monitoring and analysis, and have been extensively studied in the
literature.Comment: 19 pages, 1 figur
FLEET: Butterfly Estimation from a Bipartite Graph Stream
We consider space-efficient single-pass estimation of the number of
butterflies, a fundamental bipartite graph motif, from a massive bipartite
graph stream where each edge represents a connection between entities in two
different partitions. We present a space lower bound for any streaming
algorithm that can estimate the number of butterflies accurately, as well as
FLEET, a suite of algorithms for accurately estimating the number of
butterflies in the graph stream. Estimates returned by the algorithms come with
provable guarantees on the approximation error, and experiments show good
tradeoffs between the space used and the accuracy of approximation. We also
present space-efficient algorithms for estimating the number of butterflies
within a sliding window of the most recent elements in the stream. While there
is a significant body of work on counting subgraphs such as triangles in a
unipartite graph stream, our work seems to be one of the few to tackle the case
of bipartite graph streams.Comment: This is the author's version of the work. It is posted here by
permission of ACM for your personal use. Not for redistribution. The
definitive version was published in Seyed-Vahid Sanei-Mehri, Yu Zhang, Ahmet
Erdem Sariyuce and Srikanta Tirthapura. "FLEET: Butterfly Estimation from a
Bipartite Graph Stream". The 28th ACM International Conference on Information
and Knowledge Managemen