528 research outputs found
Identifying Correlated Heavy-Hitters in a Two-Dimensional Data Stream
We consider online mining of correlated heavy-hitters from a data stream.
Given a stream of two-dimensional data, a correlated aggregate query first
extracts a substream by applying a predicate along a primary dimension, and
then computes an aggregate along a secondary dimension. Prior work on
identifying heavy-hitters in streams has almost exclusively focused on
identifying heavy-hitters on a single dimensional stream, and these yield
little insight into the properties of heavy-hitters along other dimensions. In
typical applications however, an analyst is interested not only in identifying
heavy-hitters, but also in understanding further properties such as: what other
items appear frequently along with a heavy-hitter, or what is the frequency
distribution of items that appear along with the heavy-hitters. We consider
queries of the following form: In a stream S of (x, y) tuples, on the substream
H of all x values that are heavy-hitters, maintain those y values that occur
frequently with the x values in H. We call this problem as Correlated
Heavy-Hitters (CHH). We formulate an approximate formulation of CHH
identification, and present an algorithm for tracking CHHs on a data stream.
The algorithm is easy to implement and uses workspace which is orders of
magnitude smaller than the stream itself. We present provable guarantees on the
maximum error, as well as detailed experimental results that demonstrate the
space-accuracy trade-off
Weighted Reservoir Sampling from Distributed Streams
We consider message-efficient continuous random sampling from a distributed
stream, where the probability of inclusion of an item in the sample is
proportional to a weight associated with the item. The unweighted version,
where all weights are equal, is well studied, and admits tight upper and lower
bounds on message complexity. For weighted sampling with replacement, there is
a simple reduction to unweighted sampling with replacement. However, in many
applications the stream has only a few heavy items which may dominate a random
sample when chosen with replacement. Weighted sampling \textit{without
replacement} (weighted SWOR) eludes this issue, since such heavy items can be
sampled at most once.
In this work, we present the first message-optimal algorithm for weighted
SWOR from a distributed stream. Our algorithm also has optimal space and time
complexity. As an application of our algorithm for weighted SWOR, we derive the
first distributed streaming algorithms for tracking \textit{heavy hitters with
residual error}. Here the goal is to identify stream items that contribute
significantly to the residual stream, once the heaviest items are removed.
Residual heavy hitters generalize the notion of heavy hitters and are
important in streams that have a skewed distribution of weights. In addition to
the upper bound, we also provide a lower bound on the message complexity that
is nearly tight up to a factor. Finally, we use our weighted
sampling algorithm to improve the message complexity of distributed
tracking, also known as count tracking, which is a widely studied problem in
distributed streaming. We also derive a tight message lower bound, which closes
the message complexity of this fundamental problem.Comment: To appear in PODS 201
Efficient Summing over Sliding Windows
This paper considers the problem of maintaining statistic aggregates over the
last W elements of a data stream. First, the problem of counting the number of
1's in the last W bits of a binary stream is considered. A lower bound of
{\Omega}(1/{\epsilon} + log W) memory bits for W{\epsilon}-additive
approximations is derived. This is followed by an algorithm whose memory
consumption is O(1/{\epsilon} + log W) bits, indicating that the algorithm is
optimal and that the bound is tight. Next, the more general problem of
maintaining a sum of the last W integers, each in the range of {0,1,...,R}, is
addressed. The paper shows that approximating the sum within an additive error
of RW{\epsilon} can also be done using {\Theta}(1/{\epsilon} + log W) bits for
{\epsilon}={\Omega}(1/W). For {\epsilon}=o(1/W), we present a succinct
algorithm which uses B(1 + o(1)) bits, where B={\Theta}(Wlog(1/W{\epsilon})) is
the derived lower bound. We show that all lower bounds generalize to randomized
algorithms as well. All algorithms process new elements and answer queries in
O(1) worst-case time.Comment: A shorter version appears in SWAT 201
Efficient Identification of TOP-K Heavy Hitters over Sliding Windows
This is the author accepted manuscript. The final version is available from Springer Verlag via the DOI in this recordDue to the increasing volume of network traffic and growing complexity of network environment, rapid identification of heavy hitters is quite challenging. To deal with the massive data streams in real-time, accurate and scalable solution is required. The traditional method to keep an individual counter for each host in the whole data streams is very resource-consuming. This paper presents a new data structure called FCM and its associated algorithms. FCM combines the count-min sketch with the stream-summary structure simultaneously for efficient TOP-K heavy hitter identification in one pass. The key point of this algorithm is that it introduces a novel filter-and-jump mechanism. Given that the Internet traffic has the property of being heavy-tailed and hosts of low frequencies account for the majority of the IP addresses, FCM periodically filters the mice from input streams to efficiently improve the accuracy of TOP-K heavy hitter identification. On the other hand, considering that abnormal events are always time sensitive, our algorithm works by adjusting its measurement window to the newly arrived elements in the data streams automatically. Our experimental results demonstrate that the performance of FCM is superior to the previous related algorithm. Additionally this solution has a good prospect of application in advanced network environment.Chinese Academy of SciencesNational Natural Science Foundation of Chin
- …