Identifying Correlated Heavy-Hitters in a Two-Dimensional Data Stream
We consider online mining of correlated heavy-hitters from a data stream.
Given a stream of two-dimensional data, a correlated aggregate query first
extracts a substream by applying a predicate along a primary dimension, and
then computes an aggregate along a secondary dimension. Prior work on
identifying heavy-hitters in streams has focused almost exclusively on
single-dimensional streams, and yields little insight into the properties
of heavy-hitters along other dimensions. In
typical applications, however, an analyst is interested not only in identifying
heavy-hitters, but also in understanding further properties such as: what other
items appear frequently along with a heavy-hitter, or what is the frequency
distribution of items that appear along with the heavy-hitters. We consider
queries of the following form: In a stream S of (x, y) tuples, on the substream
H of all x values that are heavy-hitters, maintain those y values that occur
frequently with the x values in H. We call this problem Correlated
Heavy-Hitters (CHH). We give an approximate formulation of CHH
identification, and present an algorithm for tracking CHHs on a data stream.
The algorithm is easy to implement and uses a workspace that is orders of
magnitude smaller than the stream itself. We present provable guarantees on the
maximum error, as well as detailed experimental results that demonstrate the
space-accuracy trade-off.
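As a rough illustration of the two-level structure such a query suggests (not the paper's actual algorithm), one can nest one Misra-Gries frequency summary inside another: an outer summary tracks heavy-hitter candidates along x, and each tracked x keeps an inner summary of the y values co-occurring with it. The class and parameter names below are hypothetical.

```python
class MisraGries:
    """Misra-Gries frequency summary using at most k-1 counters."""
    def __init__(self, k):
        self.k = k
        self.counts = {}

    def add(self, item):
        if item in self.counts:
            self.counts[item] += 1
        elif len(self.counts) < self.k - 1:
            self.counts[item] = 1
        else:
            # decrement every counter; drop those that reach zero
            for key in list(self.counts):
                self.counts[key] -= 1
                if self.counts[key] == 0:
                    del self.counts[key]

class CHHSketch:
    """Hypothetical two-level sketch for correlated heavy hitters:
    an outer summary over x values, plus a nested summary of the
    y values seen alongside each tracked x."""
    def __init__(self, k_x, k_y):
        self.k_y = k_y
        self.outer = MisraGries(k_x)
        self.inner = {}

    def add(self, x, y):
        self.outer.add(x)
        # discard nested summaries for x values evicted from the outer summary
        for key in list(self.inner):
            if key not in self.outer.counts:
                del self.inner[key]
        if x in self.outer.counts:
            self.inner.setdefault(x, MisraGries(self.k_y)).add(y)

    def candidates(self):
        """Map each candidate heavy x to its frequent y candidates."""
        return {x: sorted(mg.counts) for x, mg in self.inner.items()}
```

This sketch inherits Misra-Gries's one-sided error at both levels; the paper's contribution lies in the precise error guarantees, which this illustration does not reproduce.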
Monitoring frequent items over distributed data streams.
Many important applications require the discovery of items that have occurred frequently. Knowledge of these items is commonly used in anomaly detection and network monitoring tasks. Effective solutions for this problem focus mainly on reducing memory requirements in a centralized environment. These solutions, however, ignore the inherently distributed nature of many systems. Naively forwarding data to a centralized location is not practical when dealing with high-speed data streams and will result in significant communication overhead. This thesis proposes a new approach designed for continuously tracking frequent items over distributed data streams, providing either exact or approximate answers. The method introduced is a direct modification of an existing communication-efficient algorithm called Top-K Monitoring. Experimental results demonstrated that the proposed modifications significantly reduced communication cost and improved scalability. Also examined in this thesis is the applicability of frequent-item monitoring for detecting distributed denial-of-service attacks. Simulation of the proposed tracking method against four different attack patterns was conducted. The outcome of these experiments showed promising results when compared to previous detection methods.
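The communication-saving idea behind Top-K Monitoring-style protocols, reporting to the coordinator only when a local count drifts past a slack budget, can be sketched as follows. This is a simplified illustration, not the thesis's actual protocol; the `Node`, `Coordinator`, and `slack` names are hypothetical.

```python
from collections import Counter

class Coordinator:
    """Central site maintaining approximate global counts."""
    def __init__(self):
        self.global_counts = Counter()
        self.messages = 0  # communication cost in number of reports

    def update(self, item, delta):
        self.messages += 1
        self.global_counts[item] += delta

    def top_k(self, k):
        return [item for item, _ in self.global_counts.most_common(k)]

class Node:
    """Remote monitor: reports only count deltas exceeding its slack."""
    def __init__(self, slack):
        self.slack = slack
        self.counts = Counter()    # true local counts
        self.reported = Counter()  # counts last shipped to the coordinator

    def observe(self, item, coordinator):
        self.counts[item] += 1
        # communicate only when local drift reaches the slack budget
        drift = self.counts[item] - self.reported[item]
        if drift >= self.slack:
            coordinator.update(item, drift)
            self.reported[item] = self.counts[item]
```

With slack s, each node sends at most one message per s observations of an item, and the coordinator's count of any item lags the truth by less than s per node, which is the usual accuracy/communication trade-off knob in such protocols.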
An efficient closed frequent itemset miner for the MOA stream mining system
Mining itemsets is a central task in data mining, both in the batch and the streaming paradigms. While robust, efficient, and well-tested implementations exist for batch mining, hardly any publicly available equivalent exists for the streaming scenario. The lack of an efficient, usable tool for the task hinders its use by practitioners and makes it difficult to assess new research in the area. To alleviate this situation, we review the algorithms described in the literature, and implement and evaluate the IncMine algorithm by Cheng, Ke, and Ng (2008) for mining frequent closed itemsets from data streams. Our implementation works on top of the MOA (Massive Online Analysis) stream mining framework to ease its use and integration with other stream mining tasks. We provide a PAC-style rigorous analysis of the quality of the output of IncMine as a function of its parameters; this type of analysis is rare in pattern mining algorithms. As a by-product, the analysis shows how one of the user-provided parameters in the original description can be removed entirely while retaining the performance guarantees. Finally, we experimentally confirm both on synthetic and real data the excellent performance of the algorithm, as reported in the original paper, and its ability to handle concept drift.
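To make the notion concrete: an itemset is *closed* if no proper superset has the same support. The brute-force batch computation below illustrates the definition only; IncMine's contribution is maintaining these sets incrementally over a sliding window of a stream, which this sketch does not attempt.

```python
from itertools import combinations

def closed_frequent_itemsets(transactions, min_support):
    """Brute-force closed frequent itemsets over a list of transaction
    sets (for illustration; exponential in the number of items)."""
    items = sorted({i for t in transactions for i in t})
    support = {}
    for r in range(1, len(items) + 1):
        for cand in combinations(items, r):
            s = sum(1 for t in transactions if set(cand) <= t)
            if s >= min_support:
                support[frozenset(cand)] = s
    # keep only itemsets with no equal-support proper superset
    return {X: s for X, s in support.items()
            if not any(X < Y and support[Y] == s for Y in support)}
```

For example, over transactions `{a,b}, {a,b}, {a,c}` with minimum support 2, `{b}` is frequent but not closed, because its superset `{a,b}` has the same support; the closed sets are `{a}` and `{a,b}`.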
Efficient Summing over Sliding Windows
This paper considers the problem of maintaining statistical aggregates over the
last W elements of a data stream. First, the problem of counting the number of
1's in the last W bits of a binary stream is considered. A lower bound of
Ω(1/ε + log W) memory bits for Wε-additive approximations is derived. This is
followed by an algorithm whose memory consumption is O(1/ε + log W) bits,
showing that the algorithm is optimal and that the bound is tight. Next, the
more general problem of maintaining a sum of the last W integers, each in the
range {0, 1, ..., R}, is addressed. The paper shows that approximating the sum
within an additive error of RWε can also be done using Θ(1/ε + log W) bits for
ε = Ω(1/W). For ε = o(1/W), we present a succinct algorithm which uses
B(1 + o(1)) bits, where B = Θ(W log(1/(Wε))) is the derived lower bound. We
show that all lower bounds generalize to randomized algorithms as well. All
algorithms process new elements and answer queries in O(1) worst-case time.
Comment: A shorter version appears in SWAT 201
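One way to see where the 1/ε term in such space bounds comes from is a block-based counter: split the window into blocks of ⌈εW⌉ elements and store only per-block subtotals, so the partially expired oldest block contributes an additive error of O(RεW). The sketch below illustrates this idea only; it is not the paper's optimal algorithm, and the class name is hypothetical.

```python
from collections import deque
from math import ceil

class ApproxWindowSum:
    """Approximate sum of the last W stream values using O(1/eps)
    per-block subtotals; the oldest, partially expired block adds
    at most R * ceil(eps * W) of additive error (R = max value)."""
    def __init__(self, W, eps):
        self.W = W
        self.block_size = max(1, ceil(eps * W))
        self.max_blocks = -(-W // self.block_size)  # ceil(W / block_size)
        self.blocks = deque()  # subtotals of completed blocks
        self.cur_sum = 0       # subtotal of the block being filled
        self.cur_len = 0

    def add(self, value):
        self.cur_sum += value
        self.cur_len += 1
        if self.cur_len == self.block_size:
            self.blocks.append(self.cur_sum)
            self.cur_sum = 0
            self.cur_len = 0
            # evict blocks lying entirely outside the window
            while len(self.blocks) > self.max_blocks:
                self.blocks.popleft()

    def estimate(self):
        return sum(self.blocks) + self.cur_sum
```

Storing each of the O(1/ε) subtotals exactly costs O(log(RW)) bits apiece; the paper's contribution is getting the total space down to Θ(1/ε + log W) bits for the binary case, which requires encoding the blocks far more compactly than this sketch does.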