5,137 research outputs found

    Identifying Correlated Heavy-Hitters in a Two-Dimensional Data Stream

    Full text link
    We consider online mining of correlated heavy-hitters from a data stream. Given a stream of two-dimensional data, a correlated aggregate query first extracts a substream by applying a predicate along a primary dimension, and then computes an aggregate along a secondary dimension. Prior work on identifying heavy-hitters in streams has almost exclusively focused on identifying heavy-hitters on a single dimensional stream, and these yield little insight into the properties of heavy-hitters along other dimensions. In typical applications however, an analyst is interested not only in identifying heavy-hitters, but also in understanding further properties such as: what other items appear frequently along with a heavy-hitter, or what is the frequency distribution of items that appear along with the heavy-hitters. We consider queries of the following form: In a stream S of (x, y) tuples, on the substream H of all x values that are heavy-hitters, maintain those y values that occur frequently with the x values in H. We call this problem as Correlated Heavy-Hitters (CHH). We formulate an approximate formulation of CHH identification, and present an algorithm for tracking CHHs on a data stream. The algorithm is easy to implement and uses workspace which is orders of magnitude smaller than the stream itself. We present provable guarantees on the maximum error, as well as detailed experimental results that demonstrate the space-accuracy trade-off

    Monitoring frequent items over distributed data streams.

    Get PDF
    Many important applications require the discovery of items which have occurred frequently. Knowledge of these items is commonly used in anomaly detection and network monitoring tasks. Effective solutions for this problem focus mainly on reducing memory requirements in a centralized environment. These solutions, however, ignore the inherently distributed nature of many systems. Naively forwarding data to a centralized location is not practical when dealing with high speed data streams and will result in significant communication overhead. This thesis proposes a new approach designed for continuously tracking frequent items over distributed data streams, providing either exact or approximate answers. The method introduced is a direct modification to an existing communication efficient algorithm called Top-K, Monitoring. Experimental results demonstrated that the proposed modifications significantly reduced communication cost and improved scalability. Also examined in this thesis is the applicability of frequent item monitoring at detecting distributed denial of service attacks. Simulation of the proposed tracking method against four different attack patterns was conducted. The outcome of these experiments showed promising results when compared to previous detection methods

    An efficient closed frequent itemset miner for the MOA stream mining system

    Get PDF
    Mining itemsets is a central task in data mining, both in the batch and the streaming paradigms. While robust, efficient, and well-tested implementations exist for batch mining, hardly any publicly available equivalent exists for the streaming scenario. The lack of an efficient, usable tool for the task hinders its use by practitioners and makes it difficult to assess new research in the area. To alleviate this situation, we review the algorithms described in the literature, and implement and evaluate the IncMine algorithm by Cheng, Ke, and Ng (2008) for mining frequent closed itemsets from data streams. Our implementation works on top of the MOA (Massive Online Analysis) stream mining framework to ease its use and integration with other stream mining tasks. We provide a PAC-style rigorous analysis of the quality of the output of IncMine as a function of its parameters; this type of analysis is rare in pattern mining algorithms. As a by-product, the analysis shows how one of the user-provided parameters in the original description can be removed entirely while retaining the performance guarantees. Finally, we experimentally confirm both on synthetic and real data the excellent performance of the algorithm, as reported in the original paper, and its ability to handle concept drift.Postprint (published version

    Efficient Summing over Sliding Windows

    Get PDF
    This paper considers the problem of maintaining statistic aggregates over the last W elements of a data stream. First, the problem of counting the number of 1's in the last W bits of a binary stream is considered. A lower bound of {\Omega}(1/{\epsilon} + log W) memory bits for W{\epsilon}-additive approximations is derived. This is followed by an algorithm whose memory consumption is O(1/{\epsilon} + log W) bits, indicating that the algorithm is optimal and that the bound is tight. Next, the more general problem of maintaining a sum of the last W integers, each in the range of {0,1,...,R}, is addressed. The paper shows that approximating the sum within an additive error of RW{\epsilon} can also be done using {\Theta}(1/{\epsilon} + log W) bits for {\epsilon}={\Omega}(1/W). For {\epsilon}=o(1/W), we present a succinct algorithm which uses B(1 + o(1)) bits, where B={\Theta}(Wlog(1/W{\epsilon})) is the derived lower bound. We show that all lower bounds generalize to randomized algorithms as well. All algorithms process new elements and answer queries in O(1) worst-case time.Comment: A shorter version appears in SWAT 201
    • …
    corecore