4,257 research outputs found

    Improved Algorithms for Time Decay Streams

    Get PDF
    In the time-decay model for data streams, elements of an underlying data set arrive sequentially with the recently arrived elements being more important. A common approach for handling large data sets is to maintain a coreset, a succinct summary of the processed data that allows approximate recovery of a predetermined query. We provide a general framework that takes any offline-coreset and gives a time-decay coreset for polynomial time decay functions. We also consider the exponential time decay model for k-median clustering, where we provide a constant factor approximation algorithm that utilizes the online facility location algorithm. Our algorithm stores O(k log(h Delta)+h) points where h is the half-life of the decay function and Delta is the aspect ratio of the dataset. Our techniques extend to k-means clustering and M-estimators as well

    Efficient Summing over Sliding Windows

    Get PDF
    This paper considers the problem of maintaining statistic aggregates over the last W elements of a data stream. First, the problem of counting the number of 1's in the last W bits of a binary stream is considered. A lower bound of {\Omega}(1/{\epsilon} + log W) memory bits for W{\epsilon}-additive approximations is derived. This is followed by an algorithm whose memory consumption is O(1/{\epsilon} + log W) bits, indicating that the algorithm is optimal and that the bound is tight. Next, the more general problem of maintaining a sum of the last W integers, each in the range of {0,1,...,R}, is addressed. The paper shows that approximating the sum within an additive error of RW{\epsilon} can also be done using {\Theta}(1/{\epsilon} + log W) bits for {\epsilon}={\Omega}(1/W). For {\epsilon}=o(1/W), we present a succinct algorithm which uses B(1 + o(1)) bits, where B={\Theta}(Wlog(1/W{\epsilon})) is the derived lower bound. We show that all lower bounds generalize to randomized algorithms as well. All algorithms process new elements and answer queries in O(1) worst-case time.Comment: A shorter version appears in SWAT 201

    Identifying Correlated Heavy-Hitters in a Two-Dimensional Data Stream

    Full text link
    We consider online mining of correlated heavy-hitters from a data stream. Given a stream of two-dimensional data, a correlated aggregate query first extracts a substream by applying a predicate along a primary dimension, and then computes an aggregate along a secondary dimension. Prior work on identifying heavy-hitters in streams has almost exclusively focused on identifying heavy-hitters on a single dimensional stream, and these yield little insight into the properties of heavy-hitters along other dimensions. In typical applications however, an analyst is interested not only in identifying heavy-hitters, but also in understanding further properties such as: what other items appear frequently along with a heavy-hitter, or what is the frequency distribution of items that appear along with the heavy-hitters. We consider queries of the following form: In a stream S of (x, y) tuples, on the substream H of all x values that are heavy-hitters, maintain those y values that occur frequently with the x values in H. We call this problem as Correlated Heavy-Hitters (CHH). We formulate an approximate formulation of CHH identification, and present an algorithm for tracking CHHs on a data stream. The algorithm is easy to implement and uses workspace which is orders of magnitude smaller than the stream itself. We present provable guarantees on the maximum error, as well as detailed experimental results that demonstrate the space-accuracy trade-off

    Optimal Elephant Flow Detection

    Full text link
    Monitoring the traffic volumes of elephant flows, including the total byte count per flow, is a fundamental capability for online network measurements. We present an asymptotically optimal algorithm for solving this problem in terms of both space and time complexity. This improves on previous approaches, which can only count the number of packets in constant time. We evaluate our work on real packet traces, demonstrating an up to X2.5 speedup compared to the best alternative.Comment: Accepted to IEEE INFOCOM 201

    Knowledge discovery in data streams

    Full text link
    Knowing what to do with the massive amount of data collected has always been an ongoing issue for many organizations. While data mining has been touted to be the solution, it has failed to deliver the impact despite its successes in many areas. One reason is that data mining algorithms were not designed for the real world, i.e., they usually assume a static view of the data and a stable execution environment where resources are abundant. The reality however is that data are constantly changing and the execution environment is dynamic. Hence, it becomes difficult for data mining to truly deliver timely and relevant results. Recently, the processing of stream data has received many attention. What is interesting is that the methodology to design stream-based algorithms may well be the solution to the above problem. In this entry, we discuss this issue and present an overview of recent works
    • …
    corecore