270 research outputs found

    Fast and Accurate Mining of Correlated Heavy Hitters

    The problem of mining Correlated Heavy Hitters (CHH) from a two-dimensional data stream has been introduced recently, and a deterministic algorithm based on the use of the Misra--Gries algorithm has been proposed by Lahiri et al. to solve it. In this paper we present a new counter-based algorithm for tracking CHHs, formally prove its error bounds and correctness, and show, through extensive experimental results, that our algorithm outperforms the Misra--Gries based algorithm with regard to accuracy and speed whilst requiring asymptotically much less space.
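For context, the one-dimensional Misra--Gries summary that the earlier approach is said to build on can be sketched as follows. This is a minimal illustration of the classic frequent-items summary only; it is not the new counter-based algorithm of this paper, and the function name `misra_gries` and the parameter `k` are assumptions made for the example.

```python
def misra_gries(stream, k):
    """Classic Misra-Gries summary using at most k - 1 counters.

    Each returned count underestimates the true frequency by at most
    n / k, where n is the stream length. Illustrative sketch only.
    """
    counters = {}
    for item in stream:
        if item in counters:
            # Item is already monitored: count it.
            counters[item] += 1
        elif len(counters) < k - 1:
            # A free counter is available: start monitoring the item.
            counters[item] = 1
        else:
            # No free counter: decrement every counter, drop those at zero.
            for key in list(counters):
                counters[key] -= 1
                if counters[key] == 0:
                    del counters[key]
    return counters


summary = misra_gries(list("aaabacad"), k=2)
```

With `k = 2` only one counter is kept, so only an item that makes up more than half of the stream is guaranteed to survive in the summary.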

    Identifying Correlated Heavy-Hitters in a Two-Dimensional Data Stream

    We consider online mining of correlated heavy-hitters from a data stream. Given a stream of two-dimensional data, a correlated aggregate query first extracts a substream by applying a predicate along a primary dimension, and then computes an aggregate along a secondary dimension. Prior work on identifying heavy-hitters in streams has almost exclusively focused on identifying heavy-hitters on a single-dimensional stream, and these yield little insight into the properties of heavy-hitters along other dimensions. In typical applications, however, an analyst is interested not only in identifying heavy-hitters, but also in understanding further properties such as: what other items appear frequently along with a heavy-hitter, or what is the frequency distribution of items that appear along with the heavy-hitters. We consider queries of the following form: In a stream S of (x, y) tuples, on the substream H of all x values that are heavy-hitters, maintain those y values that occur frequently with the x values in H. We call this problem Correlated Heavy-Hitters (CHH). We give an approximate formulation of CHH identification, and present an algorithm for tracking CHHs on a data stream. The algorithm is easy to implement and uses workspace which is orders of magnitude smaller than the stream itself. We present provable guarantees on the maximum error, as well as detailed experimental results that demonstrate the space-accuracy trade-off.
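The query form described above can be made concrete with an exact, offline reference computation. The sketch below stores the whole stream and counts exactly, so it is not the streaming algorithm of the paper. The thresholds `phi1` (fraction of the stream an x value must exceed to count as a heavy-hitter) and `phi2` (fraction of that x value's occurrences a y value must exceed) are assumptions introduced for illustration, since the abstract does not state them.

```python
from collections import Counter


def exact_chh(stream, phi1, phi2):
    """Exact correlated heavy-hitters over a stream of (x, y) tuples.

    Hypothetical thresholds: x is heavy if count(x) > phi1 * n, and y is
    reported with x if count(x, y) > phi2 * count(x).
    """
    stream = list(stream)
    n = len(stream)
    x_counts = Counter(x for x, _ in stream)  # frequency along the primary dimension
    pair_counts = Counter(stream)             # frequency of each (x, y) pair
    result = {}
    for (x, y), c in pair_counts.items():
        fx = x_counts[x]
        if fx > phi1 * n and c > phi2 * fx:
            result.setdefault(x, set()).add(y)
    return result


example = [("a", 1)] * 6 + [("a", 2)] * 2 + [("b", 1)] * 2
chh = exact_chh(example, phi1=0.5, phi2=0.5)
```

In this example `"a"` accounts for 8 of the 10 tuples and `1` accompanies it in 6 of those 8, so only the pair `("a", 1)` is reported. A streaming algorithm would approximate this result within a bounded error while using far less memory.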

    Conditional heavy hitters : detecting interesting correlations in data streams

    The notion of heavy hitters, items that make up a large fraction of the population, has been successfully used in a variety of applications across sensor and RFID monitoring, network data analysis, event mining, and more. Yet this notion often fails to capture the semantics we desire when we observe data in the form of correlated pairs. Here, we are interested in items that are conditionally frequent: when a particular item is frequent within the context of its parent item. In this work, we introduce and formalize the notion of conditional heavy hitters to identify such items, with applications in network monitoring and Markov chain modeling. We explore the relationship between conditional heavy hitters and other related notions in the literature, and show analytically and experimentally the usefulness of our approach. We introduce several algorithm variations that allow us to efficiently find conditional heavy hitters for input data with very different characteristics, and provide analytical results for their performance. Finally, we perform experimental evaluations with several synthetic and real datasets to demonstrate the efficacy of our methods and to study the behavior of the proposed algorithms for different types of data.

    Approximate Sparse Recovery: Optimizing Time and Measurements

    An approximate sparse recovery system consists of parameters $k, N$, an $m$-by-$N$ measurement matrix, $\Phi$, and a decoding algorithm, $\mathcal{D}$. Given a vector, $x$, the system approximates $x$ by $\widehat x = \mathcal{D}(\Phi x)$, which must satisfy $\| \widehat x - x\|_2 \le C \|x - x_k\|_2$, where $x_k$ denotes the optimal $k$-term approximation to $x$. For each vector $x$, the system must succeed with probability at least 3/4. Among the goals in designing such systems are minimizing the number $m$ of measurements and the runtime of the decoding algorithm, $\mathcal{D}$. In this paper, we give a system with $m = O(k \log(N/k))$ measurements--matching a lower bound, up to a constant factor--and decoding time $O(k \log^c N)$, matching a lower bound up to $\log(N)$ factors. We also consider the encode time (i.e., the time to multiply $\Phi$ by $x$), the time to update measurements (i.e., the time to multiply $\Phi$ by a 1-sparse $x$), and the robustness and stability of the algorithm (adding noise before and after the measurements). Our encode and update times are optimal up to $\log(N)$ factors.
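The recovery guarantee stated in this abstract compares the error of an estimate with the error of the optimal k-term approximation. The sketch below illustrates only that check, in plain Python: it computes the best k-term approximation by keeping the k largest-magnitude entries and tests whether a candidate estimate meets the l2 bound. It does not implement the measurement matrix or the decoder, and the helper names (`best_k_term`, `satisfies_guarantee`) are assumptions made for the example.

```python
import math


def best_k_term(x, k):
    """Keep the k largest-magnitude entries of x and zero out the rest."""
    order = sorted(range(len(x)), key=lambda i: abs(x[i]), reverse=True)
    keep = set(order[:k])
    return [v if i in keep else 0.0 for i, v in enumerate(x)]


def l2_norm(v):
    """Euclidean norm of a list of numbers."""
    return math.sqrt(sum(t * t for t in v))


def satisfies_guarantee(x, x_hat, k, C):
    """Check ||x_hat - x||_2 <= C * ||x - x_k||_2 for a candidate x_hat."""
    x_k = best_k_term(x, k)
    err = l2_norm([a - b for a, b in zip(x_hat, x)])
    tail = l2_norm([a - b for a, b in zip(x, x_k)])
    return err <= C * tail


x = [5.0, 0.1, -4.0, 0.2]
good = satisfies_guarantee(x, best_k_term(x, 2), k=2, C=2.0)
bad = satisfies_guarantee(x, [0.0, 0.0, 0.0, 0.0], k=2, C=2.0)
```

Here the best 2-term approximation itself trivially meets the bound, while the all-zero estimate does not, because it misses the two large entries that dominate the vector.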