
Fast and Accurate Mining of Correlated Heavy Hitters

Authors: Cafaro, Massimo; Epicoco, Italo; Pulimeno, Marco
Publication date: 06/04/2017

The problem of mining Correlated Heavy Hitters (CHH) from a two-dimensional data stream was introduced recently, and Lahiri et al. proposed a deterministic algorithm based on the Misra-Gries algorithm to solve it. In this paper we present a new counter-based algorithm for tracking CHHs, formally prove its error bounds and correctness, and show through extensive experimental results that our algorithm outperforms the Misra-Gries-based algorithm in accuracy and speed while requiring asymptotically much less space.
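For orientation, the classic one-dimensional Misra-Gries frequent-items summary that the abstract refers to can be sketched as follows. This is a minimal illustration of the textbook algorithm only; it is not the CHH algorithm of either paper, and the function name `misra_gries` and the parameter `k` are assumptions made for this example.

```python
def misra_gries(stream, k):
    """Textbook Misra-Gries summary using at most k - 1 counters.

    Any item occurring more than n / k times in a stream of length n is
    guaranteed to remain in the summary, and each stored count
    underestimates the true frequency by at most n / k.
    """
    counters = {}
    for item in stream:
        if item in counters:
            counters[item] += 1
        elif len(counters) < k - 1:
            counters[item] = 1
        else:
            # No free counter: decrement every counter, dropping zeros.
            for key in list(counters):
                counters[key] -= 1
                if counters[key] == 0:
                    del counters[key]
    return counters


# Hypothetical toy stream: "a" occurs 6 times out of n = 12.
stream = ["a"] * 6 + ["b"] * 3 + ["c", "d", "e"]
summary = misra_gries(stream, k=3)
```

With `k=3` the summary holds at most two counters, and "a" (frequency 6 > 12/3) is retained with an underestimated count.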

arXiv.org e-Print Archive

Archivio Istituzionale della Ricerca - Università del Salento

Identifying Correlated Heavy-Hitters in a Two-Dimensional Data Stream

Authors: Lahiri, Bibudh; Mukherjee, Arko Provo; Tirthapura, Srikanta
Publication date: 03/10/2013

We consider online mining of correlated heavy-hitters from a data stream. Given a stream of two-dimensional data, a correlated aggregate query first extracts a substream by applying a predicate along a primary dimension, and then computes an aggregate along a secondary dimension. Prior work on identifying heavy-hitters in streams has focused almost exclusively on single-dimensional streams, and it yields little insight into the properties of heavy-hitters along other dimensions. In typical applications, however, an analyst is interested not only in identifying heavy-hitters, but also in understanding further properties, such as which other items appear frequently along with a heavy-hitter, or what the frequency distribution is of items that appear along with the heavy-hitters. We consider queries of the following form: in a stream S of (x, y) tuples, on the substream H of all x values that are heavy-hitters, maintain those y values that occur frequently with the x values in H. We call this problem Correlated Heavy-Hitters (CHH). We give an approximate formulation of CHH identification and present an algorithm for tracking CHHs on a data stream. The algorithm is easy to implement and uses workspace that is orders of magnitude smaller than the stream itself. We present provable guarantees on the maximum error, as well as detailed experimental results that demonstrate the space-accuracy trade-off.
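The query described above can be made concrete with an exact, non-streaming baseline. The sketch below is not the paper's streaming algorithm: it stores the whole stream and counts exactly, which is what a small-space algorithm is designed to avoid. The thresholds `phi1` and `phi2` and the function name `exact_chh` are assumptions introduced for illustration: an x value is treated as a heavy-hitter if it accounts for more than a `phi1` fraction of the stream, and a pair (x, y) is reported if y accounts for more than a `phi2` fraction of the tuples containing that x.

```python
from collections import Counter


def exact_chh(pairs, phi1, phi2):
    """Exact correlated heavy-hitters of a finite list of (x, y) tuples.

    Returns a dict mapping each reported (x, y) pair to its exact count.
    """
    pairs = list(pairs)
    n = len(pairs)
    fx = Counter(x for x, _ in pairs)   # frequency of each x
    fxy = Counter(pairs)                # frequency of each (x, y) pair
    heavy_x = {x for x, count in fx.items() if count > phi1 * n}
    return {
        (x, y): count
        for (x, y), count in fxy.items()
        if x in heavy_x and count > phi2 * fx[x]
    }


# Hypothetical toy stream of 8 tuples.
pairs = [("a", 1)] * 4 + [("a", 2)] + [("b", 1)] * 2 + [("c", 3)]
result = exact_chh(pairs, phi1=0.5, phi2=0.5)
```

Here only "a" is a heavy-hitter along the primary dimension (5 of 8 tuples), and among its tuples only y = 1 is frequent (4 of 5), so the single reported pair is `("a", 1)`.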

arXiv.org e-Print Archive

CiteSeerX

Conditional heavy hitters: detecting interesting correlations in data streams

Authors: Cormode, Graham; Mirylenka, Katsiaryna; Palpanas, Themis; Srivastava, Divesh
Publication venue: Springer Science and Business Media LLC
Publication date: 26/02/2015

The notion of heavy hitters (items that make up a large fraction of the population) has been successfully used in a variety of applications across sensor and RFID monitoring, network data analysis, event mining, and more. Yet this notion often fails to capture the semantics we desire when we observe data in the form of correlated pairs. Here, we are interested in items that are conditionally frequent, that is, items that are frequent within the context of their parent item. In this work, we introduce and formalize the notion of conditional heavy hitters to identify such items, with applications in network monitoring and Markov chain modeling. We explore the relationship between conditional heavy hitters and other related notions in the literature, and show analytically and experimentally the usefulness of our approach. We introduce several algorithm variations that allow us to efficiently find conditional heavy hitters for input data with very different characteristics, and provide analytical results for their performance. Finally, we perform experimental evaluations with several synthetic and real datasets to demonstrate the efficacy of our methods and to study the behavior of the proposed algorithms for different types of data.

Warwick Research Archives Portal Repository

Approximate Sparse Recovery: Optimizing Time and Measurements

Authors: Gilbert, Anna C.; Li, Yi; Porat, Ely; Strauss, Martin J.
Publication date: 01/12/2009

An approximate sparse recovery system consists of parameters $k, N$, an $m$-by-$N$ measurement matrix, $\Phi$, and a decoding algorithm, $\mathcal{D}$. Given a vector, $x$, the system approximates $x$ by $\widehat x = \mathcal{D}(\Phi x)$, which must satisfy $\| \widehat x - x\|_2 \le C \|x - x_k\|_2$, where $x_k$ denotes the optimal $k$-term approximation to $x$. For each vector $x$, the system must succeed with probability at least 3/4. Among the goals in designing such systems are minimizing the number $m$ of measurements and the runtime of the decoding algorithm, $\mathcal{D}$. In this paper, we give a system with $m = O(k \log(N/k))$ measurements (matching a lower bound, up to a constant factor) and decoding time $O(k \log^c N)$, matching a lower bound up to $\log(N)$ factors. We also consider the encode time (i.e., the time to multiply $\Phi$ by $x$), the time to update measurements (i.e., the time to multiply $\Phi$ by a 1-sparse $x$), and the robustness and stability of the algorithm (adding noise before and after the measurements). Our encode and update times are optimal up to $\log(N)$ factors.
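The error criterion in this abstract compares the recovery error against the tail left by the best k-term approximation. The sketch below only checks that criterion for a given candidate reconstruction; it does not implement the paper's measurement matrix or decoder. The function names, the constant `C`, and the toy vectors are assumptions made for this example.

```python
import math


def best_k_term(x, k):
    """Keep the k largest-magnitude entries of x and zero out the rest."""
    order = sorted(range(len(x)), key=lambda i: abs(x[i]), reverse=True)
    keep = set(order[:k])
    return [x[i] if i in keep else 0.0 for i in range(len(x))]


def l2_norm(v):
    """Euclidean norm of a list of numbers."""
    return math.sqrt(sum(t * t for t in v))


def satisfies_guarantee(x, x_hat, k, C):
    """Check whether ||x_hat - x||_2 <= C * ||x - x_k||_2 holds."""
    x_k = best_k_term(x, k)
    tail = l2_norm([a - b for a, b in zip(x, x_k)])
    error = l2_norm([a - b for a, b in zip(x_hat, x)])
    return error <= C * tail


# Hypothetical signal with two dominant entries and a small tail.
x = [10.0, 0.1, -8.0, 0.2]
good = satisfies_guarantee(x, best_k_term(x, 2), k=2, C=2.0)
bad = satisfies_guarantee(x, [0.0, 0.0, 0.0, 0.0], k=2, C=2.0)
```

Returning the best 2-term approximation itself meets the bound, while the all-zero reconstruction does not, since its error is dominated by the two large entries.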

arXiv.org e-Print Archive

CiteSeerX