Search CORE

4,126 research outputs found

Identifying Correlated Heavy-Hitters in a Two-Dimensional Data Stream

Author: Lahiri Bibudh
Mukherjee Arko Provo
Tirthapura Srikanta
Publication venue
Publication date: 03/10/2013
Field of study

We consider online mining of correlated heavy-hitters from a data stream. Given a stream of two-dimensional data, a correlated aggregate query first extracts a substream by applying a predicate along a primary dimension, and then computes an aggregate along a secondary dimension. Prior work on identifying heavy-hitters in streams has almost exclusively focused on identifying heavy-hitters on a single dimensional stream, and these yield little insight into the properties of heavy-hitters along other dimensions. In typical applications however, an analyst is interested not only in identifying heavy-hitters, but also in understanding further properties such as: what other items appear frequently along with a heavy-hitter, or what is the frequency distribution of items that appear along with the heavy-hitters. We consider queries of the following form: In a stream S of (x, y) tuples, on the substream H of all x values that are heavy-hitters, maintain those y values that occur frequently with the x values in H. We call this problem as Correlated Heavy-Hitters (CHH). We formulate an approximate formulation of CHH identification, and present an algorithm for tracking CHHs on a data stream. The algorithm is easy to implement and uses workspace which is orders of magnitude smaller than the stream itself. We present provable guarantees on the maximum error, as well as detailed experimental results that demonstrate the space-accuracy trade-off

arXiv.org e-Print Archive

CiteSeerX

Data Sketches for Disaggregated Subset Sum and Frequent Item Estimation

Author: Demaine E. D.
Mitzenmacher M.
Publication venue
Publication date: 12/09/2017
Field of study

We introduce and study a new data sketch for processing massive datasets. It addresses two common problems: 1) computing a sum given arbitrary filter conditions and 2) identifying the frequent items or heavy hitters in a data set. For the former, the sketch provides unbiased estimates with state of the art accuracy. It handles the challenging scenario when the data is disaggregated so that computing the per unit metric of interest requires an expensive aggregation. For example, the metric of interest may be total clicks per user while the raw data is a click stream with multiple rows per user. Thus the sketch is suitable for use in a wide range of applications including computing historical click through rates for ad prediction, reporting user metrics from event streams, and measuring network traffic for IP flows. We prove and empirically show the sketch has good properties for both the disaggregated subset sum estimation and frequent item problems. On i.i.d. data, it not only picks out the frequent items but gives strongly consistent estimates for the proportion of each frequent item. The resulting sketch asymptotically draws a probability proportional to size sample that is optimal for estimating sums over the data. For non i.i.d. data, we show that it typically does much better than random sampling for the frequent item problem and never does worse. For subset sum estimation, we show that even for pathological sequences, the variance is close to that of an optimal sampling design. Empirically, despite the disadvantage of operating on disaggregated data, our method matches or bests priority sampling, a state of the art method for pre-aggregated data and performs orders of magnitude better on skewed data compared to uniform sampling. We propose extensions to the sketch that allow it to be used in combining multiple data sets, in distributed systems, and for time decayed aggregation

arXiv.org e-Print Archive

Crossref

Parallel mining of time-faded heavy hitters

Author: Cafaro Massimo
Epicoco Italo
Pulimeno Marco
Publication venue: 'Elsevier BV'
Publication date: 11/01/2017
Field of study

In this paper we present PFDCMSS (Parallel Forward Decay Count-Min Space Saving) which, to the best of our knowledge, is the world first message-passing parallel algorithm for mining time-faded heavy hitters. The algorithm is a parallel version of the recently published FDCMSS (Forward Decay Count-Min Space Saving) sequential algorithm. We formally prove its correctness by showing that the underlying data structure, a sketch augmented with a Space Saving stream summary holding exactly two counters, is mergeable. Whilst mergeability of traditional sketches derives immediately from theory, we show that, instead, merging our augmented sketch is non trivial. Nonetheless, the resulting parallel algorithm is fast and simple to implement. The very large volumes of modern datasets in the context of Big Data present new challenges that current sequential algorithms can not cope with; on the contrary, parallel computing enables near real time processing of very large datasets, which are growing at an unprecedented scale. Our algorithm's implementation, taking advantage of the MPI (Message Passing Interface) library, is portable, reliable and provides cutting-edge performance. Extensive experimental results confirm that PFDCMSS retains the extreme accuracy and error bound provided by FDCMSS whilst providing excellent parallel scalability. Our contributions are three-fold: (i) we prove the non trivial mergeability of the augmented sketch used in the FDCMSS algorithm; (ii) we derive PFDCMSS, a novel message-passing parallel algorithm; (iii) we experimentally prove that PFDCMSS is extremely accurate and scalable, allowing near real time processing of large datasets. The result supports both casual users and seasoned, professional scientists working on expert and intelligent systems

arXiv.org e-Print Archive

Archivio Istituzionale della Ricerca- Università del Salento

Distributed mining of time--faded heavy hitters

Author: Cafaro Massimo
Epicoco Italo
Pulimeno Marco
Publication venue
Publication date: 01/12/2018
Field of study

We present \textsc{P2PTFHH} (Peer--to--Peer Time--Faded Heavy Hitters) which, to the best of our knowledge, is the first distributed algorithm for mining time--faded heavy hitters on unstructured P2P networks. \textsc{P2PTFHH} is based on the \textsc{FDCMSS} (Forward Decay Count--Min Space-Saving) sequential algorithm, and efficiently exploits an averaging gossip protocol, by merging in each interaction the involved peers' underlying data structures. We formally prove the convergence and correctness properties of our distributed algorithm and show that it is fast and simple to implement. Extensive experimental results confirm that \textsc{P2PTFHH} retains the extreme accuracy and error bound provided by \textsc{FDCMSS} whilst showing excellent scalability. Our contributions are three-fold: (i) we prove that the averaging gossip protocol can be used jointly with our augmented sketch data structure for mining time--faded heavy hitters; (ii) we prove the error bounds on frequency estimation; (iii) we experimentally prove that \textsc{P2PTFHH} is extremely accurate and fast, allowing near real time processing of large datasets.Comment: arXiv admin note: text overlap with arXiv:1806.0658

arXiv.org e-Print Archive

Servicio de Difusión de la Creación Intelectual

Fast and Accurate Mining of Correlated Heavy Hitters

Author: Cafaro Massimo
Epicoco Italo
Pulimeno Marco
Publication venue
Publication date: 06/04/2017
Field of study

The problem of mining Correlated Heavy Hitters (CHH) from a two-dimensional data stream has been introduced recently, and a deterministic algorithm based on the use of the Misra--Gries algorithm has been proposed by Lahiri et al. to solve it. In this paper we present a new counter-based algorithm for tracking CHHs, formally prove its error bounds and correctness and show, through extensive experimental results, that our algorithm outperforms the Misra--Gries based algorithm with regard to accuracy and speed whilst requiring asymptotically much less space

arXiv.org e-Print Archive

Archivio Istituzionale della Ricerca- Università del Salento

Mining Frequent Item sets in Data Streams

Author: Dass Rajanish
Publication venue
Publication date
Field of study

Research Papers in Economics

on frequency estimation and detection of frequent items in time faded streams

Author: Giovanni Aloisio
Italo Epicoco
Marco Pulimeno
Massimo Cafaro
Publication venue
Publication date: 01/01/2017
Field of study

We deal with the problem of detecting frequent items in a stream under the constraint that items are weighted, and recent items must be weighted more than older ones. This kind of problem naturally arises in a wide class of applications in which recent data is considered more useful and valuable with regard to older, stale data. The weight assigned to an item is, therefore, a function of its arrival timestamp. As a consequence, whilst in traditional frequent item mining applications we need to estimate frequency counts, we are instead required to estimate decayed counts . These applications are said to work in the time fading model. Two sketch-based algorithms for processing time-decayed streams have been recently published independently near the end of 2016. The Filtered Space Saving with Quasi-Heap (FSSQ) algorithm, besides a sketch, also uses an additional data structure called quasi-heap to maintain frequent items. Forward Decay Count-Min Space Saving (FDCMSS), our algorithm, cleverly combines key ideas borrowed from forward decay, the Count-Min sketch and the Space Saving algorithm. Therefore, it makes sense to compare and contrast the two algorithms in order to fully understand their strengths and weaknesses. We show, through extensive experimental results, that FSSQ is better for detecting frequent items than for frequency estimation. The use of the quasi-heap data structure slows down the algorithm owing to the huge number of maintenance operations. Therefore, FSSQ may not be able to cope with high-speed data streams. FDCMSS is better suitable for frequency estimation; moreover, it is extremely fast and can be used in the context of high-speed data streams and for the detection of frequent items as well, since its recall is always greater than 99%, even when using an extremely tiny amount of space. Therefore, FDCMSS proves to be an overall good choice when considering jointly the recall, precision, average relative error and the speed

Open Access Repository

Archivio Istituzionale della Ricerca- Università del Salento