Fast and Accurate Mining of Correlated Heavy Hitters
The problem of mining Correlated Heavy Hitters (CHHs) from a two-dimensional
data stream was introduced recently, and a deterministic algorithm based on
the Misra-Gries algorithm was proposed by Lahiri et al. to solve it. In this
paper we present a new counter-based algorithm for tracking CHHs, formally
prove its error bounds and correctness, and show, through extensive
experimental results, that our algorithm outperforms the Misra-Gries-based
algorithm in both accuracy and speed whilst requiring asymptotically much
less space.
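The baseline named in this abstract is the classic Misra-Gries summary. As background, here is a minimal Python sketch of that summary (the well-known frequent-items algorithm, not the paper's CHH algorithm, whose details are not given here):

```python
def misra_gries(stream, k):
    """Misra-Gries summary with k counters: any item whose true frequency
    exceeds n/(k+1) over a stream of length n is guaranteed to survive."""
    counters = {}
    for item in stream:
        if item in counters:
            counters[item] += 1
        elif len(counters) < k:
            counters[item] = 1
        else:
            # Sketch full: decrement every counter, evicting those at zero.
            for key in list(counters):
                counters[key] -= 1
                if counters[key] == 0:
                    del counters[key]
    return counters

stream = ["a"] * 50 + ["b"] * 30 + list("cdefghij")
summary = misra_gries(stream, k=2)
# Both heavy items "a" and "b" survive with only 2 counters.
```

Each retained counter undercounts its item's true frequency by at most n/(k+1), which is the error bound the Misra-Gries-based CHH algorithm inherits.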
Relative Errors for Deterministic Low-Rank Matrix Approximations
We consider processing an n x d matrix A in a stream with row-wise updates
according to a recent algorithm called Frequent Directions (Liberty, KDD 2013).
This algorithm maintains an l x d matrix Q deterministically, processing each
row in O(d l^2) time; the processing time can be decreased to O(d l) with a
slight modification of the algorithm and a constant-factor increase in space.
We show that if one sets l = k + k/eps and returns Q_k, the k x d matrix that
is the best rank-k approximation to Q, then the following properties hold:
||A - A_k||_F^2 <= ||A||_F^2 - ||Q_k||_F^2 <= (1+eps) ||A - A_k||_F^2, and,
letting pi_{Q_k}(A) denote the projection of A onto the row space of Q_k,
||A - pi_{Q_k}(A)||_F^2 <= (1+eps) ||A - A_k||_F^2.
We also show that Frequent Directions cannot be adapted in an obvious way to a
sparse version that retains l of the original rows of the matrix, as opposed
to a linear combination or sketch of the rows.
Comment: 16 pages, 0 figures
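For context, the Frequent Directions update analyzed above can be sketched in a few lines of NumPy. This is the textbook variant that shrinks by the smallest squared singular value each time the sketch fills; it is an illustration assuming l <= d, not the paper's modified O(d l)-time version:

```python
import numpy as np

def frequent_directions(A, ell):
    """Maintain an ell x d sketch Q of the rows of A such that
    ||A^T A - Q^T Q||_2 <= ||A||_F^2 / ell (assumes ell <= d)."""
    _, d = A.shape
    Q = np.zeros((ell, d))
    filled = 0  # number of nonzero rows currently in the sketch
    for row in A:
        if filled == ell:
            # Sketch full: SVD, then shrink every squared singular value
            # by the smallest one, which zeroes out the last row.
            _, s, Vt = np.linalg.svd(Q, full_matrices=False)
            s = np.sqrt(np.maximum(s**2 - s[-1] ** 2, 0.0))
            Q = s[:, None] * Vt
            filled = ell - 1  # the last row is now zero
        Q[filled] = row
        filled += 1
    return Q
```

The shrink step removes ell * delta of squared Frobenius mass per overflow while perturbing A^T A by at most delta in spectral norm, which is where the ||A||_F^2 / ell covariance-error guarantee comes from.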
Data Sketches for Disaggregated Subset Sum and Frequent Item Estimation
We introduce and study a new data sketch for processing massive datasets. It
addresses two common problems: 1) computing a sum given arbitrary filter
conditions and 2) identifying the frequent items or heavy hitters in a data
set. For the former, the sketch provides unbiased estimates with
state-of-the-art accuracy. It handles the challenging scenario in which the data is
disaggregated so that computing the per unit metric of interest requires an
expensive aggregation. For example, the metric of interest may be total clicks
per user while the raw data is a click stream with multiple rows per user. Thus
the sketch is suitable for use in a wide range of applications including
computing historical click-through rates for ad prediction, reporting user
metrics from event streams, and measuring network traffic for IP flows.
We prove, and empirically show, that the sketch has good properties for both the
disaggregated subset sum estimation and frequent item problems. On i.i.d. data,
it not only picks out the frequent items but gives strongly consistent
estimates for the proportion of each frequent item. The resulting sketch
asymptotically draws a probability proportional to size sample that is optimal
for estimating sums over the data. For non-i.i.d. data, we show that it
typically does much better than random sampling for the frequent item problem
and never does worse. For subset sum estimation, we show that even for
pathological sequences, the variance is close to that of an optimal sampling
design. Empirically, despite the disadvantage of operating on disaggregated
data, our method matches or bests priority sampling, a state-of-the-art method
for pre-aggregated data, and performs orders of magnitude better than uniform
sampling on skewed data. We propose extensions to the sketch that allow it
to be used for combining multiple data sets, in distributed systems, and for
time-decayed aggregation.
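Priority sampling, the pre-aggregated baseline named above, can be sketched as follows. This is an illustrative implementation of the standard Duffield-Lund-Thorup scheme, not the paper's new sketch:

```python
import random

def priority_sample(items, k, seed=0):
    """Priority sampling: keep the k items with the largest priority
    w / u, u ~ Uniform(0, 1]. Each kept item's weight is estimated as
    max(w, tau), where tau is the (k+1)-st largest priority; the sum of
    these estimates over any subset is unbiased for that subset's sum."""
    rng = random.Random(seed)
    prioritized = [(w / (1.0 - rng.random()), key, w) for key, w in items]
    prioritized.sort(reverse=True)
    tau = prioritized[k][0] if len(prioritized) > k else 0.0
    return {key: max(w, tau) for _, key, w in prioritized[:k]}, tau

def estimate_subset_sum(sample, predicate):
    # Sum the per-item estimates over the sampled items in the subset.
    return sum(est for key, est in sample.items() if predicate(key))
```

When k is at least the number of items, tau is zero and the estimates are exact, which is a handy sanity check; the interesting regime is k much smaller than the data.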
Parallel Streaming Frequency-Based Aggregates
We present efficient parallel streaming algorithms for fundamental frequency-based aggregates in both the sliding window and the infinite window settings. In the sliding window setting, we give a parallel algorithm for maintaining a space-bounded block counter (SBBC). Using SBBC, we derive algorithms for basic counting, frequency estimation, and heavy hitters that perform no more work than their best sequential counterparts. In the infinite window setting, we present algorithms for frequency estimation, heavy hitters, and the count-min sketch. For both the infinite window and sliding window settings, our parallel algorithms process a minibatch of items using linear work and polylogarithmic parallel depth. We also prove a lower bound showing that the work of the parallel algorithm is optimal for heavy hitters and frequency estimation. To our knowledge, these are the first parallel algorithms for these problems that are provably work-efficient and have low depth.
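Of the aggregates listed, the count-min sketch has the most compact sequential core. A minimal single-threaded Python version, for illustration only (the paper's contribution is the parallel, work-efficient variant), looks like:

```python
import random

class CountMinSketch:
    """Count-min sketch: depth rows of width counters. Point queries never
    underestimate, and overestimate by at most eps * N with probability
    1 - delta when width ~ e/eps and depth ~ ln(1/delta)."""

    def __init__(self, width, depth, seed=0):
        rng = random.Random(seed)
        self.width, self.depth = width, depth
        self.table = [[0] * width for _ in range(depth)]
        # Random salts stand in for pairwise-independent hash functions.
        self.salts = [rng.getrandbits(64) for _ in range(depth)]

    def _index(self, row, item):
        return hash((self.salts[row], item)) % self.width

    def add(self, item, count=1):
        for r in range(self.depth):
            self.table[r][self._index(r, item)] += count

    def query(self, item):
        # Minimum over rows: collisions only inflate counts, never deflate.
        return min(self.table[r][self._index(r, item)]
                   for r in range(self.depth))
```

Because updates to the counter table are commutative additions, this structure is a natural target for the minibatch parallelization the abstract describes.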