19,388 research outputs found
Identifying Correlated Heavy-Hitters in a Two-Dimensional Data Stream
We consider online mining of correlated heavy-hitters from a data stream.
Given a stream of two-dimensional data, a correlated aggregate query first
extracts a substream by applying a predicate along a primary dimension, and
then computes an aggregate along a secondary dimension. Prior work on
identifying heavy-hitters in streams has almost exclusively focused on
identifying heavy-hitters on a single dimensional stream, and these yield
little insight into the properties of heavy-hitters along other dimensions. In
typical applications however, an analyst is interested not only in identifying
heavy-hitters, but also in understanding further properties such as: what other
items appear frequently along with a heavy-hitter, or what is the frequency
distribution of items that appear along with the heavy-hitters. We consider
queries of the following form: In a stream S of (x, y) tuples, on the substream
H of all x values that are heavy-hitters, maintain those y values that occur
frequently with the x values in H. We call this problem as Correlated
Heavy-Hitters (CHH). We formulate an approximate formulation of CHH
identification, and present an algorithm for tracking CHHs on a data stream.
The algorithm is easy to implement and uses workspace which is orders of
magnitude smaller than the stream itself. We present provable guarantees on the
maximum error, as well as detailed experimental results that demonstrate the
space-accuracy trade-off
Finding Subcube Heavy Hitters in Analytics Data Streams
Data streams typically have items of large number of dimensions. We study the
fundamental heavy-hitters problem in this setting. Formally, the data stream
consists of -dimensional items . A -dimensional
subcube is a subset of distinct coordinates . A subcube heavy hitter query , , outputs
YES if and NO if , where is the
ratio of number of stream items whose coordinates have joint values .
The all subcube heavy hitters query outputs all joint
values that return YES to . The one dimensional version
of this problem where was heavily studied in data stream theory,
databases, networking and signal processing. The subcube heavy hitters problem
is applicable in all these cases.
We present a simple reservoir sampling based one-pass streaming algorithm to
solve the subcube heavy hitters problem in space. This
is optimal up to poly-logarithmic factors given the established lower bound. In
the worst case, this is which is prohibitive for large
, and our goal is to circumvent this quadratic bottleneck.
Our main contribution is a model-based approach to the subcube heavy hitters
problem. In particular, we assume that the dimensions are related to each other
via the Naive Bayes model, with or without a latent dimension. Under this
assumption, we present a new two-pass, -space algorithm
for our problem, and a fast algorithm for answering in
time. Our work develops the direction of model-based data
stream analysis, with much that remains to be explored.Comment: To appear in WWW 201
Distributed Private Heavy Hitters
In this paper, we give efficient algorithms and lower bounds for solving the
heavy hitters problem while preserving differential privacy in the fully
distributed local model. In this model, there are n parties, each of which
possesses a single element from a universe of size N. The heavy hitters problem
is to find the identity of the most common element shared amongst the n
parties. In the local model, there is no trusted database administrator, and so
the algorithm must interact with each of the parties separately, using a
differentially private protocol. We give tight information-theoretic upper and
lower bounds on the accuracy to which this problem can be solved in the local
model (giving a separation between the local model and the more common
centralized model of privacy), as well as computationally efficient algorithms
even in the case where the data universe N may be exponentially large
Identifying correlated heavy-hitters in a two-dimensional data stream
We consider online mining of correlated heavy-hitters (CHH) from a data stream. Given a stream of two-dimensional data, a correlated aggregate query first extracts a substream by applying a predicate along a primary dimension, and then computes an aggregate along a secondary dimension. Prior work on identifying heavy-hitters in streams has almost exclusively focused on identifying heavy-hitters on a single dimensional stream, and these yield little insight into the properties of heavy-hitters along other dimensions. In typical applications however, an analyst is interested not only in identifying heavy-hitters, but also in understanding further properties such as: what other items appear frequently along with a heavy-hitter, or what is the frequency distribution of items that appear along with the heavy-hitters. We consider queries of the following form: “In a stream S of (x, y) tuples, on the substream H of all x values that are heavy-hitters, maintain those y values that occur frequently with the x values in H”. We call this problem as CHH. We formulate an approximate formulation of CHH identification, and present an algorithm for tracking CHHs on a data stream. The algorithm is easy to implement and uses workspace much smaller than the stream itself. We present provable guarantees on the maximum error, as well as detailed experimental results that demonstrate the space-accuracy trade-off
- …