We consider online mining of correlated heavy-hitters from a data stream.
Given a stream of two-dimensional data, a correlated aggregate query first
extracts a substream by applying a predicate along a primary dimension, and
then computes an aggregate along a secondary dimension. Prior work on
identifying heavy-hitters in streams has almost exclusively focused on
identifying heavy-hitters on a single dimensional stream, and these yield
little insight into the properties of heavy-hitters along other dimensions. In
typical applications however, an analyst is interested not only in identifying
heavy-hitters, but also in understanding further properties such as: what other
items appear frequently along with a heavy-hitter, or what is the frequency
distribution of items that appear along with the heavy-hitters. We consider
queries of the following form: In a stream S of (x, y) tuples, on the substream
H of all x values that are heavy-hitters, maintain those y values that occur
frequently with the x values in H. We call this problem as Correlated
Heavy-Hitters (CHH). We formulate an approximate formulation of CHH
identification, and present an algorithm for tracking CHHs on a data stream.
The algorithm is easy to implement and uses workspace which is orders of
magnitude smaller than the stream itself. We present provable guarantees on the
maximum error, as well as detailed experimental results that demonstrate the
space-accuracy trade-off