18 research outputs found
AMS Without 4-Wise Independence on Product Domains
In their seminal work, Alon, Matias, and Szegedy introduced several sketching
techniques, including showing that 4-wise independence is sufficient to obtain
good approximations of the second frequency moment. In this work, we show that
their sketching technique can be extended to product domains by using
the product of 4-wise independent functions on . Our work extends that of
Indyk and McGregor, who showed the result for . Their primary motivation
was the problem of identifying correlations in data streams. In their model, a
stream of pairs arrive, giving a joint distribution ,
and they find approximation algorithms for how close the joint distribution is
to the product of the marginal distributions under various metrics, which
naturally corresponds to how close and are to being independent. By
using our technique, we obtain a new result for the problem of approximating
the distance between the joint distribution and the product of the
marginal distributions for -ary vectors, instead of just pairs, in a single
pass. Our analysis gives a randomized algorithm that is a
approximation (with probability ) that requires space logarithmic in
and and proportional to
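For readers unfamiliar with the underlying technique, here is a minimal sketch of the classic single-domain AMS "tug-of-war" estimator for the second frequency moment. It is not the product-domain extension described in the abstract above. The class name `AMSSketch`, the parameters `groups` and `per_group`, the field size `PRIME`, and the low-bit sign mapping are all illustrative assumptions; because `PRIME` is odd, the low-bit mapping leaves a negligible bias in the signs. Items are assumed to be non-negative integers.

```python
import random
import statistics

# Mersenne prime used as the field size for the hash polynomials (illustrative choice).
PRIME = (1 << 61) - 1


class AMSSketch:
    """Tug-of-war estimator for the second frequency moment F2 = sum_i f_i^2."""

    def __init__(self, groups=5, per_group=16, seed=0):
        rng = random.Random(seed)
        self.groups = groups
        self.per_group = per_group
        # One random degree-3 polynomial per counter; its values over the
        # field are 4-wise independent.
        self.coeffs = [
            [rng.randrange(PRIME) for _ in range(4)]
            for _ in range(groups * per_group)
        ]
        self.counters = [0] * (groups * per_group)

    def _sign(self, coeffs, item):
        # Horner evaluation mod PRIME, then map the low bit to +1 or -1
        # (assumed mapping; very slightly biased because PRIME is odd).
        acc = 0
        for c in coeffs:
            acc = (acc * item + c) % PRIME
        return 1 if acc & 1 else -1

    def update(self, item, delta=1):
        # Each counter maintains sum_i sign(i) * f_i, so updates are linear.
        for idx, coeffs in enumerate(self.coeffs):
            self.counters[idx] += delta * self._sign(coeffs, item)

    def estimate(self):
        # Median of group means of the squared counters.
        squares = [z * z for z in self.counters]
        means = [
            statistics.mean(squares[g * self.per_group:(g + 1) * self.per_group])
            for g in range(self.groups)
        ]
        return statistics.median(means)
```

On a stream containing a single distinct item with frequency f, every counter equals plus or minus f, so the estimate is exactly f squared. With many distinct items the estimate is only approximate, and its accuracy depends on `groups` and `per_group`.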
Finding Subcube Heavy Hitters in Analytics Data Streams
Data streams typically have items with a large number of dimensions. We study the
fundamental heavy-hitters problem in this setting. Formally, the data stream
consists of -dimensional items . A -dimensional
subcube is a subset of distinct coordinates . A subcube heavy hitter query , , outputs
YES if and NO if , where is the
ratio of number of stream items whose coordinates have joint values .
The all subcube heavy hitters query outputs all joint
values that return YES to . The one dimensional version
of this problem where was heavily studied in data stream theory,
databases, networking and signal processing. The subcube heavy hitters problem
is applicable in all these cases.
We present a simple reservoir sampling based one-pass streaming algorithm to
solve the subcube heavy hitters problem in space. This
is optimal up to poly-logarithmic factors given the established lower bound. In
the worst case, this is which is prohibitive for large
, and our goal is to circumvent this quadratic bottleneck.
Our main contribution is a model-based approach to the subcube heavy hitters
problem. In particular, we assume that the dimensions are related to each other
via the Naive Bayes model, with or without a latent dimension. Under this
assumption, we present a new two-pass, -space algorithm
for our problem, and a fast algorithm for answering in
time. Our work develops the direction of model-based data
stream analysis, with much that remains to be explored. Comment: To appear in WWW 201
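As a rough illustration of the reservoir-sampling baseline mentioned in this abstract, the sketch below keeps a uniform sample of the stream and estimates the frequency ratio of a subcube's joint values from that sample. It is not the authors' algorithm: the class name `SubcubeReservoir`, the `capacity` parameter, and the `frequency` method are hypothetical. The sample size needed for a given guarantee and the YES/NO thresholds are left out because the abstract text does not preserve them.

```python
import random


class SubcubeReservoir:
    """Uniform reservoir sample of d-dimensional items (Algorithm R)."""

    def __init__(self, capacity, seed=0):
        self.capacity = capacity
        self.sample = []
        self.seen = 0
        self.rng = random.Random(seed)

    def update(self, item):
        # Keep the first `capacity` items; afterwards, the i-th item replaces
        # a uniformly chosen slot with probability capacity / i.
        self.seen += 1
        if len(self.sample) < self.capacity:
            self.sample.append(tuple(item))
        else:
            j = self.rng.randrange(self.seen)
            if j < self.capacity:
                self.sample[j] = tuple(item)

    def frequency(self, coords, values):
        # Fraction of sampled items whose coordinates `coords` take the
        # joint values `values`; an estimate of the true stream ratio.
        if not self.sample:
            return 0.0
        hits = sum(
            1 for item in self.sample
            if all(item[c] == v for c, v in zip(coords, values))
        )
        return hits / len(self.sample)
```

A subcube heavy hitter query would compare `frequency(coords, values)` against a caller-chosen threshold. When the stream is no longer than `capacity`, the sample is the whole stream and the ratio is exact.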
Approximating Subadditive Hadamard Functions on Implicit Matrices
An important challenge in the streaming model is to maintain small-space
approximations of entrywise functions performed on a matrix that is generated
by the outer product of two vectors given as a stream. In other works, streams
typically define matrices in a standard way via a sequence of updates, as in
the work of Woodruff (2014) and others. We describe the matrix formed by the
outer product, and other matrices that do not fall into this category, as
implicit matrices. As such, we consider the general problem of computing over
such implicit matrices with Hadamard functions, which are functions applied
entrywise on a matrix. In this paper, we apply this generalization to provide
new techniques for identifying independence between two vectors in the
streaming model. The previous state-of-the-art algorithm of Braverman and
Ostrovsky (2010) gave a -approximation for the distance
between the product and joint distributions, using space , where is the length of the stream and denotes the
size of the universe from which stream elements are drawn. Our general
techniques include the distance as a special case, and we give an
improved space bound of
Differentially Private Fractional Frequency Moments Estimation with Polylogarithmic Space
We prove that Fp sketch, a well-celebrated streaming algorithm for frequency moments estimation, is differentially private as is when p ∈ (0, 1]. Fp sketch uses only polylogarithmic space, exponentially better than existing DP baselines and only worse than the optimal non-private baseline by a logarithmic factor. The evaluation shows that Fp sketch can achieve reasonable accuracy with differential privacy guarantee. The evaluation code is included in the supplementary material
Correlation clustering in data streams
In this paper, we address the problem of correlation clustering in the dynamic data stream model. The stream consists of updates to the edge weights of a graph on n nodes and the goal is to find a node-partition such that the end-points of negative-weight edges are typically in different clusters whereas the end-points of positive-weight edges are typically in the same cluster. We present polynomial-time, O(n·polylog n)-space approximation algorithms for natural problems that arise. We first develop data structures based on linear sketches that allow the “quality” of a given node-partition to be measured. We then combine these data structures with convex programming and sampling techniques to solve the relevant approximation problem. However, the standard LP and SDP formulations are not obviously solvable in O(n·polylog n)-space. Our work presents space-efficient algorithms for the convex programming required, as well as approaches to reduce the adaptivity of the sampling. Note that the improved space and running-time bounds achieved from streaming algorithms are also useful for offline settings such as MapReduce models
Private Data Stream Analysis for Universal Symmetric Norm Estimation
We study how to release summary statistics on a data stream subject to the constraint of differential privacy. In particular, we focus on releasing the family of symmetric norms, which are invariant under sign-flips and coordinate-wise permutations on an input data stream and include L_p norms, k-support norms, top-k norms, and the box norm as special cases. Although it may be possible to design and analyze a separate mechanism for each symmetric norm, we propose a general parametrizable framework that differentially privately releases a number of sufficient statistics from which the approximation of all symmetric norms can be simultaneously computed. Our framework partitions the coordinates of the underlying frequency vector into different levels based on their magnitude and releases approximate frequencies for the "heavy" coordinates in important levels and releases approximate level sizes for the "light" coordinates in important levels. Surprisingly, our mechanism allows for the release of an arbitrary number of symmetric norm approximations without any overhead or additional loss in privacy. Moreover, our mechanism permits (1+?)-approximation to each of the symmetric norms and can be implemented using sublinear space in the streaming model for many regimes of the accuracy and privacy parameters
Correlation Clustering in Data Streams
Clustering is a fundamental tool for analyzing large data sets. A rich body of work has been devoted to designing data-stream algorithms for the relevant optimization problems such as k-center, k-median, and k-means. Such algorithms need to be both time and space efficient. In this paper, we address the problem of correlation clustering in the dynamic data stream model. The stream consists of updates to the edge weights of a graph on n nodes and the goal is to find a node-partition such that the end-points of negative-weight edges are typically in different clusters whereas the end-points of positive-weight edges are typically in the same cluster. We present polynomial-time, O(n ⋅ polylog n)-space approximation algorithms for natural problems that arise. We first develop data structures based on linear sketches that allow the “quality” of a given node-partition to be measured. We then combine these data structures with convex programming and sampling techniques to solve the relevant approximation problem. Unfortunately, the standard LP and SDP formulations are not obviously solvable in O(n ⋅ polylog n)-space. Our work presents space-efficient algorithms for the convex programming required, as well as approaches to reduce the adaptivity of the sampling