
18 research outputs found

AMS Without 4-Wise Independence on Product Domains

Author: Braverman Vladimir
Chung Kai-Min
Liu Zhenming
Mitzenmacher Michael
Ostrovsky Rafail
Publication date: 01/01/2010

In their seminal work, Alon, Matias, and Szegedy introduced several sketching techniques, including showing that 4-wise independence is sufficient to obtain good approximations of the second frequency moment. In this work, we show that their sketching technique can be extended to product domains $[n]^k$ by using the product of 4-wise independent functions on $[n]$. Our work extends that of Indyk and McGregor, who showed the result for $k = 2$. Their primary motivation was the problem of identifying correlations in data streams. In their model, a stream of pairs $(i,j) \in [n]^2$ arrives, giving a joint distribution $(X,Y)$, and they find approximation algorithms for how close the joint distribution is to the product of the marginal distributions under various metrics, which naturally corresponds to how close $X$ and $Y$ are to being independent. By using our technique, we obtain a new result for the problem of approximating the $\ell_2$ distance between the joint distribution and the product of the marginal distributions for $k$-ary vectors, instead of just pairs, in a single pass. Our analysis gives a randomized algorithm that is a $(1 \pm \epsilon)$ approximation (with probability $1-\delta$) that requires space logarithmic in $n$ and $m$ and proportional to $3^k$.

arXiv.org e-Print Archive

Dagstuhl Research Online Publication Server
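
The product-of-hashes idea in the abstract above can be illustrated with a minimal sketch. It is not the authors' construction in detail: the class name `ProductSketch`, the prime `P`, and the use of a plain average (rather than a median of means) are assumptions made for illustration, and taking the low bit of a degree-3 polynomial modulo an odd prime gives sign functions that are only approximately unbiased. Each counter adds the product of $k$ sign hashes, one per coordinate, and the squared counter estimates the second frequency moment over $[n]^k$.

```python
import random

P = (1 << 61) - 1  # Mersenne prime used as the hash field (an assumption)


def make_sign_hash(rng):
    """Return a +/-1 hash from a random degree-3 polynomial mod P.

    A random degree-3 polynomial over a prime field is 4-wise independent;
    mapping its value to a sign via the low bit is only approximately unbiased.
    """
    coeffs = [rng.randrange(P) for _ in range(4)]

    def sign(x):
        acc = 0
        for c in coeffs:  # Horner evaluation of the polynomial at x
            acc = (acc * x + c) % P
        return 1 if acc & 1 else -1

    return sign


class ProductSketch:
    """Hypothetical AMS-style F2 sketch for items in [n]^k."""

    def __init__(self, k, reps, seed=0):
        rng = random.Random(seed)
        # One independent sign hash per coordinate, per repetition.
        self.hashes = [[make_sign_hash(rng) for _ in range(k)]
                       for _ in range(reps)]
        self.counters = [0] * reps

    def update(self, item, delta=1):
        """Process one stream update: item is a k-tuple, delta its weight."""
        for r, coordinate_hashes in enumerate(self.hashes):
            s = 1
            for h, coord in zip(coordinate_hashes, item):
                s *= h(coord)  # product of per-coordinate signs
            self.counters[r] += s * delta

    def estimate_f2(self):
        """Average of squared counters, an estimate of the sum of squared frequencies."""
        return sum(z * z for z in self.counters) / len(self.counters)
```

For a stream that contains a single distinct item, every counter equals plus or minus its frequency, so the estimate is exact; with many distinct items the estimate is only approximate, and the abstract's $3^k$ factor reflects how the variance grows with $k$.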

Finding Subcube Heavy Hitters in Analytics Data Streams

Author: Kveton Branislav
Muthukrishnan S.
Vu Hoa T.
Xian Yikun
Publication date: 01/01/2018

Data streams typically have items with a large number of dimensions. We study the fundamental heavy-hitters problem in this setting. Formally, the data stream consists of $d$-dimensional items $x_1,\ldots,x_m \in [n]^d$. A $k$-dimensional subcube $T$ is a subset of distinct coordinates $\{ T_1,\cdots,T_k \} \subseteq [d]$. A subcube heavy hitter query ${\rm Query}(T,v)$, where $v \in [n]^k$, outputs YES if $f_T(v) \geq \gamma$ and NO if $f_T(v) < \gamma/4$, where $f_T(v)$ is the fraction of stream items whose coordinates $T$ have joint values $v$. The all subcube heavy hitters query ${\rm AllQuery}(T)$ outputs all joint values $v$ that return YES to ${\rm Query}(T,v)$. The one-dimensional version of this problem, where $d=1$, was heavily studied in data stream theory, databases, networking and signal processing. The subcube heavy hitters problem is applicable in all these cases. We present a simple reservoir-sampling-based one-pass streaming algorithm to solve the subcube heavy hitters problem in $\tilde{O}(kd/\gamma)$ space. This is optimal up to poly-logarithmic factors given the established lower bound. In the worst case, this is $\Theta(d^2/\gamma)$, which is prohibitive for large $d$, and our goal is to circumvent this quadratic bottleneck. Our main contribution is a model-based approach to the subcube heavy hitters problem. In particular, we assume that the dimensions are related to each other via the Naive Bayes model, with or without a latent dimension. Under this assumption, we present a new two-pass, $\tilde{O}(d/\gamma)$-space algorithm for our problem, and a fast algorithm for answering ${\rm AllQuery}(T)$ in $O(k/\gamma^2)$ time. Our work develops the direction of model-based data stream analysis, with much that remains to be explored. Comment: To appear in WWW 201

arXiv.org e-Print Archive

Crossref
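
The reservoir-sampling approach mentioned in the abstract above might look roughly like the following sketch. The function names, the decision threshold of $\gamma/2$ (chosen here only because it lies between $\gamma$ and $\gamma/4$), and the sample size are all assumptions for illustration; the paper's own parameters and analysis are not reproduced here.

```python
import random
from collections import Counter


def reservoir_sample(stream, size, rng):
    """Uniform sample of up to `size` items from a one-pass stream (Algorithm R)."""
    sample = []
    for t, item in enumerate(stream):
        if t < size:
            sample.append(item)
        else:
            j = rng.randrange(t + 1)  # uniform index in [0, t]
            if j < size:
                sample[j] = item
    return sample


def project(item, T):
    """Joint values of `item` on the coordinates listed in T."""
    return tuple(item[c] for c in T)


def query(sample, T, v, gamma):
    """Answer Query(T, v) from the sample; gamma / 2 is an assumed cutoff."""
    if not sample:
        return False
    hits = sum(1 for x in sample if project(x, T) == tuple(v))
    return hits / len(sample) >= gamma / 2


def all_query(sample, T, gamma):
    """Answer AllQuery(T): all joint values that pass the sampled test."""
    counts = Counter(project(x, T) for x in sample)
    return {v for v, c in counts.items() if c / len(sample) >= gamma / 2}
```

A small usage example: with `stream = [(1, 2, 3)] * 6 + [(1, 5, 3)] * 4` and a reservoir larger than the stream, the sample is the whole stream, so `query(sample, (0, 2), (1, 3), 0.5)` is answered from exact frequencies.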

Approximating Subadditive Hadamard Functions on Implicit Matrices

Author: Braverman Vladimir
Roytman Alan
Vorsanger Gregory
Publication date: 03/11/2015

An important challenge in the streaming model is to maintain small-space approximations of entrywise functions performed on a matrix that is generated by the outer product of two vectors given as a stream. In other works, streams typically define matrices in a standard way via a sequence of updates, as in the work of Woodruff (2014) and others. We describe the matrix formed by the outer product, and other matrices that do not fall into this category, as implicit matrices. As such, we consider the general problem of computing over such implicit matrices with Hadamard functions, which are functions applied entrywise on a matrix. In this paper, we apply this generalization to provide new techniques for identifying independence between two vectors in the streaming model. The previous state of the art algorithm of Braverman and Ostrovsky (2010) gave a $(1 \pm \epsilon)$-approximation for the $L_1$ distance between the product and joint distributions, using space $O(\log^{1024}(nm) \epsilon^{-1024})$, where $m$ is the length of the stream and $n$ denotes the size of the universe from which stream elements are drawn. Our general techniques include the $L_1$ distance as a special case, and we give an improved space bound of $O(\log^{12}(n) \log^{2}(\frac{nm}{\epsilon})\epsilon^{-7})$.

arXiv.org e-Print Archive

Dagstuhl Research Online Publication Server
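
To make the target quantity in the abstract above concrete, the following sketch computes the $L_1$ distance between the empirical joint distribution and the product of its marginals exactly. It is an offline baseline that stores every count, not the small-space streaming algorithm the paper describes; the function name `l1_independence_distance` is an assumption for illustration.

```python
from collections import Counter


def l1_independence_distance(pairs):
    """Exact L1 distance between the joint distribution of (i, j) pairs
    and the product of its two marginal distributions.

    `pairs` is a list of 2-tuples. Memory grows with the number of
    distinct values, so this is a ground-truth baseline only.
    """
    m = len(pairs)
    joint = Counter(pairs)
    left = Counter(i for i, _ in pairs)    # marginal counts of the first entry
    right = Counter(j for _, j in pairs)   # marginal counts of the second entry
    total = 0.0
    for i in left:
        for j in right:
            p_joint = joint[(i, j)] / m    # Counter returns 0 for unseen pairs
            p_product = (left[i] / m) * (right[j] / m)
            total += abs(p_joint - p_product)
    return total
```

For example, the stream `[(0, 0), (1, 1)]` is perfectly correlated: each of the four cells differs from the product distribution by 0.25, giving a distance of 1.0, while a stream that hits all four cells equally often gives 0.0.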

Differentially Private Fractional Frequency Moments Estimation with Polylogarithmic Space

Author: Pinelis Iosif
Song Dawn
Wang Lun
Publication venue: Digital Commons @ Michigan Tech
Publication date: 27/09/2021

We prove that Fp sketch, a well-celebrated streaming algorithm for frequency moments estimation, is differentially private as is when p ∈ (0, 1]. Fp sketch uses only polylogarithmic space, exponentially better than existing DP baselines and only worse than the optimal non-private baseline by a logarithmic factor. The evaluation shows that Fp sketch can achieve reasonable accuracy with differential privacy guarantee. The evaluation code is included in the supplementary material

arXiv.org e-Print Archive

Michigan Technological University

Correlation clustering in data streams

Author: Ahn Kook-Jin
Cormode Graham
Guha Sudipto
McGregor Andrew
Wirth Anthony Ian
Publication venue: 'Test accounts'
Publication date: 01/01/2015

In this paper, we address the problem of correlation clustering in the dynamic data stream model. The stream consists of updates to the edge weights of a graph on n nodes and the goal is to find a node-partition such that the end-points of negative-weight edges are typically in different clusters whereas the end-points of positive-weight edges are typically in the same cluster. We present polynomial-time, O(n·polylog n)-space approximation algorithms for natural problems that arise. We first develop data structures based on linear sketches that allow the “quality” of a given node-partition to be measured. We then combine these data structures with convex programming and sampling techniques to solve the relevant approximation problem. However, the standard LP and SDP formulations are not obviously solvable in O(n·polylog n)-space. Our work presents space-efficient algorithms for the convex programming required, as well as approaches to reduce the adaptivity of the sampling. Note that the improved space and running-time bounds achieved from streaming algorithms are also useful for offline settings such as MapReduce models.

Warwick Research Archives Portal Repository

Private Data Stream Analysis for Universal Symmetric Norm Estimation

Author: Braverman Vladimir
Manning Joel
Wu Zhiwei Steven
Zhou Samson
Publication venue: LIPIcs - Leibniz International Proceedings in Informatics. Approximation, Randomization, and Combinatorial Optimization. Algorithms and Techniques (APPROX/RANDOM 2023)
Publication date: 01/01/2023

We study how to release summary statistics on a data stream subject to the constraint of differential privacy. In particular, we focus on releasing the family of symmetric norms, which are invariant under sign-flips and coordinate-wise permutations on an input data stream and include L_p norms, k-support norms, top-k norms, and the box norm as special cases. Although it may be possible to design and analyze a separate mechanism for each symmetric norm, we propose a general parametrizable framework that differentially privately releases a number of sufficient statistics from which the approximation of all symmetric norms can be simultaneously computed. Our framework partitions the coordinates of the underlying frequency vector into different levels based on their magnitude and releases approximate frequencies for the "heavy" coordinates in important levels and releases approximate level sizes for the "light" coordinates in important levels. Surprisingly, our mechanism allows for the release of an arbitrary number of symmetric norm approximations without any overhead or additional loss in privacy. Moreover, our mechanism permits (1+?)-approximation to each of the symmetric norms and can be implemented using sublinear space in the streaming model for many regimes of the accuracy and privacy parameters

Dagstuhl Research Online Publication Server


Correlation Clustering in Data Streams

Author: Ahn Kook Jin
Cormode Graham
Guha Sudipto
McGregor Andrew
Wirth Anthony
Publication venue: ScholarWorks@UMass Amherst
Publication date: 01/01/2021

Clustering is a fundamental tool for analyzing large data sets. A rich body of work has been devoted to designing data-stream algorithms for the relevant optimization problems such as k-center, k-median, and k-means. Such algorithms need to be both time and space efficient. In this paper, we address the problem of correlation clustering in the dynamic data stream model. The stream consists of updates to the edge weights of a graph on n nodes and the goal is to find a node-partition such that the end-points of negative-weight edges are typically in different clusters whereas the end-points of positive-weight edges are typically in the same cluster. We present polynomial-time, O(n ⋅ polylog n)-space approximation algorithms for natural problems that arise. We first develop data structures based on linear sketches that allow the “quality” of a given node-partition to be measured. We then combine these data structures with convex programming and sampling techniques to solve the relevant approximation problem. Unfortunately, the standard LP and SDP formulations are not obviously solvable in O(n ⋅ polylog n)-space. Our work presents space-efficient algorithms for the convex programming required, as well as approaches to reduce the adaptivity of the sampling.

ScholarWorks@UMass Amherst

Warwick Research Archives Portal Repository

University of Melbourne Institutional Repository