13 research outputs found

Taylor Polynomial Estimator for Estimating Frequency Moments

Author: A Andoni
M Charikar
N Alon
P Indyk
R Singh
S Ganguly
Y Li
Publication date: 03/06/2015

We present a randomized algorithm for estimating the $p$th moment $F_p$ of the frequency vector of a data stream in the general update (turnstile) model to within a multiplicative factor of $1 \pm \epsilon$, for $p > 2$, with high constant confidence. For $0 < \epsilon \le 1$, the algorithm uses space $O(n^{1-2/p} \epsilon^{-2} + n^{1-2/p} \epsilon^{-4/p} \log(n))$ words. This improves over the current bound of $O(n^{1-2/p} \epsilon^{-2-4/p} \log(n))$ words by Andoni et al. in \cite{ako:arxiv10}. Our space upper bound matches the lower bound of Li and Woodruff \cite{liwood:random13} for $\epsilon = (\log(n))^{-\Omega(1)}$ and the lower bound of Andoni et al. \cite{anpw:icalp13} for $\epsilon = \Omega(1)$.

Comment: Supersedes arXiv:1104.4552. Extended Abstract of this paper to appear in Proceedings of ICALP 201

arXiv.org e-Print Archive

Crossref

Approximating Approximate Pattern Matching

Author: Studený Jan
Uznański Przemysław
Publication date: 01/01/2019

Given a text $T$ of length $n$ and a pattern $P$ of length $m$, the approximate pattern matching problem asks for computation of a particular distance function between $P$ and every $m$-substring of $T$. We consider a $(1\pm\varepsilon)$ multiplicative approximation variant of this problem, for $\ell_p$ distance function. In this paper, we describe two $(1+\varepsilon)$-approximate algorithms with a runtime of $\widetilde{O}(\frac{n}{\varepsilon})$ for all (constant) non-negative values of $p$. For constant $p \ge 1$ we show a deterministic $(1+\varepsilon)$-approximation algorithm. Previously, such run time was known only for the case of $\ell_1$ distance, by Gawrychowski and Uznański [ICALP 2018] and only with a randomized algorithm. For constant $0 \le p \le 1$ we show a randomized algorithm for the $\ell_p$, thereby providing a smooth tradeoff between algorithms of Kopelowitz and Porat [FOCS 2015, SOSA 2018] for Hamming distance (case of $p=0$) and of Gawrychowski and Uznański for $\ell_1$ distance.

arXiv.org e-Print Archive

Repository for Publications and Research Data

Distributed Data Summarization in Well-Connected Networks

Author: Su Hsin-Hao
Vu Hoa T.
Publication venue: LIPIcs - Leibniz International Proceedings in Informatics. 33rd International Symposium on Distributed Computing (DISC 2019)
Publication date: 01/01/2019

We study distributed algorithms for some fundamental problems in data summarization. Given a communication graph $G$ of $n$ nodes, each of which may hold a value initially, we focus on computing $\sum_{i=1}^{N} g(f_i)$, where $f_i$ is the number of occurrences of value $i$ and $g$ is some fixed function. This includes important statistics such as the number of distinct elements, frequency moments, and the empirical entropy of the data. In the CONGEST model, a simple adaptation from streaming lower bounds shows that it requires $\widetilde{\Omega}(D + n)$ rounds, where $D$ is the diameter of the graph, to compute some of these statistics exactly. However, these lower bounds do not hold for graphs that are well-connected. We give an algorithm that computes $\sum_{i=1}^{N} g(f_i)$ exactly in $\tau_G \cdot 2^{O(\sqrt{\log n})}$ rounds, where $\tau_G$ is the mixing time of $G$. This also has applications in computing the top $k$ most frequent elements. We demonstrate that there is a high similarity between the GOSSIP model and the CONGEST model in well-connected graphs. In particular, we show that each round of the GOSSIP model can be simulated almost perfectly in $\widetilde{O}(\tau_G)$ rounds of the CONGEST model. To this end, we develop a new algorithm for the GOSSIP model that $1 \pm \epsilon$ approximates the $p$-th frequency moment $F_p = \sum_{i=1}^{N} f_i^p$ in $\widetilde{O}(\epsilon^{-2} n^{1-k/p})$ rounds for $p \ge 2$, when the number of distinct elements $F_0$ is at most $O(n^{1/(k-1)})$. This result can be translated back to the CONGEST model with a factor $\widetilde{O}(\tau_G)$ blow-up in the number of rounds.
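
For reference, the statistic in this abstract can be computed exactly in a centralized setting as follows. This is a minimal sketch of the quantity being summarized, not the distributed algorithm from the paper; the function names (`g_sum`, `distinct_count`, `frequency_moment`, `empirical_entropy`) are illustrative assumptions, and only values that actually occur are summed, which amounts to assuming g(0) = 0.

```python
from collections import Counter
from math import log2

def g_sum(values, g):
    """Return the sum of g(f_i) over all occurring values i, where f_i is the count of i."""
    freqs = Counter(values)
    return sum(g(f) for f in freqs.values())

def distinct_count(values):
    # g(f) = 1 for every f > 0 gives the number of distinct elements F_0.
    return g_sum(values, lambda f: 1)

def frequency_moment(values, p):
    # g(f) = f^p gives the p-th frequency moment F_p.
    return g_sum(values, lambda f: f ** p)

def empirical_entropy(values):
    # g(f) = -(f/m) * log2(f/m), with m the total number of values held.
    m = len(values)
    return g_sum(values, lambda f: -(f / m) * log2(f / m))

data = ["a", "b", "a", "c", "a", "b"]  # counts: a=3, b=2, c=1
print(distinct_count(data))
print(frequency_moment(data, 2))
print(empirical_entropy(data))
```

Expressing all three statistics through one `g_sum` mirrors the abstract's framing: only the fixed function g changes between them.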

arXiv.org e-Print Archive

Dagstuhl Research Online Publication Server

Approximating Approximate Pattern Matching

Author: Uznanski Przemyslaw
Publication venue: LIPIcs - Leibniz International Proceedings in Informatics. 30th Annual Symposium on Combinatorial Pattern Matching (CPM 2019)
Publication date: 01/01/2019

Given a text $T$ of length $n$ and a pattern $P$ of length $m$, the approximate pattern matching problem asks for computation of a particular distance function between $P$ and every $m$-substring of $T$. We consider a $(1 \pm \varepsilon)$ multiplicative approximation variant of this problem, for $\ell_p$ distance function. In this paper, we describe two $(1+\varepsilon)$-approximate algorithms with a runtime of $\widetilde{O}(n/\varepsilon)$ for all (constant) non-negative values of $p$. For constant $p \ge 1$ we show a deterministic $(1+\varepsilon)$-approximation algorithm. Previously, such run time was known only for the case of $\ell_1$ distance, by Gawrychowski and Uznanski [ICALP 2018] and only with a randomized algorithm. For constant $0 \le p \le 1$ we show a randomized algorithm for the $\ell_p$, thereby providing a smooth tradeoff between algorithms of Kopelowitz and Porat [FOCS 2015, SOSA 2018] for Hamming distance (case of $p=0$) and of Gawrychowski and Uznanski for $\ell_1$ distance.
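
To make the problem statement concrete, the following is a naive exact baseline, not either of the paper's approximation algorithms. It computes the distance between the pattern and every m-substring in O(nm) time. The function name `lp_distances` and the use of numeric lists for the text and pattern are assumptions made for illustration; for p > 0 the distance is taken as the p-th root of the sum of p-th powers of absolute differences, and for p = 0 as the Hamming distance.

```python
def lp_distances(text, pattern, p):
    """Exact l_p distance between pattern and every m-substring of text.

    p > 0: (sum_j |T[i+j] - P[j]|^p)^(1/p)
    p = 0: number of mismatching positions (Hamming distance)
    """
    n, m = len(text), len(pattern)
    out = []
    for i in range(n - m + 1):
        diffs = [abs(text[i + j] - pattern[j]) for j in range(m)]
        if p == 0:
            out.append(sum(1 for d in diffs if d != 0))
        else:
            out.append(sum(d ** p for d in diffs) ** (1.0 / p))
    return out

text = [1, 2, 3, 4, 2]
pattern = [2, 3]
print(lp_distances(text, pattern, 0))  # Hamming distances per alignment
print(lp_distances(text, pattern, 1))  # l_1 distances per alignment
```

A baseline like this is mainly useful for checking the output of a faster approximate method on small inputs.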

Dagstuhl Research Online Publication Server

Continuous Monitoring of l_p Norms in Data Streams

Author: Blasiok Jaroslaw
Ding Jian
Nelson Jelani
Publication venue: LIPIcs - Leibniz International Proceedings in Informatics. Approximation, Randomization, and Combinatorial Optimization. Algorithms and Techniques (APPROX/RANDOM 2017)
Publication date: 01/01/2017

In insertion-only streaming, one sees a sequence of indices $a_1, a_2, \ldots, a_m$ in $[n]$. The stream defines a sequence of $m$ frequency vectors $x(1), \ldots, x(m)$, each in $\mathbb{R}^n$, where $x(t)$ is the frequency vector of items after seeing the first $t$ indices in the stream. Much work in the streaming literature focuses on estimating some function $f(x(m))$. Many applications though require obtaining estimates at time $t$ of $f(x(t))$, for every $t$ in $[m]$. Naively this guarantee is obtained by devising an algorithm with failure probability less than $1/m$, then performing a union bound over all stream updates to guarantee that all $m$ estimates are simultaneously accurate with good probability. When $f(x)$ is some $\ell_p$ norm of $x$, recent works have shown that this union bound is wasteful and better space complexity is possible for the continuous monitoring problem, with the strongest known results being for $p=2$. In this work, we improve the state of the art for all $0<p<2$, which we obtain via a novel analysis of Indyk's $p$-stable sketch.
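
The continuous monitoring task described here can be illustrated with an exact, linear-space reference that reports the norm of x(t) after every insertion. This is only a sketch of the problem, not Indyk's p-stable sketch or the paper's analysis; the name `monitor_lp_norm` and the 0-based indexing are assumptions.

```python
def monitor_lp_norm(stream, n, p):
    """Yield the exact l_p norm of x(t) after each insertion, for p > 0.

    stream: iterable of indices in range(n) (0-based here, an assumption).
    Uses O(n) space, so it is a correctness reference, not a small-space sketch.
    """
    x = [0] * n
    power_sum = 0.0  # maintains sum_i x_i^p incrementally
    for a in stream:
        # Only coordinate a changes, so update its contribution alone.
        power_sum += (x[a] + 1) ** p - x[a] ** p
        x[a] += 1
        yield power_sum ** (1.0 / p)

norms = list(monitor_lp_norm([0, 1, 0, 2, 0], n=3, p=2))
print(norms)
```

For non-integer p the incremental sum accumulates floating-point error over long streams, so a real reference implementation might periodically recompute it from `x`.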

arXiv.org e-Print Archive

Dagstuhl Research Online Publication Server

High Probability Frequency Moment Sketches

Author: Ganguly Sumit
Woodruff David P.
Publication venue: LIPIcs - Leibniz International Proceedings in Informatics. 45th International Colloquium on Automata, Languages, and Programming (ICALP 2018)
Publication date: 01/01/2018

We consider the problem of sketching the $p$-th frequency moment of a vector, $p>2$, with multiplicative error at most $1 \pm \epsilon$ and with high confidence $1-\delta$. Despite the long sequence of work on this problem, tight bounds on this quantity are only known for constant $\delta$. While one can obtain an upper bound with error probability $\delta$ by repeating a sketching algorithm with constant error probability $O(\log(1/\delta))$ times in parallel, and taking the median of the outputs, we show this is a suboptimal algorithm! Namely, we show optimal upper and lower bounds of $\Theta(n^{1-2/p} \log(1/\delta) + n^{1-2/p} \log^{2/p}(1/\delta) \log n)$ on the sketching dimension, for any constant approximation. Our result should be contrasted with results for estimating frequency moments for $1 \le p \le 2$, for which we show the optimal algorithm for general $\delta$ is obtained by repeating the optimal algorithm for constant error probability $O(\log(1/\delta))$ times and taking the median output. We also obtain a matching lower bound for this problem, up to constant factors.
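
The baseline this abstract argues against, repeating a constant-error estimator and taking the median of the outputs, can be sketched as below. The helper names (`median_amplify`, `make_noisy_estimator`) and the toy noise model are hypothetical and chosen only for illustration; the paper's own sketches are not reproduced here.

```python
import random
from statistics import median

def median_amplify(estimator, repetitions):
    """Run an estimator independently `repetitions` times and return the median output."""
    return median(estimator() for _ in range(repetitions))

def make_noisy_estimator(true_value, rng, failure_prob=0.25):
    # Hypothetical estimator: within 10% of the truth with probability
    # 1 - failure_prob, and wildly off otherwise.
    def estimator():
        if rng.random() < failure_prob:
            return true_value * rng.choice([0.0, 10.0])
        return true_value * rng.uniform(0.9, 1.1)
    return estimator

rng = random.Random(0)
estimate = median_amplify(make_noisy_estimator(100.0, rng), repetitions=31)
print(estimate)
```

The median is within the accuracy band whenever more than half of the runs are, which is why the number of repetitions only needs to grow like log(1/delta). Per the abstract, this generic recipe is optimal for 1 <= p <= 2 but not for p > 2.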

arXiv.org e-Print Archive

Dagstuhl Research Online Publication Server

Private Data Stream Analysis for Universal Symmetric Norm Estimation

Author: Braverman Vladimir
Manning Joel
Wu Zhiwei Steven
Zhou Samson
Publication venue: LIPIcs - Leibniz International Proceedings in Informatics. Approximation, Randomization, and Combinatorial Optimization. Algorithms and Techniques (APPROX/RANDOM 2023)
Publication date: 01/01/2023

We study how to release summary statistics on a data stream subject to the constraint of differential privacy. In particular, we focus on releasing the family of symmetric norms, which are invariant under sign-flips and coordinate-wise permutations on an input data stream and include $L_p$ norms, $k$-support norms, top-$k$ norms, and the box norm as special cases. Although it may be possible to design and analyze a separate mechanism for each symmetric norm, we propose a general parametrizable framework that differentially privately releases a number of sufficient statistics from which the approximation of all symmetric norms can be simultaneously computed. Our framework partitions the coordinates of the underlying frequency vector into different levels based on their magnitude and releases approximate frequencies for the "heavy" coordinates in important levels and releases approximate level sizes for the "light" coordinates in important levels. Surprisingly, our mechanism allows for the release of an arbitrary number of symmetric norm approximations without any overhead or additional loss in privacy. Moreover, our mechanism permits $(1+\alpha)$-approximation to each of the symmetric norms and can be implemented using sublinear space in the streaming model for many regimes of the accuracy and privacy parameters.
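
The partitioning step mentioned in this abstract can be pictured with a small non-private sketch. It assumes integer frequencies and power-of-two level boundaries, neither of which is stated in the abstract, and it adds no noise, so it provides no differential privacy. The names `partition_into_levels` and `level_sizes` are hypothetical.

```python
from collections import defaultdict

def partition_into_levels(freq):
    """Group nonzero integer coordinates by magnitude.

    Level j holds the coordinates i with 2^j <= |f_i| < 2^(j+1).
    The geometric boundaries are an illustrative assumption.
    """
    levels = defaultdict(list)
    for i, f in enumerate(freq):
        if f != 0:
            # bit_length() - 1 equals floor(log2(|f|)) for positive integers.
            levels[abs(f).bit_length() - 1].append(i)
    return dict(levels)

def level_sizes(freq):
    """Number of coordinates in each level (exact here, with no privacy noise)."""
    return {j: len(coords) for j, coords in partition_into_levels(freq).items()}

freq = [1, -3, 8, 0, 5, 2]
print(partition_into_levels(freq))
print(level_sizes(freq))
```

Because symmetric norms are invariant under sign-flips and permutations, the levels depend only on absolute values, which is why `abs(f)` is used here.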

Dagstuhl Research Online Publication Server

Private Data Stream Analysis for Universal Symmetric Norm Estimation

Author: Braverman Vladimir
Manning Joel
Wu Zhiwei Steven
Zhou Samson
Publication date: 09/07/2023

We study how to release summary statistics on a data stream subject to the constraint of differential privacy. In particular, we focus on releasing the family of symmetric norms, which are invariant under sign-flips and coordinate-wise permutations on an input data stream and include $L_p$ norms, $k$-support norms, top-$k$ norms, and the box norm as special cases. Although it may be possible to design and analyze a separate mechanism for each symmetric norm, we propose a general parametrizable framework that differentially privately releases a number of sufficient statistics from which the approximation of all symmetric norms can be simultaneously computed. Our framework partitions the coordinates of the underlying frequency vector into different levels based on their magnitude and releases approximate frequencies for the "heavy" coordinates in important levels and releases approximate level sizes for the "light" coordinates in important levels. Surprisingly, our mechanism allows for the release of an arbitrary number of symmetric norm approximations without any overhead or additional loss in privacy. Moreover, our mechanism permits $(1+\alpha)$-approximation to each of the symmetric norms and can be implemented using sublinear space in the streaming model for many regimes of the accuracy and privacy parameters.

arXiv.org e-Print Archive