106 research outputs found

    Recursive Sketching For Frequency Moments

    In a ground-breaking paper, Indyk and Woodruff (STOC 05) showed how to compute $F_k$ (for $k>2$) in space complexity $O(\mathrm{poly\text{-}log}(n,m)\cdot n^{1-\frac{2}{k}})$, which is optimal up to (large) poly-logarithmic factors in $n$ and $m$, where $m$ is the length of the stream and $n$ is the upper bound on the number of distinct elements in a stream. The best known lower bound for large moments is $\Omega(\log(n)\cdot n^{1-\frac{2}{k}})$. A follow-up work of Bhuvanagiri, Ganguly, Kesh and Saha (SODA 2006) reduced the poly-logarithmic factors of Indyk and Woodruff to $O(\log^2(m)\cdot(\log n+\log m)\cdot n^{1-\frac{2}{k}})$. Further reduction of poly-log factors has been an elusive goal since 2006, when the Indyk and Woodruff method seemed to hit a natural "barrier." Using our simple recursive sketch, we provide a different yet simple approach to obtain an $O(\log(m)\log(nm)\cdot(\log\log n)^4\cdot n^{1-\frac{2}{k}})$ algorithm for constant $\epsilon$ (our bound is, in fact, somewhat stronger, since the $(\log\log n)$ term can be replaced by any constant number of $\log$ iterations instead of just two or three, thus approaching $\log^* n$). Our bound also works for non-constant $\epsilon$ (for details see the body of the paper). Further, our algorithm requires only 4-wise independence, in contrast to existing methods that use pseudo-random generators for computing large frequency moments.
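    The quantity being estimated is the frequency moment $F_k=\sum_i f_i^k$, where $f_i$ is the number of occurrences of item $i$ in the stream. As a hedged illustration only (not the paper's recursive sketch), the Python snippet below computes $F_k$ exactly from a stream and contrasts it with the classic AMS-style estimator for $F_2$, which already shows how random signs with limited independence give a small-space moment estimate; all names and parameters are illustrative.

```python
# Illustrative sketch only, not the paper's algorithm.
import random
from collections import Counter

def exact_Fk(stream, k):
    """F_k = sum of f_i^k over distinct items i, where f_i is i's frequency."""
    freqs = Counter(stream)
    return sum(f ** k for f in freqs.values())

def ams_F2_estimate(stream, universe, repetitions=50):
    """Average of repetitions of the classic AMS estimator for F_2.

    Each repetition draws a random +/-1 sign per item, maintains the signed
    running sum z over the stream, and uses z*z, whose expectation is F_2.
    The analysis only needs 4-wise independent signs; fully independent
    signs are used here for simplicity.
    """
    estimates = []
    for _ in range(repetitions):
        signs = {x: random.choice((-1, 1)) for x in universe}
        z = sum(signs[x] for x in stream)  # single signed counter
        estimates.append(z * z)            # E[z^2] = F_2
    return sum(estimates) / len(estimates)

if __name__ == "__main__":
    universe = list(range(100))
    stream = [random.choice(universe) for _ in range(10_000)]
    print("exact F_2:", exact_Fk(stream, 2))
    print("AMS-style estimate of F_2:", ams_F2_estimate(stream, universe))
```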

    Communication Lower Bounds for Statistical Estimation Problems via a Distributed Data Processing Inequality

    We study the tradeoff between the statistical error and communication cost of distributed statistical estimation problems in high dimensions. In the distributed sparse Gaussian mean estimation problem, each of the $m$ machines receives $n$ data points from a $d$-dimensional Gaussian distribution with unknown mean $\theta$ which is promised to be $k$-sparse. The machines communicate by message passing and aim to estimate the mean $\theta$. We provide a tight (up to logarithmic factors) tradeoff between the estimation error and the number of bits communicated between the machines. This directly leads to a lower bound for the distributed \textit{sparse linear regression} problem: to achieve the statistical minimax error, the total communication is at least $\Omega(\min\{n,d\}\cdot m)$, where $n$ is the number of observations that each machine receives and $d$ is the ambient dimension. These lower bounds improve upon [Sha14, SD'14] by allowing a multi-round iterative communication model. We also give the first optimal simultaneous protocol in the dense case for mean estimation. As our main technique, we prove a \textit{distributed data processing inequality}, as a generalization of usual data processing inequalities, which might be of independent interest and useful for other problems.
    Comment: To appear at STOC 2016. Fixed typos in Theorem 4.5 and incorporated reviewers' suggestions.
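    As a toy illustration of the problem setup only (not the paper's protocol or its lower-bound technique), the Python sketch below simulates distributed sparse Gaussian mean estimation with a naive baseline: each of $m$ machines ships its local sample mean to a coordinator, which averages the $m$ vectors and hard-thresholds to the $k$ largest-magnitude coordinates. All function names and parameters are assumptions chosen for exposition.

```python
# Toy simulation of the distributed sparse mean estimation setup (illustrative only).
import numpy as np

def simulate(m=20, n=50, d=200, k=5, seed=0):
    rng = np.random.default_rng(seed)

    # k-sparse ground-truth mean theta
    theta = np.zeros(d)
    support = rng.choice(d, size=k, replace=False)
    theta[support] = 1.0

    # each machine averages its own n samples from N(theta, I_d)
    local_means = [rng.normal(theta, 1.0, size=(n, d)).mean(axis=0)
                   for _ in range(m)]

    # coordinator: average the m local means, keep the k largest coordinates
    global_mean = np.mean(local_means, axis=0)
    estimate = np.zeros(d)
    top_k = np.argsort(np.abs(global_mean))[-k:]
    estimate[top_k] = global_mean[top_k]

    print("squared L2 error:", float(np.sum((estimate - theta) ** 2)))
    # naive communication cost: every machine sends all d coordinates
    print("real numbers communicated (naive protocol):", m * d)

if __name__ == "__main__":
    simulate()
```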