249 research outputs found

Recursive Sketching For Frequency Moments

Authors: Braverman Vladimir; Ostrovsky Rafail
Publication date: 11/11/2010

In a ground-breaking paper, Indyk and Woodruff (STOC 05) showed how to compute $F_k$ (for $k>2$) in space complexity $O(\text{poly-log}(n,m)\cdot n^{1-\frac{2}{k}})$, which is optimal up to (large) poly-logarithmic factors in $n$ and $m$, where $m$ is the length of the stream and $n$ is the upper bound on the number of distinct elements in a stream. The best known lower bound for large moments is $\Omega(\log(n)\, n^{1-\frac{2}{k}})$. A follow-up work of Bhuvanagiri, Ganguly, Kesh and Saha (SODA 2006) reduced the poly-logarithmic factors of Indyk and Woodruff to $O(\log^2(m)\cdot (\log n+ \log m)\cdot n^{1-\frac{2}{k}})$. Further reduction of poly-log factors has been an elusive goal since 2006, when the Indyk and Woodruff method seemed to hit a natural "barrier." Using our simple recursive sketch, we provide a different yet simple approach to obtain an $O(\log(m)\log(nm)\cdot (\log\log n)^4\cdot n^{1-\frac{2}{k}})$ algorithm for constant $\epsilon$ (our bound is, in fact, somewhat stronger, where the $(\log\log n)$ term can be replaced by any constant number of $\log$ iterations instead of just two or three, thus approaching $\log^* n$). Our bound also works for non-constant $\epsilon$ (for details see the body of the paper). Further, our algorithm requires only $4$-wise independence, in contrast to existing methods that use pseudo-random generators for computing large frequency moments.
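For orientation, the $k$-th frequency moment of a stream is $F_k = \sum_i f_i^k$, where $f_i$ is the number of occurrences of item $i$. The Python sketch below is not the recursive sketch of the paper. It only illustrates two ingredients the abstract refers to: the exact, linear-space baseline that sublinear algorithms approximate, and one standard way to obtain a 4-wise independent hash family (a random polynomial of degree at most 3 over a prime field). The function names, the choice of prime and the seed are illustrative assumptions.

```python
import random
from collections import Counter

# Mersenne prime used as the field size; an illustrative choice, not from the paper.
PRIME = (1 << 61) - 1


def frequency_moment(stream, k):
    """Exact F_k = sum_i f_i**k, using space linear in the number of distinct items."""
    counts = Counter(stream)
    return sum(f ** k for f in counts.values())


def make_four_wise_hash(rng):
    """Return h(x) = (a3*x^3 + a2*x^2 + a1*x + a0) mod PRIME with random coefficients.

    A uniformly random polynomial of degree at most 3 over a prime field is a
    4-wise independent family on inputs in [0, PRIME).
    """
    coeffs = [rng.randrange(PRIME) for _ in range(4)]

    def h(x):
        acc = 0
        for c in coeffs:  # Horner evaluation, highest-degree coefficient first
            acc = (acc * x + c) % PRIME
        return acc

    return h


stream = [1, 1, 2, 3, 3, 3]
print(frequency_moment(stream, 3))  # 2**3 + 1**3 + 3**3
h = make_four_wise_hash(random.Random(0))
print([h(x) % 2 for x in stream])   # one hash bit per stream item
```

Storing the four coefficients takes $O(\log n)$ bits, which is why limited independence is attractive compared with running a full pseudo-random generator.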

arXiv.org e-Print Archive

CiteSeerX

Max-stable sketches: estimation of Lp-norms, dominance norms and point queries for non-negative signals

Authors: Stoev Stilian A.; Taqqu Murad S.
Publication date: 01/01/2010

Max-stable random sketches can be computed efficiently on fast streaming positive data sets by using only sequential access to the data. They can be used to answer point and Lp-norm queries for the signal. There is an intriguing connection between the so-called p-stable (or sum-stable) and the max-stable sketches. Rigorous performance guarantees through error-probability estimates are derived and the algorithmic implementation is discussed.
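A minimal sketch of the max-stable idea, under stated assumptions rather than as the authors' implementation: if $Z_j(i)$ are independent standard $\alpha$-Fréchet variables, then $E_j = \max_i f(i)\, Z_j(i)$ has the same distribution as $\|f\|_\alpha \cdot Z$, so the norm can be read off from a quantile of the $E_j$. The code assumes each index arrives once with its non-negative value, regenerates $Z_j(i)$ from a string seed, and uses a median-based estimator as one simple choice; all names are hypothetical.

```python
import math
import random
import statistics


def frechet(alpha, rng):
    """Draw a standard alpha-Frechet variable: P(Z <= x) = exp(-x**(-alpha))."""
    u = rng.random()
    while u == 0.0:  # avoid log(0); happens with negligible probability
        u = rng.random()
    return (-math.log(u)) ** (-1.0 / alpha)


def max_stable_sketch(signal, alpha, num_rows, seed=0):
    """Compute E_j = max_i f(i) * Z_j(i) for (i, f(i)) pairs with f(i) >= 0."""
    sketch = [0.0] * num_rows
    for i, value in signal:
        for j in range(num_rows):
            # Z_j(i) is regenerated from (seed, j, i), so it is never stored.
            z = frechet(alpha, random.Random(f"{seed}:{j}:{i}"))
            sketch[j] = max(sketch[j], value * z)
    return sketch


def estimate_norm(sketch, alpha):
    """Estimate ||f||_alpha = (sum_i f(i)**alpha)**(1/alpha) from the sketch median."""
    # The median of a standard alpha-Frechet variable is (ln 2)**(-1/alpha).
    return statistics.median(sketch) * math.log(2.0) ** (1.0 / alpha)


signal = [(0, 3.0), (1, 4.0)]  # ||f||_2 = 5
print(estimate_norm(max_stable_sketch(signal, alpha=2.0, num_rows=501), alpha=2.0))
```

Because the sketch is a coordinate-wise maximum, sketches of signals on disjoint index sets combine by taking the element-wise maximum, provided they share the same seed.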

arXiv.org e-Print Archive

Boston University Institutional Repository (OpenBU)

Pseudorandomness for Regular Branching Programs via Fourier Analysis

Authors: A. Healy; C.J. Lu; E. Kaplan; E. Rozenman; G. Even; I. Haitner; J. Naor; J. Šíma; M. Saks; N. Linial; O. Reingold; P. Indyk
Publication date: 01/01/2013

We present an explicit pseudorandom generator for oblivious, read-once, permutation branching programs of constant width that can read their input bits in any order. The seed length is $O(\log^2 n)$, where $n$ is the length of the branching program. The previous best seed length known for this model was $n^{1/2+o(1)}$, which follows as a special case of a generator due to Impagliazzo, Meka, and Zuckerman (FOCS 2012) (which gives a seed length of $s^{1/2+o(1)}$ for arbitrary branching programs of size $s$). Our techniques also give seed length $n^{1/2+o(1)}$ for general oblivious, read-once branching programs of width $2^{n^{o(1)}}$, which is incomparable to the results of Impagliazzo et al. Our pseudorandom generator is similar to the one used by Gopalan et al. (FOCS 2012) for read-once CNFs, but the analysis is quite different; ours is based on Fourier analysis of branching programs. In particular, we show that an oblivious, read-once, regular branching program of width $w$ has Fourier mass at most $(2w^2)^k$ at level $k$, independent of the length of the program.

Comment: RANDOM 201
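To make "Fourier mass at level $k$" concrete, the following brute-force Python sketch computes it for a small real-valued function on $\{0,1\}^n$. This is a scalar simplification chosen for illustration: the paper works with the matrix-valued function computed by a branching program, which this code does not model. The function names are hypothetical, and the running time is exponential in $n$.

```python
from itertools import product


def fourier_coefficients(f, n):
    """Return {S: f_hat(S)} for f: {0,1}^n -> R, with S given as a 0/1 indicator tuple.

    f_hat(S) = E_x[ f(x) * (-1)**(sum of x_i for i in S) ] under uniform x.
    """
    points = list(product((0, 1), repeat=n))
    coeffs = {}
    for s in points:
        total = 0.0
        for x in points:
            parity = sum(si & xi for si, xi in zip(s, x)) % 2
            total += f(x) * (-1 if parity else 1)
        coeffs[s] = total / len(points)
    return coeffs


def fourier_mass_at_level(f, n, k):
    """Sum of |f_hat(S)| over all sets S of size exactly k."""
    coeffs = fourier_coefficients(f, n)
    return sum(abs(c) for s, c in coeffs.items() if sum(s) == k)


def parity(x):
    """The +/-1-valued parity function; all its mass sits at the top level."""
    return (-1) ** sum(x)


print([fourier_mass_at_level(parity, 3, k) for k in range(4)])
```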

arXiv.org e-Print Archive

Crossref

Fully decentralized computation of aggregates over data streams

Authors: Adi Rosen; Becchetti Luca; Bordino Ilaria; Leonardi Stefano
Publication venue: Association for Computing Machinery (ACM)
Publication date: 01/01/2010

In several emerging applications, data is collected in massive streams at several distributed points of observation. A basic and challenging task is to allow every node to monitor a neighbourhood of interest by issuing continuous aggregate queries on the streams observed in its vicinity. This class of algorithms is fully decentralized and diffusive in nature: collecting all data at a few central nodes of the network is infeasible in networks of low-capability devices or in the presence of massive data sets. The main difficulty in designing diffusive algorithms is coping with duplicate detections. These arise both from the observation of the same event at several nodes of the network and from the receipt of the same aggregated information along multiple paths of diffusion. In this paper, we consider fully decentralized algorithms that locally answer continuous aggregate queries on the number of distinct events, the total number of events and the second frequency moment in the scenario outlined above. The proposed algorithms use sublinear space at every node, either in the worst case or on realistic distributions. We also propose strategies that minimize the communication needed to update the aggregates when new events are observed. We experimentally evaluate the efficiency and accuracy of our algorithms on realistic simulated scenarios.
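The duplicate problem described above can be illustrated with a duplicate-insensitive summary. The Python sketch below uses a k-minimum-values (KMV) sketch for the number of distinct events; it is a generic technique swapped in for illustration, not one of the algorithms proposed in the paper. Merging is idempotent, so receiving the same event or the same partial aggregate along several paths does not change the result. The class name, the SHA-256 hashing and the parameter `k` are assumptions.

```python
import hashlib


def unit_hash(item):
    """Map an item to a pseudo-uniform float in [0, 1] via SHA-256."""
    digest = hashlib.sha256(repr(item).encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") / 2 ** 64


class KMVSketch:
    """Keeps the k smallest hash values seen; duplicates hash to the same value."""

    def __init__(self, k):
        self.k = k
        self.values = set()

    def add(self, item):
        self.values.add(unit_hash(item))
        if len(self.values) > self.k:
            self.values.remove(max(self.values))

    def merge(self, other):
        """Return the sketch of the union of both inputs (idempotent, commutative)."""
        merged = KMVSketch(self.k)
        merged.values = set(sorted(self.values | other.values)[: self.k])
        return merged

    def estimate(self):
        """Estimate the number of distinct items added so far."""
        if len(self.values) < self.k:
            return float(len(self.values))  # fewer than k distinct items: exact
        return (self.k - 1) / max(self.values)


node_a, node_b = KMVSketch(64), KMVSketch(64)
for event in range(300):
    node_a.add(event)
for event in range(200, 500):  # overlaps with node_a on 100 events
    node_b.add(event)
print(node_a.merge(node_b).estimate())
```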

Archivio della ricerca- Università di Roma La Sapienza

On Estimating the First Frequency Moment of Data Streams

Authors: Ganguly Sumit; Kar Purushottam
Publication date: 01/01/2010

Estimating the first moment of a data stream, defined as $F_1 = \sum_{i \in \{1, 2, \ldots, n\}} |f_i|$, to within $(1 \pm \epsilon)$-relative error with high probability is a basic and influential problem in data stream processing. A tight space bound of $O(\epsilon^{-2} \log (mM))$ is known from the work of [Kane-Nelson-Woodruff-SODA10]. However, all known algorithms for this problem require per-update stream processing time of $\Omega(\epsilon^{-2})$, with the only exception being the algorithm of [Ganguly-Cormode-RANDOM07], which requires per-update processing time of $O(\log^2(mM)(\log n))$ albeit with sub-optimal space $O(\epsilon^{-3}\log^2(mM))$. In this paper, we present an algorithm for estimating $F_1$ that achieves near-optimality in both space and update processing time. The space requirement is $O(\epsilon^{-2}(\log n + (\log \epsilon^{-1})\log(mM)))$ and the per-update processing time is $O((\log n)\log (\epsilon^{-1}))$.

Comment: 12 page
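As background for the $\Omega(\epsilon^{-2})$ per-update cost mentioned above, here is a minimal Python sketch of the classical 1-stable (Cauchy) linear sketch for $F_1$. It is not the algorithm of this paper: every update touches all rows, which is exactly the cost the paper aims to avoid. Each row stores $y_j = \sum_i f_i\, C_j(i)$ with standard Cauchy $C_j(i)$, and the median of $|y_j|$ estimates $F_1$ because the median of $|C|$ is 1. The names and the seed-based regeneration of $C_j(i)$ are assumptions.

```python
import math
import random
import statistics


def cauchy(rng):
    """Draw a standard Cauchy variable by inverse-CDF sampling."""
    return math.tan(math.pi * (rng.random() - 0.5))


def l1_sketch(updates, num_rows, seed=0):
    """Process (i, delta) updates; row j accumulates sum_i f_i * C_j(i)."""
    sketch = [0.0] * num_rows
    for i, delta in updates:
        for j in range(num_rows):  # every update touches every row
            c = cauchy(random.Random(f"{seed}:{j}:{i}"))
            sketch[j] += delta * c
    return sketch


def estimate_f1(sketch):
    """Median of |y_j|; each y_j is distributed as F_1 times a standard Cauchy."""
    return statistics.median(abs(y) for y in sketch)


# Final frequencies are f_0 = 3, f_1 = -3, f_2 = 4, so F_1 = 10.
updates = [(0, 5), (1, -3), (2, 4), (0, -2)]
print(estimate_f1(l1_sketch(updates, num_rows=501)))
```

The number of rows plays the role of $\epsilon^{-2}$: more rows tighten the estimate but raise both space and per-update time linearly.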

arXiv.org e-Print Archive

CiteSeerX

Better Pseudorandom Generators from Milder Pseudorandom Restrictions

Authors: Gopalan Parikshit; Meka Raghu; Reingold Omer; Trevisan Luca; Vadhan Salil
Publication date: 01/01/2012

We present an iterative approach to constructing pseudorandom generators, based on the repeated application of mild pseudorandom restrictions. We use this template to construct pseudorandom generators for combinatorial rectangles and read-once CNFs and a hitting set generator for width-3 branching programs, all of which achieve near-optimal seed-length even in the low-error regime: we get seed-length $O(\log (n/\epsilon))$ for error $\epsilon$. Previously, only constructions with seed-length $O(\log^{3/2} n)$ or $O(\log^2 n)$ were known for these classes with polynomially small error. The (pseudo)random restrictions we use are milder than those typically used for proving circuit lower bounds in that we only set a constant fraction of the bits at a time. While such restrictions do not simplify the functions drastically, we show that they can be derandomized using small-bias spaces.

Comment: To appear in FOCS 201

arXiv.org e-Print Archive

CiteSeerX

Crossref

Embed and Conquer: Scalable Embeddings for Kernel k-Means on MapReduce

Authors: Elgohary Ahmed; Farahat Ahmed K.; Kamel Mohamed S.; Karray Fakhri
Publication date: 29/01/2014

The kernel $k$-means is an effective method for data clustering which extends the commonly-used $k$-means algorithm to work on a similarity matrix over complex data structures. The kernel $k$-means algorithm is, however, computationally very complex as it requires the complete data matrix to be calculated and stored. Further, the kernelized nature of the kernel $k$-means algorithm hinders the parallelization of its computations on modern infrastructures for distributed computing. In this paper, we define a family of kernel-based low-dimensional embeddings that allows for scaling kernel $k$-means on MapReduce via an efficient and unified parallelization strategy. Afterwards, we propose two methods for low-dimensional embedding that adhere to our definition of the embedding family. Exploiting the proposed parallelization strategy, we present two scalable MapReduce algorithms for kernel $k$-means. We demonstrate the effectiveness and efficiency of the proposed algorithms through an empirical evaluation on benchmark data sets.

Comment: Appears in Proceedings of the SIAM International Conference on Data Mining (SDM), 201

arXiv.org e-Print Archive

CiteSeerX