Recursive Sketching For Frequency Moments
In a ground-breaking paper, Indyk and Woodruff (STOC 05) showed how to
compute $F_k$ (for $k>2$) in space complexity $O(\mathrm{polylog}(n,m)\cdot
n^{1-\frac2k})$, which is optimal up to (large) poly-logarithmic factors in
$n$ and $m$, where $m$ is the length of the stream and $n$ is the upper bound on
the number of distinct elements in a stream. The best known lower bound for
large moments is $\Omega(\log(n)\, n^{1-\frac2k})$. A follow-up work of
Bhuvanagiri, Ganguly, Kesh and Saha (SODA 2006) reduced the poly-logarithmic
factors of Indyk and Woodruff to $O(\log^2(m)\cdot(\log n+\log m)\cdot n^{1-\frac2k})$. Further reduction of poly-log factors has been an elusive
goal since 2006, when the Indyk and Woodruff method seemed to hit a natural
"barrier." Using our simple recursive sketch, we provide a different yet simple
approach to obtain an $O(\log(m)\log(nm)\cdot(\log\log n)^4\cdot n^{1-\frac2k})$
algorithm for constant $\epsilon$ (our bound is, in fact, somewhat
stronger: the $(\log\log n)$ term can be replaced by any constant number
of $\log$ iterations instead of just two or three, thus approaching $\log^* n$).
Our bound also works for non-constant $\epsilon$ (for details see the body of
the paper). Further, our algorithm requires only $4$-wise independence, in
contrast to existing methods that use pseudo-random generators for computing
large frequency moments.
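For readers unfamiliar with the quantity being approximated: the $k$-th frequency moment of a stream is $F_k = \sum_i f_i^k$, where $f_i$ is the number of occurrences of item $i$. The Python sketch below computes it exactly in linear space. It only illustrates the definition and is not the sublinear-space recursive sketch of the paper; the function name `frequency_moment` and the sample stream are assumptions made for illustration.

```python
from collections import Counter


def frequency_moment(stream, k):
    """Exact k-th frequency moment F_k = sum_i f_i**k (illustrative helper)."""
    counts = Counter(stream)  # f_i: number of occurrences of each item i
    return sum(f ** k for f in counts.values())


# Hypothetical sample stream: frequencies are a=2, b=1, c=3.
stream = ["a", "a", "b", "c", "c", "c"]
f0 = frequency_moment(stream, 0)  # number of distinct items
f1 = frequency_moment(stream, 1)  # length of the stream
f3 = frequency_moment(stream, 3)  # a "large" moment (k > 2)
print(f0, f1, f3)
```

Storing every count takes space linear in the number of distinct items, which is exactly the cost that streaming algorithms for $F_k$ try to avoid.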
Max-stable sketches: estimation of Lp-norms, dominance norms and point queries for non-negative signals
Max-stable random sketches can be computed efficiently on fast streaming
positive data sets by using only sequential access to the data. They can be
used to answer point and Lp-norm queries for the signal. There is an intriguing
connection between the so-called p-stable (or sum-stable) and the max-stable
sketches. Rigorous performance guarantees through error-probability estimates
are derived and the algorithmic implementation is discussed.
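One way to picture a max-stable sketch, as a minimal illustration under stated assumptions rather than the authors' implementation: for a non-negative signal $f$, each sketch cell stores $E_j = \max_i f(i) Z_j(i)$ with i.i.d. standard $\alpha$-Fréchet variables $Z_j(i)$. Each $E_j$ is then $\alpha$-Fréchet with scale $\|f\|_\alpha$, so the norm can be read off the sample median, because the median of an $\alpha$-Fréchet variable with scale $\sigma$ is $\sigma (\ln 2)^{-1/\alpha}$. The class name `MaxStableSketch`, the seeding scheme that regenerates $Z_j(i)$ on demand, and the median-based estimator are all assumptions chosen for this example.

```python
import math
import random
import statistics


def frechet(alpha, rng):
    """Sample a standard alpha-Frechet variable: P(Z <= x) = exp(-x**(-alpha))."""
    u = rng.random()
    while u == 0.0:  # guard against log(0)
        u = rng.random()
    return (-math.log(u)) ** (-1.0 / alpha)


class MaxStableSketch:
    """Illustrative max-stable sketch for non-negative signals (hypothetical API)."""

    def __init__(self, alpha, size, seed=0):
        self.alpha, self.size, self.seed = alpha, size, seed
        self.cells = [0.0] * size

    def _z(self, i, j):
        # Assumption: Z_j(i) is regenerated from (seed, i, j) instead of stored.
        rng = random.Random((self.seed * 1_000_003 + i) * 1_000_003 + j)
        return frechet(self.alpha, rng)

    def update(self, i, value):
        """Process one stream item: coordinate i carries a non-negative value."""
        for j in range(self.size):
            self.cells[j] = max(self.cells[j], value * self._z(i, j))

    def norm_estimate(self):
        """Median-based estimate of the L_alpha norm of the signal."""
        return statistics.median(self.cells) * math.log(2) ** (1.0 / self.alpha)


# Hypothetical signal with 40 coordinates, read once in sequential order.
signal = [(i, 1.0 + (i % 5)) for i in range(40)]
sketch = MaxStableSketch(alpha=2.0, size=300, seed=7)
for i, v in signal:
    sketch.update(i, v)

true_norm = math.sqrt(sum(v * v for _, v in signal))
print(true_norm, sketch.norm_estimate())
```

Because the update only takes maxima, the sketch needs a single sequential pass, and repeated updates to the same coordinate keep the largest value seen for it.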
Pseudorandomness for Regular Branching Programs via Fourier Analysis
We present an explicit pseudorandom generator for oblivious, read-once,
permutation branching programs of constant width that can read their input bits
in any order. The seed length is $O(\log^2 n)$, where $n$ is the length of the
branching program. The previous best seed length known for this model was
$n^{1/2+o(1)}$, which follows as a special case of a generator due to
Impagliazzo, Meka, and Zuckerman (FOCS 2012) (which gives a seed length of
$s^{1/2+o(1)}$ for arbitrary branching programs of size $s$). Our techniques
also give seed length $n^{1/2+o(1)}$ for general oblivious, read-once branching
programs of width $2^{n^{o(1)}}$, which is incomparable to the results of
Impagliazzo et al. Our pseudorandom generator is similar to the one used by
Gopalan et al. (FOCS 2012) for read-once CNFs, but the analysis is quite
different; ours is based on Fourier analysis of branching programs. In
particular, we show that an oblivious, read-once, regular branching program of
width $w$ has Fourier mass at most $(2w^2)^k$ at level $k$, independent of the
length of the program.
Comment: RANDOM 201
Fully decentralized computation of aggregates over data streams
In several emerging applications, data is collected in massive streams at several distributed points of observation. A basic and challenging task is to allow every node to monitor a neighbourhood of interest by issuing continuous aggregate queries on the streams observed in its vicinity. This class of algorithms is fully decentralized and diffusive in nature: collecting all data at a few central nodes of the network is infeasible in networks of low-capability devices or in the presence of massive data sets. The main difficulty in designing diffusive algorithms is coping with duplicate detections. These arise both from the observation of the same event at several nodes of the network and from the receipt of the same aggregated information along multiple paths of diffusion. In this paper, we consider fully decentralized algorithms that locally answer continuous aggregate queries on the number of distinct events, the total number of events and the second frequency moment in the scenario outlined above. The proposed algorithms use sublinear space at every node, either in the worst case or on realistic distributions. We also propose strategies that minimize the communication needed to update the aggregates when new events are observed. We experimentally evaluate the efficiency and accuracy of our algorithms on realistic simulated scenarios.
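To make the duplicate-detection issue concrete, here is a minimal sketch of a duplicate-insensitive distinct-event counter. The abstract does not say which summaries the authors use, so this example swaps in a standard k-minimum-values (KMV) sketch. The names `KMVSketch` and `_hash01`, the use of SHA-256 as the hash, and the node and event identifiers are all assumptions made for illustration.

```python
import hashlib


def _hash01(event):
    """Map an event to a pseudo-uniform value in [0, 1) via SHA-256."""
    digest = hashlib.sha256(repr(event).encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") / 2 ** 64


class KMVSketch:
    """k-minimum-values sketch: keeps the k smallest hash values seen."""

    def __init__(self, k=256, values=()):
        self.k = k
        self.values = sorted(set(values))[:k]

    def add(self, event):
        # Re-adding an event yields the same hash, so duplicates are ignored.
        self.values = sorted(set(self.values) | {_hash01(event)})[: self.k]

    def merge(self, other):
        """Combine two sketches; order and repetition of merges do not matter."""
        return KMVSketch(self.k, set(self.values) | set(other.values))

    def estimate(self):
        if len(self.values) < self.k:
            return float(len(self.values))  # fewer than k distinct: exact count
        return (self.k - 1) / self.values[-1]


# Two hypothetical nodes that both observe events 2000..2999.
node_a, node_b, reference = KMVSketch(), KMVSketch(), KMVSketch()
for e in range(0, 3000):
    node_a.add(e)
for e in range(2000, 5000):
    node_b.add(e)
for e in range(0, 5000):
    reference.add(e)

merged = node_a.merge(node_b)
print(merged.estimate())  # approximates the 5000 distinct events
```

The merged sketch is identical to the one built from the union of the events, which is what makes this kind of summary safe to forward along multiple diffusion paths: receiving the same information twice changes nothing.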
On Estimating the First Frequency Moment of Data Streams
Estimating the first moment of a data stream, defined as $F_1 = \sum_{i \in
\{1, 2, \ldots, n\}} |f_i|$, to within $\epsilon$-relative error with
high probability is a basic and influential problem in data stream processing.
A tight space bound of $O(\epsilon^{-2} \log(mM))$ is known from the work of
[Kane-Nelson-Woodruff-SODA10]. However, all known algorithms for this problem
require per-update stream processing time of $\Omega(\epsilon^{-2})$, with the
only exception being the algorithm of [Ganguly-Cormode-RANDOM07] that requires
per-update processing time of $O(\log^2(mM)(\log n))$ albeit with sub-optimal
space $O(\epsilon^{-3}\log^2(mM))$. In this paper, we present an algorithm for
estimating $F_1$ that achieves near-optimality in both space and update
processing time. The space requirement is $O(\epsilon^{-2}(\log n + (\log \epsilon^{-1})\log(mM)))$ and the per-update processing time is $O((\log n)\log(\epsilon^{-1}))$.
Comment: 12 page
Better Pseudorandom Generators from Milder Pseudorandom Restrictions
We present an iterative approach to constructing pseudorandom generators,
based on the repeated application of mild pseudorandom restrictions. We use
this template to construct pseudorandom generators for combinatorial rectangles
and read-once CNFs and a hitting set generator for width-3 branching programs,
all of which achieve near-optimal seed-length even in the low-error regime: we
get seed-length $O(\log(n/\epsilon))$ for error $\epsilon$. Previously, only
constructions with seed-length $O(\log^{3/2} n)$ or $O(\log^2 n)$ were known for
these classes with polynomially small error.
The (pseudo)random restrictions we use are milder than those typically used
for proving circuit lower bounds in that we only set a constant fraction of the
bits at a time. While such restrictions do not simplify the functions
drastically, we show that they can be derandomized using small-bias spaces.
Comment: To appear in FOCS 201
Embed and Conquer: Scalable Embeddings for Kernel k-Means on MapReduce
Kernel $k$-means is an effective method for data clustering which extends
the commonly used $k$-means algorithm to work on a similarity matrix over
complex data structures. The kernel $k$-means algorithm is, however,
computationally very complex as it requires the complete data matrix to be
calculated and stored. Further, the kernelized nature of the kernel $k$-means
algorithm hinders the parallelization of its computations on modern
infrastructures for distributed computing. In this paper, we define a
family of kernel-based low-dimensional embeddings that allows for scaling
kernel $k$-means on MapReduce via an efficient and unified parallelization
strategy. Afterwards, we propose two methods for low-dimensional embedding that
adhere to our definition of the embedding family. Exploiting the proposed
parallelization strategy, we present two scalable MapReduce algorithms for
kernel $k$-means. We demonstrate the effectiveness and efficiency of the
proposed algorithms through an empirical evaluation on benchmark data sets.
Comment: Appears in Proceedings of the SIAM International Conference on Data
Mining (SDM), 201