Recursive Sketching For Frequency Moments
In a ground-breaking paper, Indyk and Woodruff (STOC 05) showed how to
compute F_k (for k > 2) in space complexity O(polylog(n,m) · n^{1-2/k}),
which is optimal up to (large) poly-logarithmic factors in
n and m, where m is the length of the stream and n is the upper bound on
the number of distinct elements in a stream. The best known lower bound for
large moments is . A follow-up work of
Bhuvanagiri, Ganguly, Kesh and Saha (SODA 2006) reduced the poly-logarithmic
factors of Indyk and Woodruff to . Further reduction of poly-log factors has been an elusive
goal since 2006, when the Indyk and Woodruff method seemed to hit a natural
"barrier." Using our simple recursive sketch, we provide a different yet simple
approach to obtain a algorithm for constant (our bound is, in fact, somewhat
stronger, where the term can be replaced by any constant number
of iterations instead of just two or three, thus approaching .
Our bound also works for non-constant (for details see the body of
the paper). Further, our algorithm requires only -wise independence, in
contrast to existing methods that use pseudo-random generators for computing
large frequency moments.
Towards Optimal Moment Estimation in Streaming and Distributed Models
One of the oldest problems in the data stream model is to approximate the p-th moment ||X||_p^p = sum_{i=1}^n X_i^p of an underlying non-negative vector X in R^n, which is presented as a sequence of poly(n) updates to its coordinates. Of particular interest is when p in (0,2]. Although a tight space bound of Theta(epsilon^-2 log n) bits is known for this problem when both positive and negative updates are allowed, surprisingly there is still a gap in the space complexity of this problem when all updates are positive. Specifically, the upper bound is O(epsilon^-2 log n) bits, while the lower bound is only Omega(epsilon^-2 + log n) bits. Recently, an upper bound of O~(epsilon^-2 + log n) bits was obtained under the assumption that the updates arrive in a random order.
We show that for p in (0, 1], the random order assumption is not needed. Namely, we give an upper bound for worst-case streams of O~(epsilon^-2 + log n) bits for estimating ||X||_p^p. Our techniques also give new upper bounds for estimating the empirical entropy in a stream. On the other hand, we show that for p in (1,2], in the natural coordinator and blackboard distributed communication topologies, there is an O~(epsilon^-2) bit max-communication upper bound based on a randomized rounding scheme. Our protocols also give rise to protocols for heavy hitters and approximate matrix product. We generalize our results to arbitrary communication topologies G, obtaining an O~(epsilon^-2 log d) max-communication upper bound, where d is the diameter of G. Interestingly, our upper bound rules out natural communication complexity-based approaches for proving an Omega(epsilon^-2 log n) bit lower bound for p in (1,2] for streaming algorithms. In particular, any such lower bound must come from a topology with large diameter.
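As a point of reference for the quantity being approximated, the p-th moment defined above can be computed exactly when the whole vector fits in memory. The sketch below is only that naive baseline, not the small-space algorithm of the abstract; the function name `exact_moment` and the `(index, delta)` update format are assumptions made for illustration.

```python
from collections import defaultdict


def exact_moment(updates, p):
    """Exact p-th moment sum_i X_i^p of the vector built from (index, delta) updates.

    Stores every touched coordinate, so it uses far more space than a
    streaming sketch would; it only pins down what is being estimated.
    """
    x = defaultdict(int)
    for i, delta in updates:
        x[i] += delta  # apply the update to coordinate i
    # X is assumed non-negative; zero coordinates contribute nothing.
    return sum(v ** p for v in x.values() if v > 0)


# Hypothetical stream of positive updates: X_0 = 3, X_3 = 1, X_7 = 4.
stream = [(0, 2), (3, 1), (0, 1), (7, 4)]
print(exact_moment(stream, 1))  # 3 + 1 + 4
print(exact_moment(stream, 2))  # 9 + 1 + 16
```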
Quantized Compressive K-Means
The recent framework of compressive statistical learning aims at designing
tractable learning algorithms that use only a heavily compressed
representation, or sketch, of massive datasets. Compressive K-Means (CKM) is such
a method: it estimates the centroids of data clusters from pooled, non-linear,
random signatures of the learning examples. While this approach significantly
reduces computational time on very large datasets, its digital implementation
wastes acquisition resources because the learning examples are compressed only
after the sensing stage. The present work generalizes the sketching procedure
initially defined in Compressive K-Means to a large class of periodic
nonlinearities including hardware-friendly implementations that compressively
acquire entire datasets. This idea is exemplified in a Quantized Compressive
K-Means procedure, a variant of CKM that leverages 1-bit universal quantization
(i.e. retaining the least significant bit of a standard uniform quantizer) as
the periodic sketch nonlinearity. Switching to this resource-efficient signature
(standard in most acquisition schemes) has almost no impact on clustering
performance, as illustrated by numerical experiments.
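A minimal sketch of the 1-bit universal quantizer described above, taken as the least significant bit of a uniform quantizer with step `delta`, and of a pooled signature built from it. The random frequencies, the dither, and the names `universal_bit` and `quantized_sketch` are illustrative assumptions, not the authors' implementation.

```python
import math
import random


def universal_bit(t, delta):
    """Least significant bit of a uniform quantizer with step delta.

    The result is a square wave in t with period 2 * delta.
    """
    return math.floor(t / delta) % 2


def quantized_sketch(examples, freqs, dithers, delta):
    """Pool (average) the 1-bit signatures of all examples."""
    m = len(freqs)
    pooled = [0.0] * m
    for x in examples:
        for j in range(m):
            # Random projection plus dither, then the periodic nonlinearity.
            projection = sum(w * xi for w, xi in zip(freqs[j], x)) + dithers[j]
            pooled[j] += universal_bit(projection, delta)
    return [s / len(examples) for s in pooled]


rng = random.Random(0)  # fixed seed so the example is reproducible
dim, m, delta = 2, 8, 0.5
freqs = [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(m)]
dithers = [rng.uniform(0.0, 2 * delta) for _ in range(m)]
examples = [[0.1, 0.2], [0.15, 0.25], [2.0, -1.0]]
sketch = quantized_sketch(examples, freqs, dithers, delta)
print(sketch)
```

Each entry of `sketch` is the fraction of examples whose quantized projection has its bit set, so the whole dataset is summarized by `m` numbers regardless of how many examples were pooled.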
Approximating Subadditive Hadamard Functions on Implicit Matrices
An important challenge in the streaming model is to maintain small-space
approximations of entrywise functions performed on a matrix that is generated
by the outer product of two vectors given as a stream. In other works, streams
typically define matrices in a standard way via a sequence of updates, as in
the work of Woodruff (2014) and others. We describe the matrix formed by the
outer product, and other matrices that do not fall into this category, as
implicit matrices. As such, we consider the general problem of computing over
such implicit matrices with Hadamard functions, which are functions applied
entrywise on a matrix. In this paper, we apply this generalization to provide
new techniques for identifying independence between two vectors in the
streaming model. The previous state-of-the-art algorithm of Braverman and
Ostrovsky (2010) gave a -approximation for the distance
between the product and joint distributions, using space , where is the length of the stream and denotes the
size of the universe from which stream elements are drawn. Our general
techniques include the distance as a special case, and we give an
improved space bound of
Sketching for Large-Scale Learning of Mixture Models
Learning parameters from voluminous data can be prohibitive in terms of
memory and computational requirements. We propose a "compressive learning"
framework where we estimate model parameters from a sketch of the training
data. This sketch is a collection of generalized moments of the underlying
probability distribution of the data. It can be computed in a single pass on
the training set, and is easily computable on streams or distributed datasets.
The proposed framework shares similarities with compressive sensing, which aims
at drastically reducing the dimension of high-dimensional signals while
preserving the ability to reconstruct them. To perform the estimation task, we
derive an iterative algorithm analogous to sparse reconstruction algorithms in
the context of linear inverse problems. We exemplify our framework with the
compressive estimation of a Gaussian Mixture Model (GMM), providing heuristics
on the choice of the sketching procedure and theoretical guarantees of
reconstruction. We experimentally show on synthetic data that the proposed
algorithm yields results comparable to the classical Expectation-Maximization
(EM) technique while requiring significantly less memory and fewer computations
when the number of database elements is large. We further demonstrate the
potential of the approach on real large-scale data (over 10^8 training samples)
for the task of model-based speaker verification. Finally, we draw some
connections between the proposed framework and approximate Hilbert space
embedding of probability distributions using random features. We show that the
proposed sketching operator can be seen as an innovative method to design
translation-invariant kernels adapted to the analysis of GMMs. We also use this
theoretical framework to derive information preservation guarantees, in the
spirit of infinite-dimensional compressive sensing.
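The single-pass and distributed properties of such a sketch can be illustrated as follows, assuming the generalized moments are empirical averages of complex exponentials exp(i w^T x) at random frequencies w (a random-features choice made here for illustration). The class `MomentSketch` and the helper `draw_frequencies` are hypothetical names, and the recovery of model parameters from the sketch is not shown.

```python
import cmath
import random


def draw_frequencies(m, dim, scale=1.0, seed=0):
    """Draw m random frequency vectors (Gaussian, an assumed choice)."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, scale) for _ in range(dim)] for _ in range(m)]


class MomentSketch:
    """Running average of exp(i * w^T x) over the data, one entry per frequency."""

    def __init__(self, freqs):
        self.freqs = freqs
        self.total = [0j] * len(freqs)
        self.count = 0

    def update(self, x):
        # One pass: each sample is seen once and then discarded.
        for j, w in enumerate(self.freqs):
            phase = sum(wi * xi for wi, xi in zip(w, x))
            self.total[j] += cmath.exp(1j * phase)
        self.count += 1

    def merge(self, other):
        # Sketches of disjoint chunks combine by adding sums and counts.
        self.total = [a + b for a, b in zip(self.total, other.total)]
        self.count += other.count

    def value(self):
        return [t / self.count for t in self.total]


freqs = draw_frequencies(m=4, dim=2)
data = [[0.0, 1.0], [1.0, 1.5], [-0.5, 2.0], [3.0, -1.0]]

full = MomentSketch(freqs)
for x in data:
    full.update(x)

left, right = MomentSketch(freqs), MomentSketch(freqs)
for x in data[:2]:
    left.update(x)
for x in data[2:]:
    right.update(x)
left.merge(right)

# Up to floating-point rounding, the merged sketch matches the single-pass one.
print(max(abs(a - b) for a, b in zip(full.value(), left.value())))
```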