238 research outputs found
Towards Optimal Moment Estimation in Streaming and Distributed Models
One of the oldest problems in the data stream model is to approximate the p-th moment ||X||_p^p = sum_{i=1}^n X_i^p of an underlying non-negative vector X in R^n, which is presented as a sequence of poly(n) updates to its coordinates. Of particular interest is when p in (0,2]. Although a tight space bound of Theta(epsilon^-2 log n) bits is known for this problem when both positive and negative updates are allowed, surprisingly there is still a gap in the space complexity of this problem when all updates are positive. Specifically, the upper bound is O(epsilon^-2 log n) bits, while the lower bound is only Omega(epsilon^-2 + log n) bits. Recently, an upper bound of O~(epsilon^-2 + log n) bits was obtained under the assumption that the updates arrive in a random order.
We show that for p in (0, 1], the random order assumption is not needed. Namely, we give an upper bound for worst-case streams of O~(epsilon^-2 + log n) bits for estimating |X |_p^p. Our techniques also give new upper bounds for estimating the empirical entropy in a stream. On the other hand, we show that for p in (1,2], in the natural coordinator and blackboard distributed communication topologies, there is an O~(epsilon^-2) bit max-communication upper bound based on a randomized rounding scheme. Our protocols also give rise to protocols for heavy hitters and approximate matrix product. We generalize our results to arbitrary communication topologies G, obtaining an O~(epsilon^2 log d) max-communication upper bound, where d is the diameter of G. Interestingly, our upper bound rules out natural communication complexity-based approaches for proving an Omega(epsilon^-2 log n) bit lower bound for p in (1,2] for streaming algorithms. In particular, any such lower bound must come from a topology with large diameter
Streaming classification with emerging new class by class matrix sketching
National Research Foundation (NRF) Singapor
On Deterministic Sketching and Streaming for Sparse Recovery and Norm Estimation
We study classic streaming and sparse recovery problems using deterministic
linear sketches, including l1/l1 and linf/l1 sparse recovery problems (the
latter also being known as l1-heavy hitters), norm estimation, and approximate
inner product. We focus on devising a fixed matrix A in R^{m x n} and a
deterministic recovery/estimation procedure which work for all possible input
vectors simultaneously. Our results improve upon existing work, the following
being our main contributions:
* A proof that linf/l1 sparse recovery and inner product estimation are
equivalent, and that incoherent matrices can be used to solve both problems.
Our upper bound for the number of measurements is m=O(eps^{-2}*min{log n, (log
n / log(1/eps))^2}). We can also obtain fast sketching and recovery algorithms
by making use of the Fast Johnson-Lindenstrauss transform. Both our running
times and number of measurements improve upon previous work. We can also obtain
better error guarantees than previous work in terms of a smaller tail of the
input vector.
* A new lower bound for the number of linear measurements required to solve
l1/l1 sparse recovery. We show Omega(k/eps^2 + klog(n/k)/eps) measurements are
required to recover an x' with |x - x'|_1 <= (1+eps)|x_{tail(k)}|_1, where
x_{tail(k)} is x projected onto all but its largest k coordinates in magnitude.
* A tight bound of m = Theta(eps^{-2}log(eps^2 n)) on the number of
measurements required to solve deterministic norm estimation, i.e., to recover
|x|_2 +/- eps|x|_1.
For all the problems we study, tight bounds are already known for the
randomized complexity from previous work, except in the case of l1/l1 sparse
recovery, where a nearly tight bound is known. Our work thus aims to study the
deterministic complexities of these problems
Online Adaptive Mahalanobis Distance Estimation
Mahalanobis metrics are widely used in machine learning in conjunction with
methods like -nearest neighbors, -means clustering, and -medians
clustering. Despite their importance, there has not been any prior work on
applying sketching techniques to speed up algorithms for Mahalanobis metrics.
In this paper, we initiate the study of dimension reduction for Mahalanobis
metrics. In particular, we provide efficient data structures for solving the
Approximate Distance Estimation (ADE) problem for Mahalanobis distances. We
first provide a randomized Monte Carlo data structure. Then, we show how we can
adapt it to provide our main data structure which can handle sequences of
\textit{adaptive} queries and also online updates to both the Mahalanobis
metric matrix and the data points, making it amenable to be used in conjunction
with prior algorithms for online learning of Mahalanobis metrics
- …