168 research outputs found
Coresets-Methods and History: A Theoreticians Design Pattern for Approximation and Streaming Algorithms
We present a technical survey of state-of-the-art approaches to data reduction and the coreset framework. These include geometric decompositions, gradient methods, random sampling, sketching, and random projections. We further outline their importance for the design of streaming algorithms and give a brief overview of lower-bounding techniques.
Optimality of the Johnson-Lindenstrauss Lemma
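For any integers $d, n \geq 2$ and $1/(\min\{n,d\})^{0.4999} < \varepsilon < 1$, we show the existence of a set of $n$ vectors $X \subset \mathbb{R}^d$ such that any embedding $f : X \rightarrow \mathbb{R}^m$ satisfying
$\forall x, y \in X,\ (1-\varepsilon)\|x-y\|_2^2 \leq \|f(x)-f(y)\|_2^2 \leq (1+\varepsilon)\|x-y\|_2^2$
must have $m = \Omega(\varepsilon^{-2} \lg n)$. This lower bound matches the upper bound given by the Johnson-Lindenstrauss
lemma [JL84]. Furthermore, our lower bound holds for nearly the full range of
$\varepsilon$ of interest, since there is always an isometric embedding into
dimension $\min\{d, n\}$ (either the identity map, or projection onto
$\mathrm{span}(X)$).
Previously such a lower bound was only known to hold against linear maps $f$,
and not for such a wide range of parameters $\varepsilon, n, d$ [LN16]. The
best previously known lower bound for general $f$ was $m = \Omega(\varepsilon^{-2} \lg n / \lg(1/\varepsilon))$ [Wel74, Lev83, Alo03], which
is suboptimal for any $\varepsilon = o(1)$.
Comment: v2: simplified proof, also added reference to Lev8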
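The upper bound that this lower bound matches can be illustrated with a small experiment. The following is a minimal sketch, assuming a dense Gaussian projection matrix (one standard way to realize a Johnson-Lindenstrauss map, not necessarily the construction of [JL84]). The function names, the dimensions, and the seeds are illustrative assumptions and do not come from the paper.

```python
import math
import random


def jl_project(points, m, seed=0):
    """Map each d-dimensional point to m dimensions with a random Gaussian matrix."""
    rng = random.Random(seed)
    d = len(points[0])
    scale = 1.0 / math.sqrt(m)  # makes squared norms correct in expectation
    matrix = [[rng.gauss(0.0, 1.0) * scale for _ in range(d)] for _ in range(m)]
    return [[sum(row[j] * p[j] for j in range(d)) for row in matrix] for p in points]


def sq_dist(x, y):
    """Squared Euclidean distance between two vectors."""
    return sum((a - b) ** 2 for a, b in zip(x, y))


def distortion_range(points, projected):
    """Smallest and largest ratio of projected to original squared distance."""
    ratios = [
        sq_dist(projected[i], projected[j]) / sq_dist(points[i], points[j])
        for i in range(len(points))
        for j in range(i + 1, len(points))
    ]
    return min(ratios), max(ratios)


# Illustrative sizes only: n = 15 points, d = 600, target dimension m = 300.
data_rng = random.Random(1)
points = [[data_rng.gauss(0.0, 1.0) for _ in range(600)] for _ in range(15)]
projected = jl_project(points, m=300)
low, high = distortion_range(points, projected)
print(f"squared-distance ratios lie in [{low:.3f}, {high:.3f}]")
```

Here `low` and `high` play the roles of $1-\varepsilon$ and $1+\varepsilon$ in the embedding condition above. Shrinking `m` widens the interval, which is the trade-off the lower bound shows to be unavoidable.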
Dimensionality Reduction for k-Means Clustering and Low Rank Approximation
We show how to approximate a data matrix $\mathbf{A}$ with a much smaller
sketch $\tilde{\mathbf{A}}$ that can be used to solve a general class of
constrained k-rank approximation problems to within $(1+\epsilon)$ error.
Importantly, this class of problems includes $k$-means clustering and
unconstrained low rank approximation (i.e. principal component analysis). By
reducing data points to just $O(k)$ dimensions, our methods generically
accelerate any exact, approximate, or heuristic algorithm for these ubiquitous
problems.
For $k$-means dimensionality reduction, we provide $(1+\epsilon)$ relative
error results for many common sketching techniques, including random row
projection, column selection, and approximate SVD. For approximate principal
component analysis, we give a simple alternative to known algorithms that has
applications in the streaming setting. Additionally, we extend recent work on
column-based matrix reconstruction, giving column subsets that not only `cover'
a good subspace for $\mathbf{A}$, but can be used directly to compute this
subspace.
Finally, for $k$-means clustering, we show how to achieve a $(9+\epsilon)$
approximation by Johnson-Lindenstrauss projecting data points to just $O(\log k/\epsilon^2)$ dimensions. This gives the first result that leverages the
specific structure of $k$-means to achieve dimension independent of input size
and sublinear in $k$.
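The general recipe behind the last result (project the points, cluster in the low-dimensional space, then evaluate the resulting partition in the original space) can be sketched as follows. This is a hedged illustration, not the paper's algorithm or analysis: the Gaussian projection, the Lloyd iteration with farthest-point initialization, the target dimension `m=20`, and all function names are assumptions chosen for the example.

```python
import math
import random


def sq_dist(x, y):
    """Squared Euclidean distance between two vectors."""
    return sum((a - b) ** 2 for a, b in zip(x, y))


def jl_project(points, m, seed=0):
    """Random Gaussian projection to m dimensions (assumed sketching method)."""
    rng = random.Random(seed)
    d = len(points[0])
    scale = 1.0 / math.sqrt(m)
    matrix = [[rng.gauss(0.0, 1.0) * scale for _ in range(d)] for _ in range(m)]
    return [[sum(row[j] * p[j] for j in range(d)) for row in matrix] for p in points]


def lloyd(points, k, iters=10):
    """Lloyd's heuristic with deterministic farthest-point initialization."""
    centers = [points[0]]
    while len(centers) < k:
        centers.append(max(points, key=lambda p: min(sq_dist(p, c) for c in centers)))
    labels = [0] * len(points)
    for _ in range(iters):
        # Assign every point to its nearest current center.
        labels = [min(range(k), key=lambda c: sq_dist(p, centers[c])) for p in points]
        # Move every non-empty cluster's center to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels


def kmeans_cost(points, labels, k):
    """k-means cost of a partition: squared distances to each cluster's centroid."""
    cost = 0.0
    for c in range(k):
        members = [p for p, lab in zip(points, labels) if lab == c]
        if members:
            centroid = [sum(col) / len(members) for col in zip(*members)]
            cost += sum(sq_dist(p, centroid) for p in members)
    return cost


# Synthetic data (an assumption): 3 well-separated clusters of 20 points in d = 200.
data_rng = random.Random(2)
points = [
    [30.0 * c + data_rng.gauss(0.0, 1.0) for _ in range(200)]
    for c in range(3)
    for _ in range(20)
]

labels_full = lloyd(points, k=3)                        # cluster in 200 dimensions
labels_proj = lloyd(jl_project(points, m=20), k=3)      # cluster in 20 dimensions
cost_full = kmeans_cost(points, labels_full, k=3)
cost_proj = kmeans_cost(points, labels_proj, k=3)       # evaluated on original points
print(f"cost ratio (projected / full): {cost_proj / cost_full:.3f}")
```

The key design point is that `cost_proj` is measured on the original points: the projection is used only to find a partition, and the approximation guarantee concerns the cost of that partition in the original space.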
Tracking the l_2 Norm with Constant Update Time
The l_2 tracking problem is the task of obtaining a streaming algorithm that, given access to a stream of items a_1,a_2,a_3,... from a universe [n], outputs at each time t an estimate of the l_2 norm of the frequency vector f^{(t)} in R^n (where f^{(t)}_i is the number of occurrences of item i in the stream up to time t). The previous work [Braverman-Chestnut-Ivkin-Nelson-Wang-Woodruff, PODS 2017] gave a streaming algorithm with (the optimal) space using O(epsilon^{-2}log(1/delta)) words and O(epsilon^{-2}log(1/delta)) update time to obtain an epsilon-accurate estimate with probability at least 1-delta. We give the first algorithm that achieves update time of O(log 1/delta), which is independent of the accuracy parameter epsilon, together with the nearly optimal space using O(epsilon^{-2}log(1/delta)) words. Our algorithm is obtained using the Count Sketch of [Charikar-Chen-Farach-Colton, ICALP 2002].
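To make the update-time claim concrete, here is a minimal sketch of a Count Sketch used as an l_2 estimator. It is a textbook-style illustration under stated assumptions, not the paper's algorithm or its tracking analysis: the class name `CountSketchL2`, the degree-3 polynomial hashes modulo a Mersenne prime, and the parameter values are all choices made for this example. Each update touches one counter per row, so with O(log 1/delta) rows the update cost does not depend on epsilon; only the row width does.

```python
import math
import random
import statistics

PRIME = (1 << 61) - 1  # Mersenne prime used as the hash modulus (an assumption)


class CountSketchL2:
    """Count Sketch table whose row-wise sums of squares estimate the squared l_2 norm."""

    def __init__(self, rows, width, seed=0):
        rng = random.Random(seed)
        self.rows = rows
        self.width = width
        self.table = [[0] * width for _ in range(rows)]
        # Random degree-3 polynomials: a standard 4-wise independent hash family.
        self.bucket_coeffs = [[rng.randrange(1, PRIME) for _ in range(4)] for _ in range(rows)]
        self.sign_coeffs = [[rng.randrange(1, PRIME) for _ in range(4)] for _ in range(rows)]

    @staticmethod
    def _poly(coeffs, x):
        """Evaluate the polynomial at x modulo PRIME (Horner's rule)."""
        acc = 0
        for c in coeffs:
            acc = (acc * x + c) % PRIME
        return acc

    def update(self, item, count=1):
        """Process `count` occurrences of `item`: one counter change per row."""
        for r in range(self.rows):
            bucket = self._poly(self.bucket_coeffs[r], item) % self.width
            sign = 1 if self._poly(self.sign_coeffs[r], item) & 1 else -1
            self.table[r][bucket] += sign * count

    def estimate_l2(self):
        """Median over rows of the sum of squared counters, then a square root."""
        row_estimates = [sum(v * v for v in row) for row in self.table]
        return math.sqrt(statistics.median(row_estimates))


def exact_l2(stream):
    """Exact l_2 norm of the frequency vector, for comparison only."""
    freq = {}
    for item in stream:
        freq[item] = freq.get(item, 0) + 1
    return math.sqrt(sum(c * c for c in freq.values()))


# Illustrative stream: 5000 items drawn from a universe of size 200.
stream_rng = random.Random(7)
stream = [stream_rng.randrange(200) for _ in range(5000)]
sketch = CountSketchL2(rows=5, width=256, seed=3)
for item in stream:
    sketch.update(item)
estimate = sketch.estimate_l2()
exact = exact_l2(stream)
print(f"estimate = {estimate:.1f}, exact = {exact:.1f}")
```

Calling `estimate_l2()` after every update gives an estimate at each time t, but the guarantee that all of these estimates are simultaneously accurate is exactly what the tracking analysis has to establish; this sketch does not demonstrate it.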