Subspace exploration: Bounds on Projected Frequency Estimation
Given a multi-dimensional dataset, a projection query specifies a subset of
columns, which yields a new array restricted to those columns. We
study the space complexity of computing data analysis functions over such
subspaces, including heavy hitters and norms, when the subspaces are revealed
only after observing the data. We show that this important class of problems is
typically hard: for many problems, we prove space lower bounds. However, we
present upper bounds with a better space dependency: for suitable settings of
the parameters, an approximation can be obtained in space strictly smaller
than that of the naïve approach of keeping information for all possible
subsets of columns. Our results are based on careful constructions of
instances using coding theory and novel combinatorial reductions that exhibit
such space-approximation tradeoffs.
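To make the notion of a projection query concrete, here is a minimal exact-computation sketch (illustrative only, not the paper's small-space algorithm): a query selects a subset of columns, and frequencies are counted over the projected rows.

```python
from collections import Counter

def projected_frequencies(rows, columns):
    """Count occurrences of each row restricted to the given columns."""
    return Counter(tuple(row[c] for c in columns) for row in rows)

data = [
    (1, 0, 1),
    (1, 1, 1),
    (0, 0, 1),
    (1, 0, 0),
]

# Projection onto columns {0, 2} maps the rows to (1,1), (1,1), (0,1), (1,0).
freqs = projected_frequencies(data, [0, 2])
print(freqs.most_common(1))  # the heaviest projected row and its count
```

The difficulty studied in the paper is that the column subset is revealed only after the data has been observed, so this exact table cannot simply be precomputed for the one subset that matters.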
Sampling Sketches for Concave Sublinear Functions of Frequencies
We consider massive distributed datasets that consist of elements modeled as
key-value pairs and the task of computing statistics or aggregates where the
contribution of each key is weighted by a function of its frequency (sum of
values of its elements). This fundamental problem has a wealth of applications
in data analytics and machine learning, in particular, with concave sublinear
functions of the frequencies that mitigate the disproportionate effect of keys
with high frequency. The family of concave sublinear functions includes low
frequency moments, capping, logarithms, and their compositions. A
common approach is to sample keys, ideally proportionally to their
contributions, and to estimate statistics from the sample. A simple but
costly way to do this is to aggregate the data into a table of keys and
their frequencies, apply the function to the frequency values, and then
apply a weighted sampling scheme. Our main contribution is the design of composable
sampling sketches that can be tailored to any concave sublinear function of the
frequencies. Our sketch structure size is very close to the desired sample size
and our samples provide statistical guarantees on the estimation quality that
are very close to that of an ideal sample of the same size computed over
aggregated data. Finally, we demonstrate experimentally the simplicity and
effectiveness of our methods.
Comment: Full version of a NeurIPS 2019 paper.
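The costly baseline described in the abstract can be sketched as follows (illustrative only, not the paper's composable sketch; the concave sublinear function here is assumed to be the square root): fully aggregate the key-value pairs, apply the function to each frequency, and draw a weighted sample.

```python
import math
import random
from collections import defaultdict

def aggregate(pairs):
    """Aggregate key-value pairs into a table of keys and their frequencies."""
    freq = defaultdict(float)
    for key, value in pairs:
        freq[key] += value
    return freq

def weighted_sample(freq, f, k, seed=0):
    """Draw k keys with probability proportional to f(frequency)."""
    rng = random.Random(seed)
    keys = list(freq)
    weights = [f(freq[key]) for key in keys]
    return rng.choices(keys, weights=weights, k=k)

pairs = [("a", 1), ("b", 1), ("a", 1), ("c", 1), ("a", 1), ("b", 1)]
freq = aggregate(pairs)                         # frequencies: a=3, b=2, c=1
sample = weighted_sample(freq, math.sqrt, k=4)  # weights sqrt(3), sqrt(2), 1
print(sample)
```

The point of the paper's sampling sketches is to approximate the statistical behavior of this sample without ever materializing the aggregated frequency table, which is what makes the baseline costly on massive distributed data.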